Rectified Flow Toy

Rectified Flow is a technique for improving generative models such as image generators. Below is an interactive visualization of how it works.

Consider building a generative model for images of cats. A naive way to do this is to make a list of pairs (noise_image, cat_image) and train a deep neural net in the usual supervised way. You generate cat images by constructing a random noise image and plugging it into the model.
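To make this concrete, here's a minimal sketch of the naive approach, using 2-D points in place of images so it runs in seconds (the PyTorch model, sizes, and training details are illustrative choices, not from the original):

```python
# Naive approach: pair each "image" with a fixed, arbitrary noise sample
# and regress noise -> image directly.
import torch
import torch.nn as nn

dim, n = 2, 512                # 2-D points stand in for images in this toy
ys = torch.randn(n, dim)       # the "cat images": samples from a target distribution
xs0 = torch.rand(n, dim)       # arbitrary random noise paired with them

model = nn.Sequential(nn.Linear(dim, 64), nn.ReLU(), nn.Linear(64, dim))
opt = torch.optim.Adam(model.parameters(), lr=1e-3)

for step in range(2000):
    loss = ((model(xs0) - ys) ** 2).mean()   # plain supervised regression
    opt.zero_grad(); loss.backward(); opt.step()

# Generate by plugging in fresh noise.
with torch.no_grad():
    samples = model(torch.rand(16, dim))
```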

This doesn't work very well because, in addition to learning the distribution of cat images, it has to learn the completely arbitrary way they are mapped to noise images. Since the noise images are random anyway, what if we could choose them differently so they mapped more naturally to cat images? Rectified flow is a procedure for doing that.

Some notation: Let's call the noise images $x_i$ and the cat images $y_i$. The subscript $i$ indexes into the training set of images. We start by choosing random $x_i^0$s, which we'll replace with better ones $x_i^1$, etc.

Since we've made $x$ and $y$ belong to the same vector space (such as 256x256x3 RGB images), we can talk about the difference $x_i - y_i$ and about the path between them. We can interpolate points between them, such as $t x_i + (1-t) y_i$.

The procedure starts by training a temporary model to follow the reverse path from $y_i$ to $x_i^0$. We can do this by training on the mapping from random interpolations to the difference vector:

$$ t x_i^0 + (1-t) y_i \to x_i^0 - y_i \quad \forall\, t \in [0,1] $$
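Continuing the toy sketch above, training that temporary model might look like the following; I'm assuming the network also receives $t$ as an input, which the formula leaves implicit:

```python
# Temporary "velocity" model: given a point on the straight path between
# y_i and x_i^0, plus t, predict the difference vector x_i^0 - y_i.
v_net = nn.Sequential(nn.Linear(dim + 1, 64), nn.ReLU(), nn.Linear(64, dim))
opt = torch.optim.Adam(v_net.parameters(), lr=1e-3)

for step in range(5000):
    t = torch.rand(n, 1)                    # random t in [0, 1] for each pair
    z = t * xs0 + (1 - t) * ys              # interpolated point on the path
    pred = v_net(torch.cat([z, t], dim=1))
    loss = ((pred - (xs0 - ys)) ** 2).mean()
    opt.zero_grad(); loss.backward(); opt.step()
```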

We can now use this network in a differential equation solver to trace the path from $y_i$ to $x_i$, by following the flow for 1 unit. This would exactly reproduce the path if there were just one image, but with many images we should expect paths to nearly intersect. Because the learned flow field can only assign one velocity to each point, it averages the directions of paths that pass close together, so both retraced paths deflect and end at different points. We call these endpoints $x_i^1$.
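A fixed-step Euler solver is enough for a sketch (the step count here is an arbitrary choice):

```python
# Trace each path from y_i (t = 0) toward the noise side (t = 1) by
# following dz/dt = v_net(z, t). Near-crossings deflect the paths,
# which is what produces the new endpoints x_i^1.
steps = 100
dt = 1.0 / steps
z = ys.clone()
with torch.no_grad():
    for k in range(steps):
        t = torch.full((n, 1), k * dt)
        z = z + dt * v_net(torch.cat([z, t], dim=1))
xs1 = z   # the rectified noise samples x_i^1
```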

The new set of points $x_i^1$ will have a distribution similar to the original $x_i^0$ (i.e., uniform random), but the $x_i^1 \to y_i$ mapping is easier for a model to learn. Choosing a random noise image and plugging it into this model should generate a better cat image than plugging it into a model trained on $x_i^0 \to y_i$.
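The final step is just the naive training from before, but on the rectified pairs; sketched here for completeness:

```python
# Retrain the direct noise -> image model on the rectified pairs x_i^1 -> y_i.
final = nn.Sequential(nn.Linear(dim, 64), nn.ReLU(), nn.Linear(64, dim))
opt = torch.optim.Adam(final.parameters(), lr=1e-3)

for step in range(2000):
    loss = ((final(xs1) - ys) ** 2).mean()
    opt.zero_grad(); loss.backward(); opt.step()

# Generate as before: fresh noise in, (hopefully sharper) samples out.
with torch.no_grad():
    samples = final(torch.rand(16, dim))
```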

Here's an interactive visualization. You can drag the image samples $y_i$ and the original noise samples $x_i^0$ around. The gray lines link corresponding pairs. Notice that they often cross. The intermediate points show the process of following the flow, leading to the new random samples $x_i^1$. You would then train the final model to map $x_i^1$ to $y_i$.



Further reading

An Introduction to Flow Matching

The paradox of diffusion distillation

Perspectives on diffusion