A preamble, sort of
As we write this – it’s April 2023 – it is hard to overstate the attention going to, the hopes associated with, and the fears surrounding deep-learning-powered image and text generation. Impacts on society, politics, and human well-being deserve more than a short, dutiful paragraph. We therefore defer appropriate treatment of this topic to dedicated publications, and would just like to say one thing: The more you know, the better; the less you’ll be impressed by over-simplifying, context-neglecting statements made by public figures; the easier it will be for you to take your own stand on the subject. That said, let’s begin.
In this post, we present an R torch implementation of Denoising Diffusion Implicit Models (J. Song, Meng, and Ermon (2020)). The code is on GitHub, and comes with an extensive README detailing everything from mathematical background via implementation choices and code organization to model training and sample generation. Here, we give a high-level overview, situating the algorithm in the broader context of generative deep learning. For any details you are specifically interested in, please consult the README!
Diffusion Models in Context: Generative Deep Learning
In generative deep learning, models are trained to generate new examples that could plausibly come from some known distribution: the distribution of landscape images, say, or Polish verse. While diffusion is all the hype today, the last decade had much attention going to other approaches, or families of approaches. Let’s quickly name a few of the most talked-about, and give a brief characterization.
First, diffusion models themselves. Diffusion, the general term, refers to entities (molecules, say) spreading from areas of higher concentration to lower-concentration ones, thereby increasing entropy. In other words, information is lost. In diffusion models, this loss of information is intentional: In a “forward” process, a sample is taken and successively transformed into (Gaussian, usually) noise. A “reverse” process is then supposed to take a noise instance and sequentially de-noise it until it looks as though it came from the original distribution. But surely we can’t reverse the arrow of time? No, and this is where deep learning comes in: During the forward process, the network learns what needs to be done for “reversal”.
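To make the forward process concrete: in many diffusion variants, the corrupted version of a sample at a given noise level can be computed in closed form. Here is a minimal sketch in R torch – the function and parameter names are our own, not code from the repository:

```r
library(torch)

# Closed-form corruption: mix a clean sample x0 with Gaussian noise,
# weighted by a cumulative signal rate alpha_bar in (0, 1).
# (Illustrative only; names are ours, not the repository's.)
diffuse <- function(x0, alpha_bar) {
  noise <- torch_randn_like(x0)
  sqrt(alpha_bar) * x0 + sqrt(1 - alpha_bar) * noise
}

x0 <- torch_randn(1, 3, 32, 32)         # a stand-in "image"
x_half <- diffuse(x0, alpha_bar = 0.5)  # partly corrupted
x_late <- diffuse(x0, alpha_bar = 0.01) # nearly pure noise
```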
A quite different idea underlies what goes on in GANs, Generative Adversarial Networks. In a GAN, we have two agents at play, each trying to outsmart the other. One tries to generate samples that look as realistic as possible; the other puts its energy into spotting the fakes. Ideally, both improve over time, resulting in the desired output (as well as a “regulator” – the discriminator – that is not bad, but always a step behind).
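As an illustration of that two-agent setup – a toy sketch of our own, not tied to any particular GAN paper – here is what one alternating training step could look like in R torch:

```r
library(torch)

# Generator maps noise to samples; discriminator scores "realness".
g <- nn_sequential(nn_linear(16, 64), nn_relu(), nn_linear(64, 2))
d <- nn_sequential(nn_linear(2, 64), nn_relu(), nn_linear(64, 1))
opt_g <- optim_adam(g$parameters)
opt_d <- optim_adam(d$parameters)

real <- torch_randn(32, 2)  # stand-in "real" data
z <- torch_randn(32, 16)    # noise fed to the generator

# Discriminator step: label real samples 1, generated samples 0.
opt_d$zero_grad()
loss_d <- nnf_binary_cross_entropy_with_logits(d(real), torch_ones(32, 1)) +
  nnf_binary_cross_entropy_with_logits(d(g(z)$detach()), torch_zeros(32, 1))
loss_d$backward()
opt_d$step()

# Generator step: try to make the discriminator say "real".
opt_g$zero_grad()
loss_g <- nnf_binary_cross_entropy_with_logits(d(g(z)), torch_ones(32, 1))
loss_g$backward()
opt_g$step()
```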
Then, there are VAEs: Variational Autoencoders. In a VAE, as in a GAN, there are two networks (here called encoder and decoder). However, instead of each having its own loss to minimize, training is subject to a single – though composite – loss. One component makes sure that reconstructed samples closely resemble the input; the other, that the latent code conforms to pre-imposed constraints.
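That composite loss can be made concrete with a small sketch. Assuming an encoder that outputs the mean and log-variance of a diagonal Gaussian posterior – our own illustration, not the standard formulation of any particular library – the two components could look like this:

```r
library(torch)

# Composite VAE loss: reconstruction term plus KL divergence of the
# latent code from a standard normal prior. (Illustrative sketch;
# mu and log_var are assumed to come from an encoder network.)
vae_loss <- function(x, x_reconstructed, mu, log_var) {
  recon <- nnf_mse_loss(x_reconstructed, x, reduction = "sum")
  # KL(q(z|x) || N(0, I)), in closed form for a diagonal Gaussian
  kl <- -0.5 * torch_sum(1 + log_var - mu^2 - torch_exp(log_var))
  recon + kl
}
```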
Finally, let’s mention (normalizing) flows (although these tend to be used for a different purpose, see next section). A flow is a sequence of differentiable, invertible mappings from data to some “nice” distribution, “nice” meaning “something we can easily sample, or obtain a likelihood from”. With flows, as with diffusion, learning happens during the forward stage. Invertibility, as well as differentiability, then assure that we can go back to the input distribution we started from.
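To see the mechanics on the simplest possible flow step – a toy affine map of our own, far from a real flow architecture – here is the forward transformation together with the log-determinant of its Jacobian (the quantity the change-of-variables formula needs), plus a check that the map really inverts:

```r
library(torch)

# A single invertible "flow step": y = scale * x + shift.
scale <- torch_tensor(2)
shift <- torch_tensor(0.5)

forward <- function(x) {
  list(y = scale * x + shift,
       # log |det J| of an elementwise affine map
       log_det = torch_log(torch_abs(scale)))
}
inverse <- function(y) (y - shift) / scale

x <- torch_randn(5)
out <- forward(x)
max(as.numeric(torch_abs(inverse(out$y) - x)))  # ~0: the map really inverts
```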
Before we dive into diffusion, we outline – very informally – some aspects to consider when mentally mapping the space of generative models.
Generative Models: If you wanted to draw a mind map…
Above, we gave rather technical characterizations of the different approaches: What is the overall setup, what do we optimize for… Staying on the technical side, we could look at established categorizations such as likelihood-based vs. not-likelihood-based models. Likelihood-based models directly parameterize the data distribution; the parameters are then fitted by maximizing the likelihood of the data under the model. Of the above-listed architectures, this is the case with VAEs and flows; it is not with GANs.
But we can also take a different perspective – that of purpose. Firstly, are we interested in representation learning? That is, would we like to condense the space of samples into a sparser one, one that exposes underlying features and gives hints at useful categorization? If so, VAEs are the classical candidates to look at.
Alternatively, are we mainly interested in generation, and would like to synthesize samples corresponding to different levels of coarse-graining? Then diffusion algorithms are a good choice. It has been shown (Dieleman 2022) that
(…) representations obtained using different levels of noise tend to correspond to different feature scales: the higher the noise level, the larger the scale of features captured.
As a final example: What if we aren’t interested in synthesis, but would like to assess whether a given piece of data could plausibly be part of some distribution? If so, flows might be an option.
Zooming in: Diffusion models
As with every deep-learning approach, diffusion models constitute a heterogeneous family. Here, let us just name a few of the most popular members.
When we said above that the idea of diffusion models was to gradually transform the input to noise and then gradually denoise it again, we left open how this transformation is operationalized. This is actually one of the areas where competing approaches tend to differ.
Y. Song et al. (2020), for example, make use of a stochastic differential equation (SDE) that maintains the desired distribution during the information-destroying forward phase. In stark contrast, other approaches, inspired by Ho, Jain, and Abbeel (2020), rely on Markov chains to realize state transitions. The variant introduced here – J. Song, Meng, and Ermon (2020) – keeps the same spirit, but improves on efficiency.
Our implementation – an overview
The README provides a very thorough introduction, covering (almost) everything from theoretical background via implementation details to training and debugging. Here, we just outline a few basic facts.
As outlined above, all the work happens during the forward stage. The network takes two inputs: the images, as well as information about the signal-to-noise ratio to be applied at every step in the corruption process. That information may be encoded in various ways, and is then embedded, in some form, into a higher-dimensional space more conducive to learning. How exactly this is done varies between types of schedule and embedding; one popular option is sketched below.
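One such option – used, for example, in transformer-style position encodings and in many diffusion implementations – is a sinusoidal embedding. A minimal sketch, our own simplification rather than the repository’s code:

```r
library(torch)

# Sinusoidal embedding: map a scalar noise level to a higher-dimensional
# vector of sines and cosines at geometrically spaced frequencies.
# (Illustrative only; names and frequency range are our own choices.)
sinusoidal_embedding <- function(noise_level, dim = 32) {
  freqs <- torch_exp(torch_linspace(0, log(1000), dim %/% 2))
  angles <- noise_level * freqs
  torch_cat(list(torch_sin(angles), torch_cos(angles)))
}

emb <- sinusoidal_embedding(torch_tensor(0.7))
emb$shape  # 32
```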
Architecture-wise, with inputs as well as intended outputs being images, the main workhorse is a U-Net. It forms part of a top-level model that, for each input image, creates corrupted versions corresponding to sampled noise levels, and runs the U-Net on them. From what is returned, it tries to deduce the noise level that was governing each instance. Training then consists in getting those estimates to improve.
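Put together, a single training step of such a top-level model could look roughly like this. This is a strong simplification of our own, with `unet` standing in for the actual U-Net module; one common parameterization, shown here, has the network predict the noise component itself:

```r
library(torch)

# One sketched training step: corrupt a batch at random noise levels,
# run the U-Net, and score how well the injected noise is recovered.
# (`unet` is a hypothetical stand-in, not the repository's model.)
train_step <- function(unet, optimizer, images) {
  batch_size <- images$shape[1]
  # one random signal rate per image, in (0, 1)
  alpha_bar <- torch_rand(batch_size)$view(c(batch_size, 1, 1, 1))
  noise <- torch_randn_like(images)
  corrupted <- torch_sqrt(alpha_bar) * images +
    torch_sqrt(1 - alpha_bar) * noise
  optimizer$zero_grad()
  predicted_noise <- unet(corrupted, alpha_bar)
  loss <- nnf_mse_loss(predicted_noise, noise)
  loss$backward()
  optimizer$step()
  loss
}
```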
Once the model has been trained, the reverse process – image generation – is straightforward: It consists in recursively de-noising according to the (known) noise-rate schedule. Schematically, the whole process then amounts to a simple loop over noise levels; a sketch follows below.
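Here is what such a loop could look like, in the deterministic DDIM style. Again, this is our own schematic: `unet` is a stand-in, and the schedule of signal rates is freely invented.

```r
library(torch)

# Deterministic DDIM-style sampling sketch: start from pure noise and
# walk along a schedule of increasing signal rates. At each step, use
# the U-Net's noise estimate to form a guess of the clean image, then
# re-corrupt that guess to the next (lower) noise level.
generate <- function(unet, signal_rates = seq(0.01, 0.99, length.out = 20)) {
  x <- torch_randn(1, 3, 32, 32)
  with_no_grad({
    for (i in seq_along(signal_rates)) {
      a <- signal_rates[i]
      pred_noise <- unet(x, a)
      # clean image implied by the current noise estimate
      x0_hat <- (x - sqrt(1 - a) * pred_noise) / sqrt(a)
      a_next <- if (i < length(signal_rates)) signal_rates[i + 1] else 1
      x <- sqrt(a_next) * x0_hat + sqrt(1 - a_next) * pred_noise
    }
  })
  x
}
```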
In conclusion, this post, by itself, really is just an invitation. To find out more, check out the GitHub repository. Should you need additional motivation to do so, here are some pictures of flowers.
Thank you for reading!
Dieleman, Sander. 2022. “Diffusion Models Are Autoencoders.” https://benanne.github.io/2022/01/31/diffusion.html.
Ho, Jonathan, Ajay Jain, and Pieter Abbeel. 2020. “Denoising Diffusion Probabilistic Models.” https://doi.org/10.48550/ARXIV.2006.11239.
Song, Jiaming, Chenlin Meng, and Stefano Ermon. 2020. “Denoising Diffusion Implicit Models.” https://doi.org/10.48550/ARXIV.2010.02502.
Song, Yang, Jascha Sohl-Dickstein, Diederik P. Kingma, Abhishek Kumar, Stefano Ermon, and Ben Poole. 2020. “Score-Based Generative Modeling via Stochastic Differential Equations.” CoRR abs/2011.13456. https://arxiv.org/abs/2011.13456.