In recent years, generative models known as diffusion models have become increasingly popular, and with good cause.
Thanks to a handful of landmark publications from 2020 and 2021, the world has seen what diffusion models are capable of, such as outperforming GANs on image synthesis.
Most recently, practitioners saw diffusion models at work in DALL-E 2, the image generation model OpenAI released last month.
Many Machine Learning practitioners are undoubtedly curious about the inner workings of Diffusion Models given their recent surge of success.
In this post, we’ll look at the theoretical underpinnings of Diffusion Models, their design, their advantages, and much more. Let’s get going.
What is a Diffusion Model?
Let’s start by figuring out why this model is referred to as a diffusion model.
Diffusion is a term from thermodynamics. A system is out of equilibrium when a substance, such as a scent, is highly concentrated in one location.
For the system to reach equilibrium, diffusion must occur: the molecules of the scent spread from the region of higher concentration throughout the system until it is uniform.
Eventually, diffusion makes everything homogeneous.
Diffusion models are motivated by this non-equilibrium thermodynamic process. They use a Markov chain: a sequence of variables in which each variable's value depends only on the state of the previous one.
During the forward diffusion phase, we take an image and successively add a small amount of noise to it.
After storing the noisier image, we create the next image in the sequence by adding more noise.
This procedure is repeated many times; after enough steps, the result is an image of pure noise.
How, then, can we recover an image from this pure noise?
The diffusion process is reversed using a neural network. The same network with the same weights is used at every step of the backward diffusion process to recover the image at timestep t−1 from the image at timestep t.
To simplify the task further, instead of having the network predict the image directly, one can have it predict the noise at each step, which is then removed from the image.
In either case, the neural network architecture must be chosen so that it preserves the dimensionality of the data.
Deep Dive into Diffusion Models
The components of a diffusion model are a forward process (also known as a diffusion process), in which a datum (often an image) is gradually noised, and a reverse process (also known as a reverse diffusion process), in which noise is converted back into a sample from the target distribution.
When the noise level is low enough, conditional Gaussians can be used to establish the sampling chain transitions in the forward process. An easy parameterization of the forward process results from coupling this knowledge with the Markov assumption:
q(x_{1:T} \mid x_0) := \prod_{t=1}^{T} q(x_t \mid x_{t-1}), \qquad q(x_t \mid x_{t-1}) := \mathcal{N}\left(x_t; \sqrt{1 - \beta_t}\, x_{t-1}, \beta_t I\right)
Here \beta_1, \ldots, \beta_T is a variance schedule (either learned or fixed) which, for sufficiently large T, ensures that x_T is nearly an isotropic Gaussian.
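As a concrete illustration, a single forward transition can be sampled directly from the Gaussian above. Here is a minimal PyTorch sketch (the function name forward_step and the assumption that the image is a tensor scaled to [-1, 1] are illustrative, not from the paper):

```python
import math
import torch

def forward_step(x_prev, beta_t):
    """Sample x_t ~ q(x_t | x_{t-1}) = N(x_t; sqrt(1 - beta_t) * x_{t-1}, beta_t * I)."""
    noise = torch.randn_like(x_prev)
    return math.sqrt(1.0 - beta_t) * x_prev + math.sqrt(beta_t) * noise
```

Applying this step repeatedly with the scheduled beta values gradually destroys the image until only isotropic noise remains.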
The reverse process is where the magic of diffusion models happens. During training, the model learns to reverse this diffusion process in order to generate new data. Starting from pure Gaussian noise p(x_T) := \mathcal{N}(x_T; 0, I), the model learns the joint distribution p_\theta(x_{0:T}) as
p_\theta(x_{0:T}) := p(x_T) \prod_{t=1}^{T} p_\theta(x_{t-1} \mid x_t), \qquad p_\theta(x_{t-1} \mid x_t) := \mathcal{N}\left(x_{t-1}; \mu_\theta(x_t, t), \Sigma_\theta(x_t, t)\right)
where the time-dependent parameters of the Gaussian transitions are learned. Note in particular that the Markov formulation asserts that a given reverse diffusion transition distribution depends only on the previous timestep (or the following timestep, depending on how you look at it):
p_\theta(x_{t-1} \mid x_t) := \mathcal{N}\left(x_{t-1}; \mu_\theta(x_t, t), \Sigma_\theta(x_t, t)\right)
Model Training
A diffusion model is trained by finding the reverse Markov transitions that maximize the likelihood of the training data. In practice, training amounts to minimizing the variational upper bound on the negative log likelihood:
\mathbb{E}\left[-\log p_\theta(x_0)\right] \leq \mathbb{E}_q\left[-\log \frac{p_\theta(x_{0:T})}{q(x_{1:T} \mid x_0)}\right] = \mathbb{E}_q\left[-\log p(x_T) - \sum_{t \geq 1} \log \frac{p_\theta(x_{t-1} \mid x_t)}{q(x_t \mid x_{t-1})}\right] =: L
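For completeness, the bound follows by writing p_\theta(x_0) as an expectation under q and applying Jensen's inequality (since -\log is convex):

-\log p_\theta(x_0) = -\log \int p_\theta(x_{0:T})\, dx_{1:T} = -\log \mathbb{E}_{q(x_{1:T} \mid x_0)}\!\left[\frac{p_\theta(x_{0:T})}{q(x_{1:T} \mid x_0)}\right] \leq \mathbb{E}_{q(x_{1:T} \mid x_0)}\!\left[-\log \frac{p_\theta(x_{0:T})}{q(x_{1:T} \mid x_0)}\right]

Taking the expectation over the data distribution then gives L above.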
Models
Having established the mathematical underpinnings of our objective function, we now need to decide how to implement our Diffusion Model. The only choice required for the forward process is defining the variance schedule, whose values typically increase over the course of the process.
For the reverse process, we must choose how to parameterize the Gaussian distributions and which model architecture to use.
The only constraint on the architecture is that its input and output have the same dimensionality. This underlines the great degree of freedom that Diffusion Models provide.
Below, we’ll go into greater depth about these options.
Forward Process
For the forward process, we must specify the variance schedule. We fix the variances to time-dependent constants rather than learning them, using a linear schedule from
β1 = 10^-4 to βT = 0.02.
Because the variance schedule is fixed, the term L_T in the loss becomes a constant with respect to our learnable parameters, so it can be ignored during training regardless of the specific values chosen.
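For concreteness, the fixed linear schedule and the cumulative products \bar{\alpha}_t used later can be precomputed in a few lines. A minimal PyTorch sketch (the variable names and the choice of T = 1000 steps are illustrative, though common in practice):

```python
import torch

T = 1000                                     # number of diffusion steps (a common choice)
betas = torch.linspace(1e-4, 0.02, T)        # beta_1 = 1e-4, ..., beta_T = 0.02
alphas = 1.0 - betas                         # alpha_t = 1 - beta_t
alphas_bar = torch.cumprod(alphas, dim=0)    # alpha_bar_t = prod_{s <= t} alpha_s
```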
Reverse Process
We now go over the decisions needed to define the reverse process. Remember how we described the reverse Markov transitions as Gaussian:
p_\theta(x_{t-1} \mid x_t) := \mathcal{N}\left(x_{t-1}; \mu_\theta(x_t, t), \Sigma_\theta(x_t, t)\right)
We must now decide on their functional forms. Although more sophisticated parameterizations exist, we simply set

\Sigma_\theta(x_t, t) = \sigma_t^2 I, \qquad \sigma_t^2 = \beta_t
In other words, we assume the multivariate Gaussian is a product of independent Gaussians with identical variance, a variance value that can change over time. These variances are set equal to the forward process variance schedule.
As a result of this new formulation, we have:
p_\theta(x_{t-1} \mid x_t) := \mathcal{N}\left(x_{t-1}; \mu_\theta(x_t, t), \Sigma_\theta(x_t, t)\right) = \mathcal{N}\left(x_{t-1}; \mu_\theta(x_t, t), \sigma_t^2 I\right)
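With this choice, a single reverse step is just a sample from a fixed-variance Gaussian around the predicted mean. A minimal sketch, assuming mu is the mean \mu_\theta(x_t, t) already computed by the network:

```python
import math
import torch

def reverse_step(mu, beta_t, t):
    """Sample x_{t-1} ~ N(mu_theta(x_t, t), sigma_t^2 * I) with sigma_t^2 = beta_t."""
    if t == 1:
        return mu  # by convention, no noise is added on the final step
    return mu + math.sqrt(beta_t) * torch.randn_like(mu)
```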
This parameterization, together with reparameterizing \mu_\theta so that the network predicts the noise \epsilon added at each step rather than the mean directly, results in the alternate loss function shown below, which the authors found to produce more stable training and better results:
L_{\text{simple}}(\theta) := \mathbb{E}_{t, x_0, \epsilon}\left[\left\lVert \epsilon - \epsilon_\theta\!\left(\sqrt{\bar{\alpha}_t}\, x_0 + \sqrt{1 - \bar{\alpha}_t}\, \epsilon,\; t\right) \right\rVert^2\right]
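One training step under this simplified objective can be sketched as follows. This is a minimal PyTorch sketch, assuming model is a noise-prediction network \epsilon_\theta(x_t, t) and alphas_bar is the 1-D tensor of cumulative products \bar{\alpha}_t computed earlier:

```python
import torch

def simple_loss(model, x0, alphas_bar):
    """L_simple: MSE between the noise actually added and the noise the model predicts."""
    b = x0.shape[0]
    t = torch.randint(1, len(alphas_bar) + 1, (b,), device=x0.device)          # timestep sampled uniformly from {1, ..., T}
    a_bar = alphas_bar.to(x0.device)[t - 1].view(b, *([1] * (x0.dim() - 1)))   # reshape for broadcasting over image dims
    eps = torch.randn_like(x0)                                                 # the true noise epsilon
    x_t = torch.sqrt(a_bar) * x0 + torch.sqrt(1.0 - a_bar) * eps               # noised image produced in a single jump
    return ((eps - model(x_t, t)) ** 2).mean()
```

Note that the closed form \sqrt{\bar{\alpha}_t} x_0 + \sqrt{1 - \bar{\alpha}_t}\, \epsilon lets us noise an image to any timestep in one step, rather than iterating the forward chain.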
The authors also draw connections between this formulation of diffusion models and Langevin-based score-matching generative models. As with the independent and parallel development of wave mechanics and matrix mechanics, which turned out to be two equivalent formulations of the same phenomena, Diffusion Models and Score-Based models appear to be two sides of the same coin.
Network Architecture
Although our simplified loss function trains a model εθ, we still have not decided on this model's architecture. Keep in mind that the only requirement is that the model's input and output have the same dimensions.
Given this constraint, it is probably not surprising that U-Net-like architectures are frequently used for image diffusion models.
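As a toy illustration of the "same input and output dimensions" requirement, here is a heavily simplified, U-Net-flavoured PyTorch sketch. Real implementations add residual blocks, attention, and timestep embeddings; the layer sizes and names here are arbitrary assumptions, and even spatial dimensions are assumed:

```python
import torch
import torch.nn as nn

class TinyUNet(nn.Module):
    """Minimal encoder-decoder with one skip connection; output shape matches input shape."""
    def __init__(self, channels=3, hidden=64):
        super().__init__()
        self.down = nn.Sequential(nn.Conv2d(channels, hidden, 3, stride=2, padding=1), nn.ReLU())
        self.mid = nn.Sequential(nn.Conv2d(hidden, hidden, 3, padding=1), nn.ReLU())
        self.up = nn.ConvTranspose2d(hidden, hidden, 4, stride=2, padding=1)
        self.out = nn.Conv2d(hidden + channels, channels, 3, padding=1)  # skip connection concatenated below

    def forward(self, x, t=None):  # t would normally be embedded and injected into every block
        h = self.up(self.mid(self.down(x)))
        return self.out(torch.cat([h, x], dim=1))
```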
The reverse process transitions defined above operate on continuous values via conditional Gaussian distributions, but the goal is ultimately to produce an image made up of integer pixel values. We therefore need a way of obtaining discrete (log) likelihoods for each possible pixel value, for every pixel.
This is accomplished by setting the last transition of the reverse diffusion chain to an independent discrete decoder, which estimates the likelihood of an image x0 given x1:
p_\theta(x_0 \mid x_1) = \prod_{i=1}^{D} \int_{\delta_-(x_0^i)}^{\delta_+(x_0^i)} \mathcal{N}\left(x; \mu_\theta^i(x_1, 1), \sigma_1^2\right) dx

\delta_+(x) = \begin{cases} \infty & \text{if } x = 1 \\ x + \frac{1}{255} & \text{if } x < 1 \end{cases} \qquad \delta_-(x) = \begin{cases} -\infty & \text{if } x = -1 \\ x - \frac{1}{255} & \text{if } x > -1 \end{cases}
where the superscript i denotes the extraction of a single coordinate and D is the dimensionality of the data.
The objective here is to determine the likelihood of each possible integer value for a given pixel, given the distribution over values for that pixel at timestep t = 1.
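A sketch of how such a decoder probability could be evaluated in practice, integrating the Gaussian over each pixel's bin via the normal CDF (pixel values are assumed to be scaled to [-1, 1] with bin width 2/255; the function name is illustrative):

```python
import torch

def discrete_decoder_log_likelihood(x0, mu, sigma):
    """Log-likelihood of integer-valued pixels x0 (scaled to [-1, 1]) under N(mu, sigma^2),
    integrating the Gaussian density over each pixel's bin of width 2/255."""
    dist = torch.distributions.Normal(mu, sigma)
    upper = torch.where(x0 >= 1.0, torch.full_like(x0, float("inf")), x0 + 1.0 / 255)
    lower = torch.where(x0 <= -1.0, torch.full_like(x0, float("-inf")), x0 - 1.0 / 255)
    probs = dist.cdf(upper) - dist.cdf(lower)                 # mass assigned to each pixel's bin
    return torch.log(probs.clamp(min=1e-12)).sum(dim=tuple(range(1, x0.dim())))
```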
Final Objective
The authors found that the best results came from predicting the noise component of an image at a given timestep. Ultimately, they use the following objective:
L_{\text{simple}}(\theta) := \mathbb{E}_{t, x_0, \epsilon}\left[\left\lVert \epsilon - \epsilon_\theta\!\left(\sqrt{\bar{\alpha}_t}\, x_0 + \sqrt{1 - \bar{\alpha}_t}\, \epsilon,\; t\right) \right\rVert^2\right]
In the following image, the training and sampling procedures for our diffusion model are concisely depicted:
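To tie the pieces together, here is a compact sketch of the sampling loop. It follows the standard DDPM expression for the mean \mu_\theta in terms of the predicted noise (not derived in this excerpt); model, betas, alphas, and alphas_bar are the hypothetical objects defined in the earlier sketches, assumed to live on the same device:

```python
import torch

@torch.no_grad()
def sample(model, shape, betas, alphas, alphas_bar, device="cpu"):
    """Generate images by starting from pure noise x_T and applying the learned reverse transitions."""
    x = torch.randn(shape, device=device)                  # x_T ~ N(0, I)
    for t in range(len(betas), 0, -1):
        t_batch = torch.full((shape[0],), t, device=device, dtype=torch.long)
        eps_hat = model(x, t_batch)                        # predicted noise epsilon_theta(x_t, t)
        mu = (x - betas[t - 1] / torch.sqrt(1.0 - alphas_bar[t - 1]) * eps_hat) / torch.sqrt(alphas[t - 1])
        z = torch.randn_like(x) if t > 1 else torch.zeros_like(x)   # no noise on the last step
        x = mu + torch.sqrt(betas[t - 1]) * z              # sigma_t^2 = beta_t
    return x
```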
Benefits of Diffusion Models
As noted above, research on diffusion models has exploded in recent years. Inspired by non-equilibrium thermodynamics, Diffusion Models currently deliver State-of-the-Art image quality.
Beyond cutting-edge image quality, Diffusion Models offer a variety of other advantages, such as not requiring adversarial training.
The drawbacks of adversarial training are widely known, hence it is often preferable to choose non-adversarial alternatives with equivalent performance and training effectiveness.
Diffusion models also provide the advantages of scalability and parallelizability in terms of training effectiveness.
Although Diffusion Models appear to generate outcomes seemingly out of thin air, the basis for these results is laid by a number of thoughtful and interesting mathematical decisions and subtleties, and industry best practices are still being developed.
Conclusion
In conclusion, researchers have demonstrated high-quality image synthesis results using diffusion probabilistic models, a class of latent variable models inspired by ideas from non-equilibrium thermodynamics.
They have achieved tremendous results thanks to their State-of-the-Art outputs and non-adversarial training, and given how young the field is, further advances can be anticipated in the coming years.
In particular, diffusion models have proven crucial to the performance of advanced models like DALL-E 2.
Here you can access the complete research.