[TOC]

  1. Title: GLIDE: Towards Photorealistic Image Generation and Editing with Text-Guided Diffusion Models
  2. Author: Alex Nichol et. al.
  3. Publish Year: Dec 2021
  4. Review Date: Jan 2022

Summary of paper

In author’s previous work, the diffusion model can achieve photorealism in the class-conditional setting by augmenting with classifier guidance, a technique which allows diffusion models to condition on a classifier’s labels.

  • The classifier is first trained on noised images, and during the diffusion sampling process, gradients from the classifier are used to guide the output sample towards the label.

classifier details

  • once you have class label, I add additional probability distribution in the formula to guide the model
    • image-20220113003620491
    • image-20220113004021624

Algorithm of previous work (classifier guided sampling)

image-20220113004351499

So this work extend label classifier to Natural language

  • by encoding texts into embeddings and feed into each diffusion layer

Some key terms

diffusion model

diffusion models sample from a distribution by reversing a gradual noising process

  • sampling starts with noise x_T and produce gradually less noisy samples x_T-1, x_T-2, .. until reach a final sample x_0.
  • each timestep t corresponds to a certain noise level. and x_t can be thought of as a mixture of a signal x_0 with some noise e where the signal to noise ratio is determined by timestep t.
  • a diffusion model learns to produce a slightly more “denoised” x_t-1 from x_t
  • image-20220112175714821

Theory behind diffusion model

  • we can add noise again and again and eventually we can get random noise image
  • image-20220112184338602
  • then we try to invert it.
  • image-20220112184627209

Improvements

  1. Rather than model the denoised image’s mean, it predict the noise itself.
  2. fix the covariance of the gaussian distribution is good enough
    1. further more, learning a interpolation covariance parameter is even better than fixed one
      1. image-20220113001238307
  3. give different weight to gradient at different step t (the first few steps are much more important)
    1. image-20220113002014975

Good things about the paper (one paragraph)

Major comments

Minor comments

Check the paper reading video:

Incomprehension

Potential future work

“text to image” task has some similarity with “text to goal state/desired action” task for Text-based RL environment.

diffusion model and denoising process share some similarity with the “training with masking” idea.