[TOC]

  1. Title: Google Video Diffusion Models
  2. Author: Jonathan Ho et al.
  3. Publish Year: 22 Jun 2022
  4. Review Date: Thu, Sep 22, 2022

Summary of paper

Motivation

  • proposing a diffusion model for video generation that shows very promising initial results

Contribution

  • this is an extension of image diffusion models to video generation
  • they introduce a new conditional sampling technique for spatial and temporal video extension that performs better than previously proposed methods.

Some key terms

Diffusion model

  • A diffusion model specified in continuous time is a generative model with latents $z = \{z_t \mid t \in [0, 1]\}$, produced by progressively noising the data $x$ through a forward process $q(z_t \mid x)$
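  • a minimal sketch of the forward process in this continuous-time formulation (my transcription of the standard parameterisation, with $\lambda_t$ the log signal-to-noise ratio; treat the exact notation as an assumption rather than a quote from the paper):

$$
q(z_t \mid x) = \mathcal{N}\big(z_t;\ \alpha_t x,\ \sigma_t^2 I\big), \qquad t \in [0, 1], \qquad \lambda_t = \log \frac{\alpha_t^2}{\sigma_t^2}
$$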

Training diffusion model

  • learning to reverse the forward process for generation can be reduced to learning to denoise $z_t \sim q(z_t|x)$ into an estimate $\hat x_\theta (z_t, \lambda_t) \approx x$ for all $t$ (we will drop the dependence on $\lambda_t$ to simplify notation). We train this denoising model $\hat x_\theta$ using a weighted MSE loss
  • this reduction of generation to denoising can be justified as optimising a weighted variational lower bound on the data log likelihood under the diffusion model.
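  • written out, the weighted MSE objective above takes the form below (my reconstruction in the $x$-prediction parameterisation; $w(\lambda_t)$ is the weighting function and the expectation runs over data, noise and time):

$$
\mathbb{E}_{\epsilon \sim \mathcal{N}(0, I),\; t \sim \mathcal{U}[0, 1]} \Big[ w(\lambda_t)\, \big\lVert \hat{x}_\theta(z_t) - x \big\rVert_2^2 \Big], \qquad z_t = \alpha_t x + \sigma_t \epsilon
$$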

Effective sampling with the new method for conditional generation – predictor-corrector sampler


  • in the conditional generation setting, the data $x$ is equipped with a conditional signal $c$, which may represent a text caption. To train a diffusion model to fit $p(x|c)$, the only modification that needs to be made is to provide $c$ to the model as $\hat x_\theta (z_t, c)$
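  • for spatial or temporal extension, the data splits into given frames $x^a$ and frames to be generated $x^b$; the paper's reconstruction-guided ("gradient") conditioning adjusts the denoised estimate of $x^b$ with a gradient term that keeps the model's own reconstruction of $x^a$ close to the given frames, and this adjusted denoiser is then used inside the predictor (ancestral) and corrector (Langevin) sampling steps. A rough sketch with guidance weight $w_r$ (my transcription, so treat the exact weighting coefficient as approximate):

$$
\tilde{x}^b_\theta(z_t) = \hat{x}^b_\theta(z_t) - \frac{w_r \alpha_t}{2} \nabla_{z^b_t} \big\lVert x^a - \hat{x}^a_\theta(z_t) \big\rVert_2^2
$$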

Improvements to sample quality by classifier-free guidance

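At sampling time, classifier-free guidance extrapolates the conditional prediction away from the unconditional one; in $\epsilon$-prediction form with guidance weight $w$ (my transcription of the standard rule, where $\epsilon_\theta(z_t)$ denotes the prediction with the conditioning $c$ dropped):

$$
\tilde{\epsilon}_\theta(z_t, c) = (1 + w)\, \epsilon_\theta(z_t, c) - w\, \epsilon_\theta(z_t)
$$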

Video diffusion model

Architecture and condition

  • the standard architecture for image diffusion models is a U-Net
    • which is a neural network architecture constructed as a spatial downsampling pass followed by a spatial upsampling pass with skip connections to the downsampling pass activations.
  • The network is built from layers of 2D convolutional residual blocks, and each convolutional block is followed by a spatial attention block.
    • Conditioning information, such as $c$, is provided to the network in the form of an embedding vector (e.g. a sentence embedding, https://huggingface.co/sentence-transformers/clip-ViT-L-14), added into each residual block (they find it helpful to process these embedding vectors with several MLP layers before adding them)
  • For video data, we use a particular type of 3D U-Net that is factorised over space and time.
    • space-only 3D convolution
      • for instance, we change each 3x3 convolution into a 1x3x3 convolution (the first axis indexes video frames, the second and third index the spatial height and width)
    • the attention in each spatial attention block remains as attention over space. (i.e., the first axis is treated as a batch axis)
    • after each spatial attention block, we further insert a temporal attention block that performs attention over the first axis and treats the spatial axes as batch axes.
    • this separation is good for computational efficiency (see the sketch after this list)
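Below is a minimal PyTorch sketch of one factorised space-time block as described above. This is not the paper's implementation: the class name, layer sizes, and the use of `nn.MultiheadAttention` are my own placeholders, and the usual residual-block details (group norms, nonlinearities, conditioning-embedding injection) are omitted for brevity.

```python
# Minimal sketch of a factorised space-time block: a space-only 1x3x3 conv,
# spatial attention with frames folded into the batch axis, and temporal
# attention with pixels folded into the batch axis.
# Tensor layout: (batch, channels, frames, height, width).
import torch
import torch.nn as nn


class FactorisedSpaceTimeBlock(nn.Module):
    def __init__(self, channels: int, num_heads: int = 4):
        super().__init__()
        # "2D" convolution applied per frame: kernel size 1 over time, 3x3 over space
        self.conv = nn.Conv3d(channels, channels, kernel_size=(1, 3, 3), padding=(0, 1, 1))
        self.spatial_attn = nn.MultiheadAttention(channels, num_heads, batch_first=True)
        self.temporal_attn = nn.MultiheadAttention(channels, num_heads, batch_first=True)
        self.norm_s = nn.LayerNorm(channels)
        self.norm_t = nn.LayerNorm(channels)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, c, t, h, w = x.shape
        x = x + self.conv(x)  # space-only convolution, residual connection

        # Spatial attention: treat the frame axis as part of the batch.
        s = x.permute(0, 2, 3, 4, 1).reshape(b * t, h * w, c)   # (B*T, H*W, C)
        s_in = self.norm_s(s)
        s = s + self.spatial_attn(s_in, s_in, s_in)[0]
        x = s.reshape(b, t, h, w, c).permute(0, 4, 1, 2, 3)

        # Temporal attention: treat the spatial axes as part of the batch.
        u = x.permute(0, 3, 4, 2, 1).reshape(b * h * w, t, c)   # (B*H*W, T, C)
        u_in = self.norm_t(u)
        u = u + self.temporal_attn(u_in, u_in, u_in)[0]
        x = u.reshape(b, h, w, t, c).permute(0, 4, 3, 1, 2)     # back to (B, C, T, H, W)
        return x


# e.g. 2 videos, 64 channels, 8 frames, 16x16 spatial resolution
block = FactorisedSpaceTimeBlock(channels=64)
out = block(torch.randn(2, 64, 8, 16, 16))  # -> (2, 64, 8, 16, 16)
```

Because each temporal attention only sees a single spatial location, the cost of attending over frames stays linear in the number of pixels, rather than paying the quadratic cost of full space-time attention over every frame-pixel pair.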

(figure: the factorised space-time 3D U-Net architecture)

Text-conditioned video generation

Hyperparameters

(table: training hyperparameters for the text-conditioned video model)

Potential future work

Maybe we can also use this training method and architecture to pretrain our image-action-text multimodal model.

We could also combine this with a latent diffusion model to improve computational efficiency.