[TOC]
- Title: Google Video Diffusion Models
- Author: Jonathan Ho et al.
- Publish Year: 22 Jun 2022
- Review Date: Thu, Sep 22, 2022
Summary of paper
Motivation
- proposing a diffusion model for video generation that shows very promising initial results
Contribution
- this is an extension of the image diffusion model to video
- they introduce a new conditional sampling technique for spatial and temporal video extension that performs better than previously proposed methods.
Some key terms
Diffusion model
- A diffusion model specified in continuous time is a generative model with latents $z = \{z_t \mid t \in [0, 1]\}$ following a forward noising process $q(z \mid x)$ that starts from the data $x \sim p(x)$
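- assuming the standard continuous-time formulation used in the paper, the forward process can be written as $q(z_t \mid x) = \mathcal{N}(z_t;\, \alpha_t x,\, \sigma_t^2 I)$, where the log signal-to-noise ratio $\lambda_t = \log(\alpha_t^2 / \sigma_t^2)$ decreases monotonically in $t$, so $z_0$ stays close to the data and $z_1$ is approximately pure Gaussian noise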
Training diffusion model
- learning to reverse the forward process for generation can be reduced to learning to denoise $z_t \sim q(z_t|x)$ into an estimate $\hat x_\theta (z_t, \lambda_t) \approx x$ for all $t$ (the dependence on $\lambda_t$ is dropped below to simplify notation). We train this denoising model $\hat x_\theta$ using a weighted MSE loss, sketched after this list
- this reduction of generation to denoising can be justified as optimising a weighted variational lower bound on the data log likelihood under the diffusion model.
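A minimal sketch of the weighted denoising objective, assuming a variance-preserving parameterisation ($\alpha_t^2 = \mathrm{sigmoid}(\lambda_t)$, $\sigma_t^2 = \mathrm{sigmoid}(-\lambda_t)$) and a hypothetical x-prediction network `x_hat_model`; this illustrates the reduction of training to denoising, not the paper's exact training code:

```python
import torch

def denoising_loss(x_hat_model, x, lambda_t, weight_fn=None):
    """Weighted MSE denoising loss for a continuous-time diffusion model.

    x:         clean data, shape (B, ...)
    lambda_t:  per-example log-SNR, shape (B,)
    weight_fn: optional weighting w(lambda_t); defaults to 1 (unweighted)
    """
    lam = lambda_t.view(-1, *([1] * (x.dim() - 1)))           # broadcastable log-SNR
    alpha_t = torch.sigmoid(lam).sqrt()                       # variance-preserving alpha_t
    sigma_t = torch.sigmoid(-lam).sqrt()                      # and sigma_t
    eps = torch.randn_like(x)
    z_t = alpha_t * x + sigma_t * eps                         # sample z_t ~ q(z_t | x)
    x_hat = x_hat_model(z_t, lambda_t)                        # denoised estimate \hat{x}_theta
    per_example = (x_hat - x).pow(2).flatten(1).mean(dim=1)   # MSE per example
    if weight_fn is not None:
        per_example = weight_fn(lambda_t) * per_example       # weighted reconstruction term
    return per_example.mean()
```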
Effective sampling with the new method for conditional generation: predictor-corrector sampler
- in the conditional generation setting, the data $x$ is equipped with a conditional signal $c$, which may represent a text caption. To train a diffusion model to fit $p(x|c)$, the only modification that needs to be made is to provide $c$ to the model as $\hat x_\theta (z_t, c)$
Improvements to sample quality by classifier-free guidance
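As a reminder of how classifier-free guidance works, here is a sketch assuming an $\epsilon$-prediction model `eps_model` trained with conditioning dropout so it can also be queried unconditionally; the function name and signature are illustrative, not the paper's code:

```python
def classifier_free_guidance(eps_model, z_t, lambda_t, c, guidance_weight):
    """Guided noise prediction: (1 + w) * eps(z_t, c) - w * eps(z_t).

    guidance_weight = 0 recovers the ordinary conditional model; larger values
    trade sample diversity for fidelity to the conditioning signal c.
    """
    eps_cond = eps_model(z_t, lambda_t, c)       # conditional prediction
    eps_uncond = eps_model(z_t, lambda_t, None)  # unconditional prediction (c dropped)
    return (1.0 + guidance_weight) * eps_cond - guidance_weight * eps_uncond
```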
Video diffusion model
Architecture and conditioning
- the standard diffusion model is a U-Net
- which is a neural network architecture constructed as a spatial downsampling pass followed by a spatial upsampling pass with skip connections to the downsampling pass activations.
- The network is built from layers of 2D convolutional residual blocks, and each convolutional block is followed by a spatial attention block.
- Conditioning information, such as $c$, is provided to the network in the form of an embedding vector (sentence embedding https://huggingface.co/sentence-transformers/clip-ViT-L-14), added to each residual block (they find it helpful to process these embedding vectors with several MLP layers before adding them)
- For video data, we use a particular type of 3D U-Net that is factorised over space and time.
- space-only 3D convolution
- for instance, we change each 3x3 convolution into a 1x3x3 convolution (the first axis indexes video frames, the second and third index the spatial height and width)
- the attention in each spatial attention block remains attention over space (i.e., the first axis is treated as a batch axis)
- after each spatial attention block, we further insert a temporal attention block that performs attention over the first axis and treats the spatial axes as batch axes (a minimal sketch of this factorised block is given after this list)
- this separation over space and time is good for computational efficiency
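A minimal PyTorch sketch of this factorised space-time block (not the paper's exact module: relative position embeddings and the conditioning pathway are omitted, and the layer names are illustrative):

```python
import torch
import torch.nn as nn

class FactorizedSpaceTimeBlock(nn.Module):
    """Space-only 1x3x3 convolution, then spatial attention with the frame axis
    folded into the batch, then temporal attention with the spatial axes folded
    into the batch."""

    def __init__(self, channels, num_heads=8):
        super().__init__()
        # space-only 3D convolution: kernel (1, 3, 3) over (frames, height, width)
        self.conv = nn.Conv3d(channels, channels, kernel_size=(1, 3, 3), padding=(0, 1, 1))
        self.spatial_attn = nn.MultiheadAttention(channels, num_heads, batch_first=True)
        self.temporal_attn = nn.MultiheadAttention(channels, num_heads, batch_first=True)

    def forward(self, x):                       # x: (B, C, T, H, W)
        b, c, t, h, w = x.shape
        x = x + self.conv(x)                    # residual space-only convolution

        # spatial attention: treat the frame axis T as part of the batch
        s = x.permute(0, 2, 3, 4, 1).reshape(b * t, h * w, c)
        s = s + self.spatial_attn(s, s, s, need_weights=False)[0]

        # temporal attention: treat the spatial axes as part of the batch
        s = s.reshape(b, t, h * w, c).permute(0, 2, 1, 3).reshape(b * h * w, t, c)
        s = s + self.temporal_attn(s, s, s, need_weights=False)[0]

        # restore (B, C, T, H, W)
        return s.reshape(b, h * w, t, c).permute(0, 3, 2, 1).reshape(b, c, t, h, w)
```

For example, `FactorizedSpaceTimeBlock(64)(torch.randn(2, 64, 16, 32, 32))` returns a tensor of the same shape; masking out the temporal attention recovers a per-frame image model, which is what makes the factorisation convenient.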
Text-conditioned video generation
Hyperparameters
Potential future work
- Maybe we can also use this training method and architecture to pretrain our image-action-text multimodal model
- we can combine this with the latent diffusion model to increase computational efficiency.