[TOC]

  1. Title: Planning With Diffusion for Flexible Behaviour Synthesis
  2. Author: Michael Janner et al.
  3. Publish Date: 21 Dec 2022
  4. Review Date: Mon, Jan 30, 2023

Summary of paper


Motivation

  • use a diffusion model to learn the dynamics
  • tight coupling of modelling and planning
  • our goal is to break this abstraction barrier by designing a model and planning algorithm that are trained alongside one another, resulting in a non-autoregressive trajectory-level model for which sampling and planning are nearly identical.

Some key terms

ideal model-based RL

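For reference, the trajectory optimisation problem the paper treats as the ideal case, reconstructed in the paper's notation (reward $r$, dynamics $f$):

$$
\tau^* = \arg\max_{a_{0:T}} \mathcal{J}(s_0, a_{0:T}) = \arg\max_{a_{0:T}} \sum_{t=0}^{T} r(s_t, a_t) \quad \text{s.t.} \quad s_{t+1} = f(s_t, a_t)
$$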

Why neural nets + trajectory optimisation is a headache

  • long-horizon predictions are unreliable: single-step errors compound when the model is rolled out autoregressively
  • optimising for reward against a learned neural-net model produces adversarial examples in trajectory space, i.e. trajectories that exploit model errors rather than genuinely earning high reward

A generative model of trajectories

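Diffuser represents a trajectory as a single two-dimensional array of states and actions, which the diffusion model denoises as a whole (reconstructed from the paper):

$$
\tau = \begin{bmatrix} s_0 & s_1 & \cdots & s_T \\ a_0 & a_1 & \cdots & a_T \end{bmatrix}
$$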

What is an autoregressive model

  • it predicts each future value conditioned on previously generated values.
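
For contrast, a standard single-step dynamics model factorises the trajectory one timestep at a time, which is why its errors compound over long horizons:

$$
p(\tau) = \prod_{t=0}^{T-1} p(s_{t+1} \mid s_t, a_t)
$$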

Non-autoregressive prediction

  • prediction is non-autoregressive: the entire trajectory is denoised simultaneously, rather than one timestep at a time

Sampling from diffuser

  • sampling occurs by iteratively refining randomly initialised trajectories, as sketched below
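
Concretely, this is the standard reverse diffusion process, run over whole trajectory arrays instead of images:

$$
\tau^N \sim \mathcal{N}(0, I), \qquad \tau^{i-1} \sim p_\theta(\tau^{i-1} \mid \tau^i) \quad \text{for } i = N, \dots, 1
$$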

Flexible behaviour synthesis through distribution composition

  • guidance functions transform an unconditional trajectory model into a conditional policy for diverse tasks.
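
Reconstructed from the paper's formulation (which follows classifier-guided diffusion): the perturbation $h(\tau)$ multiplies the learned prior, and sampling shifts each reverse-step mean along the gradient of $\log h$:

$$
\tilde{p}_\theta(\tau) \propto p_\theta(\tau)\, h(\tau), \qquad p_\theta(\tau^{i-1} \mid \tau^i, h) \approx \mathcal{N}(\mu + \Sigma g, \Sigma), \quad g = \nabla_\tau \log h(\tau) \big|_{\tau = \mu}
$$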

Goal planning through inpainting

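A hedged reconstruction of the idea: a goal is expressed as a Dirac-delta perturbation that puts all probability mass on trajectories whose final state equals the goal $c_T$,

$$
h(\tau) = \delta_{c_T}(s_T),
$$

which is implemented not by gradients but by clamping: after each denoising step the sampled $s_T$ is overwritten with $c_T$, exactly as known pixels are held fixed in image inpainting.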

Tight coupling between modelling and planning

  • it requires finding trajectories that are both physically realistic under $p_\theta(\tau)$ and high-reward (or constraint-satisfying) under $h(\tau)$
    • because the dynamics information is separated from the perturbation distribution $h(\tau)$, a single diffusion model $p_\theta(\tau)$ may be reused for multiple tasks in the same environment.

Algorithm

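The algorithm corresponds roughly to the following guided-planning loop. This is a minimal sketch, not the authors' released code: the `model`, `guide`, and `sigmas` interfaces and the `[state, action]` row layout are all assumptions.

```python
import torch

def guided_plan(model, guide, obs, horizon, transition_dim,
                observation_dim, n_diffusion_steps, sigmas):
    """One planning call: denoise a random trajectory array under guidance.

    Assumed (hypothetical) interfaces, not the authors' API:
      model(tau, i) -> mean of p_theta(tau^{i-1} | tau^i)
      guide(tau)    -> log h(tau), one scalar per batch element
      sigmas[i]     -> noise scale of the reverse step at diffusion time i
    Each trajectory row is assumed to store [state, action], state first.
    """
    tau = torch.randn(1, horizon, transition_dim)   # start from pure noise
    for i in reversed(range(n_diffusion_steps)):
        tau[:, 0, :observation_dim] = obs           # inpaint the current state

        # g = gradient of log h(tau), evaluated at the predicted mean mu
        mu = model(tau, i).detach().requires_grad_(True)
        g = torch.autograd.grad(guide(mu).sum(), mu)[0]

        # guided reverse step: shift the mean along the guidance gradient
        noise = torch.randn_like(tau) if i > 0 else torch.zeros_like(tau)
        tau = (mu + sigmas[i] ** 2 * g + sigmas[i] * noise).detach()

    tau[:, 0, :observation_dim] = obs
    return tau  # execute tau[0, 0, observation_dim:], then re-plan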

Goal-conditioned RL as Inpainting

  • some planning problems are more naturally posed as constraint satisfaction than reward maximization.
  • In practice, this may be implemented by sampling from the unperturbed reverse process $\tau^{i-1} \sim p_\theta(\tau^{i-1} \mid \tau^i)$ and replacing the sampled values with conditioning values $c_t$ after all diffusion timesteps $i \in \{0, 1, \dots, N\}$
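
A minimal sketch of that clamping step, assuming trajectories are `(batch, horizon, transition_dim)` arrays with the state stored first in each row (the layout and names are illustrative, not the authors' code):

```python
def apply_conditioning(tau, conditions, observation_dim):
    """Overwrite sampled states with known values after a denoising step.

    conditions maps timesteps to known states, e.g.
    {0: current_state, horizon - 1: goal_state}.
    Works on any array supporting slice assignment (torch / numpy).
    """
    for t, state in conditions.items():
        tau[:, t, :observation_dim] = state  # clamp state dims at timestep t
    return tau
```

Calling this after every reverse-diffusion step keeps the start (and goal) states fixed while the rest of the trajectory is denoised around them.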

Task compositionality

  • while Diffuser contains information about both environment dynamics and behaviours, it is independent of the reward function; because the model acts as a prior over possible futures, planning can be guided by comparatively lightweight perturbation functions $h(\tau)$, or even combinations of multiple perturbations
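
Since composed perturbations multiply, their guidance gradients simply add, which is what makes combining lightweight task specifications cheap (a hedged restatement in the paper's notation):

$$
\tilde{p}_\theta(\tau) \propto p_\theta(\tau) \prod_k h_k(\tau) \quad \Longrightarrow \quad \nabla_\tau \log \prod_k h_k(\tau) = \sum_k \nabla_\tau \log h_k(\tau)
$$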