[TOC]

  1. Title: Masked World Models for Visual Control 2022
  2. Author: Younggyo Seo et al.
  3. Publish Year: 2022
  4. Review Date: Fri, Jul 1, 2022

https://arxiv.org/abs/2206.14244?context=cs.AI

https://sites.google.com/view/mwm-rl

Summary of paper

Motivation

TL;DR: Masked autoencoders (MAE) have emerged as a scalable and effective self-supervised learning technique. Can MAE also be effective for visual model-based RL? Yes, with the recipe of convolutional feature masking and reward prediction to capture fine-grained and task-relevant information.

Some key terms

Decouple visual representation learning and dynamics learning

compared to DreamerV2, the new model tries to decouple visual representation learning and dynamics learning in model-based RL.

  • specifically, for visual learning, we train an autoencoder with convolutional layers and a vision transformer (ViT) to reconstruct pixels given masked convolutional features (see the sketch after this list)
    • we also introduce an auxiliary reward-prediction objective for the observation autoencoder
  • for the model-based RL part, we learn a latent dynamics model that operates on the representations from the autoencoder.
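
A minimal PyTorch-style sketch of this visual learning recipe, with illustrative module names, sizes, and pooling choices that are my assumptions rather than the paper's actual architecture: a conv stem turns the image into feature tokens, a random subset of tokens is masked, a ViT-style encoder/decoder reconstructs per-token pixel patches, and a reward head provides the auxiliary objective.

```python
import torch
import torch.nn as nn

class MaskedConvViTAutoencoder(nn.Module):
    """Sketch of the visual-representation half of MWM: a small conv stem
    produces feature tokens, a random subset is masked, and a ViT-style
    encoder/decoder reconstructs pixels; a reward head is trained jointly.
    All sizes/choices here are illustrative, not the paper's."""
    def __init__(self, img_size=64, patch=8, dim=256, mask_ratio=0.75):
        super().__init__()
        self.mask_ratio = mask_ratio
        # Early convolutional layers: three stride-2 convs so the feature map
        # has (img_size / patch)^2 spatial positions -> one token each.
        self.conv_stem = nn.Sequential(
            nn.Conv2d(3, 64, 4, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(64, 128, 4, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(128, dim, 4, stride=2, padding=1),
        )
        n_tokens = (img_size // patch) ** 2
        self.pos = nn.Parameter(torch.zeros(1, n_tokens, dim))
        enc_layer = nn.TransformerEncoderLayer(dim, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(enc_layer, num_layers=4)
        self.mask_token = nn.Parameter(torch.zeros(1, 1, dim))
        dec_layer = nn.TransformerEncoderLayer(dim, nhead=4, batch_first=True)
        self.decoder = nn.TransformerEncoder(dec_layer, num_layers=2)
        self.to_pixels = nn.Linear(dim, patch * patch * 3)  # per-token pixel patch
        self.reward_head = nn.Linear(dim, 1)                # auxiliary reward prediction

    def forward(self, img):
        feats = self.conv_stem(img)                   # (B, dim, H', W')
        tokens = feats.flatten(2).transpose(1, 2)     # (B, N, dim) feature tokens
        tokens = tokens + self.pos
        B, N, D = tokens.shape
        # Randomly keep a subset of conv-feature tokens (MAE-style masking,
        # but applied to conv features rather than raw pixel patches).
        keep = int(N * (1 - self.mask_ratio))
        perm = torch.rand(B, N, device=img.device).argsort(dim=1)
        keep_idx = perm[:, :keep]
        visible = torch.gather(tokens, 1, keep_idx.unsqueeze(-1).expand(-1, -1, D))
        latent = self.encoder(visible)                # encode only visible tokens
        # Scatter encoded tokens back; masked positions get the mask token.
        full = self.mask_token.expand(B, N, D).clone()
        full.scatter_(1, keep_idx.unsqueeze(-1).expand(-1, -1, D), latent)
        decoded = self.decoder(full + self.pos)
        pixel_pred = self.to_pixels(decoded)                  # reconstruct patches
        reward_pred = self.reward_head(decoded.mean(dim=1))   # pooled tokens -> reward
        return pixel_pred, reward_pred
```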


Early convolutional layers and masking out convolutional features instead of pixel patches

  • this approach enables the world model to capture fine-grained visual details from complex visual observations.


  • we compare convolutional feature masking with pixel masking (i.e., MAE-style patch masking) and find that convolutional feature masking significantly outperforms pixel masking (see the sketch below)
    • some possible reasons:
      • raw pixels are noisy and less focused on task-relevant structure
      • convolutional features are more robust
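
A minimal sketch contrasting the two masking schemes. For illustration it zeroes out masked regions; the function names and the zeroing convention are assumptions (actual MAE/MWM implementations drop masked tokens from the encoder input rather than zeroing them).

```python
import torch

def pixel_patch_mask(img, patch=8, mask_ratio=0.75):
    """MAE-style masking: hide raw pixel patches before any encoding."""
    B, C, H, W = img.shape
    ph, pw = H // patch, W // patch
    mask = (torch.rand(B, 1, ph, pw, device=img.device) < mask_ratio).float()
    mask = mask.repeat_interleave(patch, dim=2).repeat_interleave(patch, dim=3)
    return img * (1 - mask)   # masked pixels are simply zeroed here

def conv_feature_mask(feats, mask_ratio=0.75):
    """MWM-style masking: hide positions of the conv feature map instead, so
    every surviving token already summarizes a local receptive field."""
    B, D, H, W = feats.shape
    mask = (torch.rand(B, 1, H, W, device=feats.device) < mask_ratio).float()
    return feats * (1 - mask)
```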

Potential future work

We could use this model in our project:

  • Conv feature masking
  • Dynamics learning (see the sketch below)
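
As a starting point for the dynamics-learning part, a minimal sketch of a latent dynamics model operating on representations from the (frozen) autoencoder. The deterministic GRU-based design and all names here are my assumptions; MWM itself learns a stochastic RSSM-style latent dynamics model.

```python
import torch
import torch.nn as nn

class LatentDynamics(nn.Module):
    """Sketch of a latent dynamics model over frozen autoencoder
    representations, in the spirit of MWM's decoupled training."""
    def __init__(self, repr_dim=256, action_dim=6, hidden=256):
        super().__init__()
        self.rnn = nn.GRUCell(repr_dim + action_dim, hidden)
        self.to_repr = nn.Linear(hidden, repr_dim)   # predict next representation
        self.to_reward = nn.Linear(hidden, 1)        # predict reward for planning

    def forward(self, reprs, actions):
        # reprs: (T, B, repr_dim) from the autoencoder; actions: (T, B, action_dim)
        h = torch.zeros(reprs.shape[1], self.rnn.hidden_size, device=reprs.device)
        pred_reprs, pred_rewards = [], []
        for z, a in zip(reprs, actions):
            h = self.rnn(torch.cat([z, a], dim=-1), h)
            pred_reprs.append(self.to_repr(h))
            pred_rewards.append(self.to_reward(h))
        return torch.stack(pred_reprs), torch.stack(pred_rewards)
```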