[TOC]

  1. Title: An Efficient Spatio-Temporal Pyramid Transformer for Action Detection
  2. Author: Yuetian Weng et al.
  3. Publish Year: Jul 2022
  4. Review Date: Thu, Oct 20, 2022

Summary of paper

Motivation

Background

Efficient Spatio-temporal Pyramid Transformer

*[figure: overview of the Efficient Spatio-temporal Pyramid Transformer architecture]*

Some key terms

Temporal feature pyramid network

MViT

Relation to existing video Transformers

  1. While other video Transformers rely on separate space-time attention factorization, this method encodes target motions by jointly aggregating spatio-temporal relations, without losing spatio-temporal correspondence.
  2. It applies local self-attention, which lowers the computational cost.
  3. LSTA (local spatio-temporal attention) is data-dependent and flexible in terms of window size.
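To make the cost argument concrete, here is a minimal NumPy sketch of windowed spatio-temporal self-attention. This is not the paper's LSTA: it uses a fixed, non-overlapping window and identity Q/K/V projections purely for illustration, whereas LSTA is data-dependent. The window size `(t, h, w)` and all function names are assumptions for this toy example.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def local_st_attention(x, window=(2, 4, 4)):
    """Toy local spatio-temporal attention.

    x: (T, H, W, C) feature map; attention is computed only within
    non-overlapping spatio-temporal windows of size window=(t, h, w).
    """
    T, H, W, C = x.shape
    t, h, w = window
    assert T % t == 0 and H % h == 0 and W % w == 0
    # Partition into windows: (num_windows, t*h*w, C)
    xw = x.reshape(T // t, t, H // h, h, W // w, w, C)
    xw = xw.transpose(0, 2, 4, 1, 3, 5, 6).reshape(-1, t * h * w, C)
    # Single-head attention with identity projections (toy simplification)
    attn = softmax(xw @ xw.transpose(0, 2, 1) / np.sqrt(C))
    out = attn @ xw
    # Reverse the window partition back to (T, H, W, C)
    out = out.reshape(T // t, H // h, W // w, t, h, w, C)
    out = out.transpose(0, 3, 1, 4, 2, 5, 6).reshape(T, H, W, C)
    return out

x = np.random.randn(4, 8, 8, 16)
y = local_st_attention(x)
print(y.shape)  # (4, 8, 8, 16)
```

The point of the sketch: each window attends over only `t*h*w` tokens, so the attention cost scales with `(t*h*w)^2` per window rather than `(T*H*W)^2` globally, which is why local attention is much cheaper.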

Additional Illustration of LSTA

*[figure: additional illustration of LSTA]*

Temporal Feature Pyramid

Refinement

After the network predicts a class label $\hat y_i^C$ and boundary distances $(\hat b_i^s, \hat b_i^e)$, it further predicts an offset $(\Delta \hat b_i^s, \Delta \hat b_i^e)$ and a refined action category label $\hat y_i^R$.

In the final prediction, we have the following form

*[equation image: final prediction formula]*
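As a hedged reconstruction (the exact formula is in the embedded image, which I cannot verify here): if the offsets simply correct the coarse boundary distances, the refined boundaries would read

$$
b_i^s = \hat b_i^s + \Delta \hat b_i^s, \qquad
b_i^e = \hat b_i^e + \Delta \hat b_i^e,
$$

with the final classification score obtained by fusing the coarse label $\hat y_i^C$ and the refinement label $\hat y_i^R$ (the specific fusion rule is an assumption to check against the paper).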

Potential future work

We may use this architecture to construct the lang rew module.