Xin_wang Reinforced Cross Modal Matching and Self Supervised Imitation Learning for Vision Language Navigation 2019

[TOC]

Title: Reinforced Cross Modal Matching and Self Supervised Imitation Learning for Vision Language Navigation 2019
Author: Xin Wang et. al.
Publish Year:
Review Date: Wed, Jan 18, 2023

Summary of paper

Visual Language Navigation (VLN) presents some unique challenges

first, reasoning over images and natural language instructions can be difficult.
secondly, except for strictly following expert demonstrations, the feedback is rather coarse, since the “Success” feedback is provided only when the agent reaches a target position (sparse reward)
A good “instruction following” trajectory may ended up just stop before you reaching the goal state and then receive zero rewards.
existing work suffer from generalisation problem. (need to retrain the agent in new environment)

agent can infer which sub-instruction to focus on and where to look at. (automatic splitting long instruction)
with a matching critic that evaluates an executed path by the probability of reconstructing the original instruction from the executed path. P(original instruction | past trajectory)
1. cycle reconstruction: we have P(target trajectory | the instruction) = 1, and we want to measure P(original instruction | past trajectory)
2. this will enhance the interpretability as now you understand how the robot was thinking about