[TOC]
- Title: Width-based Lookaheads with Learnt Base Policies and Heuristics Over the Atari-2600 Benchmark
- Author: Stefan O’Toole et al.
- Publish Year: 2021
- Review Date: Tue 16 Nov 2021
Summary of paper
This paper proposes a new width-based planning and learning agent that can play Atari-2600 games (though it cannot play Montezuma’s Revenge). The author claims that width-based planning for exploration, combined with greedy exploitation of the learned (optimal MDP) policy, achieves better performance than Monte-Carlo Tree Search.
The core algorithm for the Atari agent
The procedure of width-based planning is the following (see the sketch after this list):
- Form a lookahead tree. Maintaining the tree requires the agent to have access to the simulator (i.e., the simulator must provide future states to the agent before it takes the actual move, so the agent can foresee them).
- To search for useful potential next states, the agent uses width-based planning to expand the tree.
- The width-based planner checks whether each state is novel, and it only explores novel states.
- After deciding which state to explore, the pointer moves to that state and the agent starts a new round of exploration (loop).
- Once a terminal state is reached or the computational budget is used up, the exploration stops and the lookahead tree is complete.
- The agent selects the action that maximises the accumulated reward in the lookahead tree.
- The agent then performs that action in the real game environment (even though it has already explored the environment during the planning phase).
- At test time, the agent cannot access the simulator to build a lookahead tree (perhaps it could, but in practice the agent cannot execute actions multiple times and replay the task), so it also trains a neural network to mimic the planning-phase policy.
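To make the procedure concrete, here is a minimal Python sketch of one planning step. It is not the authors’ exact implementation: `simulator.step(state, action)`, the `features(state)` atom extractor, the action set, and the expansion budget are all hypothetical stand-ins.

```python
from collections import deque

def plan_one_move(simulator, root_state, features, actions, budget=1000):
    """One planning step: expand an IW(1)-style lookahead tree, then return
    the root action with the highest accumulated reward found."""
    seen_atoms = set(features(root_state))        # novelty table for IW(1)
    # Each frontier entry: (state, first action taken from the root, return)
    frontier = deque([(root_state, None, 0.0)])
    best_action, best_return = None, float("-inf")
    expansions = 0
    while frontier and expansions < budget:       # stop when budget runs out
        state, first_action, ret = frontier.popleft()
        expansions += 1
        for a in actions:
            next_state, reward, done = simulator.step(state, a)
            root_a = a if first_action is None else first_action
            next_ret = ret + reward
            if next_ret > best_return:
                best_return, best_action = next_ret, root_a
            if done:                              # terminal state: prune branch
                continue
            new_atoms = set(features(next_state)) - seen_atoms
            if new_atoms:                         # novel state: keep exploring
                seen_atoms |= new_atoms
                frontier.append((next_state, root_a, next_ret))
    return best_action                            # executed in the real env
```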
Some key terms
Width-based planner
A width-based planner must have a lookahead memory space that allows the agent to plan its moves (see the procedure explanation in the previous section).
When the width-based planner plans its moves, it prefers states that are novel. IW(1) considers a state in the lookahead novel if it is the first state within the lookahead to make a particular feature within the feature set true.
Therefore the width-based planner helps to explore the game environment, because it always expands novel states.
Once it reaches a terminal state or the computational budget is used up, the current exploration round ends.
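A minimal sketch of the IW(1) novelty test described above; the atom sets are hypothetical stand-ins for the paper’s feature sets:

```python
def is_novel_iw1(state_atoms, seen_atoms):
    """IW(1) novelty test: a state is novel iff it makes at least one
    feature (atom) true for the first time in this lookahead.

    state_atoms: set of features true in the candidate state
    seen_atoms: features already made true earlier in the lookahead"""
    new_atoms = state_atoms - seen_atoms
    if new_atoms:
        seen_atoms |= new_atoms   # record them so later duplicates are pruned
        return True
    return False
```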
Novelty features
Novelty features are essential for a width-based planner. The author suggests several novelty features for the Atari games (a toy feature extractor is sketched after this list):
- The pixel difference of the game display (simple and vanilla)
- B-PROST — captures temporal and spatial relationships between past and present screen pixels
- The latent representation of the game display obtained by a Variational Auto-Encoder (VAE)
- RAM of the game (this is tricky)
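As an illustration of the simplest option, here is a hypothetical pixel-bucketing extractor that turns the screen into Boolean atoms. The bucket count and layout are my own assumptions, not the paper’s; B-PROST or VAE-latent features would replace this function without changing the planner.

```python
import numpy as np

def pixel_atoms(screen, n_buckets=8):
    """Turn a 2-D uint8 grayscale screen into a set of Boolean atoms,
    one (row, col, intensity_bucket) triple per pixel."""
    buckets = (screen.astype(np.int32) * n_buckets) // 256
    rows, cols = np.indices(screen.shape)
    return set(zip(rows.ravel(), cols.ravel(), buckets.ravel()))
```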
Critical Path
The author refers to the “critical path” as the trajectory formed by the optimal state-action pairs in the lookahead memory.
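Assuming the lookahead tree stores parent pointers, extracting the critical path could look like the following sketch (the node attributes are hypothetical):

```python
def critical_path(best_leaf):
    """Walk from the best leaf (highest accumulated reward) back to the
    root via parent pointers, returning (state, action) pairs in
    root-to-leaf order."""
    path = []
    node = best_leaf
    while node.parent is not None:
        path.append((node.parent.state, node.action_from_parent))
        node = node.parent
    path.reverse()
    return path
```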
Learning schedule
The author wants to ensure that the new policy learnt from the current episode is better than the previous policy.
Therefore, the author uses Welch’s t-test to check whether the performance improves. If not, the new parameters learned from this episode are dropped.
This mechanism is similar to a residual connection, in that it can prevent performance loss due to unstable training.
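A sketch of this accept/reject gate using Welch’s t-test via `scipy.stats.ttest_ind` with `equal_var=False`; the one-sided significance threshold is an illustrative assumption, not the paper’s setting:

```python
from scipy.stats import ttest_ind

def keep_new_policy(old_scores, new_scores, alpha=0.05):
    """Accept the new parameters only if the new policy's episode scores
    are significantly higher than the old policy's; otherwise drop them."""
    stat, p_value = ttest_ind(new_scores, old_scores, equal_var=False)
    # One-sided test for improvement: positive statistic and halved p-value
    return stat > 0 and p_value / 2 < alpha
```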
Online Planning
Online planning means the planner tries to find a good path given partial information. It requires real-time access to the simulator to obtain new information and expand its knowledge.
Good things about the paper (one paragraph)
The paper explains the implementation of the width-based planning algorithm in detail.
Major comments
The abstract says: “This analysis of the games provides further insight into the behaviour and performance of the algorithms introduced.”
I would like more concrete information about what those insights are, though I am not sure whether explaining them in more detail in the abstract would help.
Minor comments
The explanation of the algorithm is a little messy.
Incomprehension
There is no clear mathematical proof that width-based planning is better than traditional Monte-Carlo Tree Search. I believe this work competes with MCTS, but there is not much comparison with it in this paper.
Potential future work
- A more thorough empirical (and, ideally, theoretical) comparison against MCTS baselines, as noted in the previous section, would make the claimed advantage of width-based planning more convincing.