NOTES: for the IGLU environment, we can test:
- Winner model
- Decision Transformer
- Modular RL agent
Background
We consider a multitask problem in a shared environment.
The environment is specified by a tuple
$(S, A, P, \gamma)$, with $S$ a set of states, $A$ a set of low-level actions (actions are shared across all tasks),
$P: S\times A\times S \rightarrow \mathbb R$ a transition probability distribution,
and $\gamma$ a discount factor.
Each task $\tau \in \mathcal T$ is then specified by a pair $(R_{\tau}, \rho_\tau)$, with $R_\tau$ a task-specific reward function and $\rho_\tau$ an initial distribution over states.
We also assume that tasks are annotated with sketches $K_\tau$; each sketch consists of a sequence $(b_{\tau 1}, b_{\tau 2}, b_{\tau 3}, \ldots)$ of high-level symbolic labels drawn from a fixed vocabulary $\mathcal B$.
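To make this setup concrete, here is a minimal sketch (my own illustration, not from the paper) of how tasks and sketches could be represented; all names (`Task`, `B_VOCAB`, ...) are assumptions:

```python
# Illustrative data structures for the multitask-with-sketches setup.
from dataclasses import dataclass
from typing import Any, Callable, Sequence

# Fixed vocabulary B of high-level symbolic labels (subtask names); example symbols only.
B_VOCAB = ["navigate", "pick_block", "place_block"]

@dataclass
class Task:
    name: str                                  # tau
    reward_fn: Callable[[Any], float]          # R_tau: state -> reward
    initial_state_sampler: Callable[[], Any]   # rho_tau: () -> initial state
    sketch: Sequence[str]                      # K_tau = (b_tau1, b_tau2, ...), each in B_VOCAB
```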
Model
The model constructs, for each high-level symbol $b$, a corresponding sub-policy $\pi_b$.
Each sub-policy $\pi_b$ is shared across tasks.
- So the authors argue that, by sharing each sub-policy across all tasks, the approach naturally learns a shared abstraction for the corresponding subtask.
At each time step $t$, a sub-policy may select either a low-level action $a \in A$ or a special STOP action.
- Potential issue:
- The agent's actions are composed in a very complex way; e.g., in IGLU the agent moves its view (camera), walks, and places building blocks at the same time.
- So it is quite hard to split the IGLU task into concrete, non-overlapping sub-tasks.
- Possible solution: can we just allow sub-policies to overlap?
This framework is agnostic to the implementation of sub-policies:
- different sub-policies can use different architectures,
- e.g., use classical planning for a navigation sub-policy.
Control is passed from $\pi_{b_{i}}$ to $\pi_{b_{i+1}}$ when the STOP signal is emitted (see the rollout sketch below).
The learning problem is to optimise over all $\theta_b$ to maximise the expected discounted reward across all tasks.
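A minimal rollout sketch of this control flow (my own illustration; `env.step` returning `(state, reward, done)` and the `Task` class from the sketch above are assumptions, not the paper's API):

```python
STOP = "STOP"

def run_composed_policy(env, task, subpolicies, max_steps=1000):
    """subpolicies maps each sketch symbol b to pi_b: state -> low-level action or STOP."""
    state = task.initial_state_sampler()
    total_reward, t = 0.0, 0
    for b in task.sketch:                    # control flows pi_{b_1} -> pi_{b_2} -> ...
        pi_b = subpolicies[b]
        while t < max_steps:
            action = pi_b(state)
            if action == STOP:               # hand control to the next sub-policy
                break
            state, reward, done = env.step(action)
            total_reward += reward           # reward given by the task-specific R_tau
            t += 1
            if done:
                return total_reward
    return total_reward
```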
Policy Optimisation
Actor-critic method
Policy-based RL supports stochastic policies, which is useful in some cases.
So, what objective do we want to optimise?
- Again, maximise the cumulative reward,
- or maximise the value of the states.
Average reward per time-step -> you may want to accumulate it over time, or just maximise the final reward.
Actually, the $\mu$ distribution is the fraction of time we spend in each state under the current policy (the stationary state distribution).
All the slides above focus on immediate reward, but we want to extend this to cumulative reward (a.k.a. the value of states).
In other words, the policy only affects the decisions; it doesn't change the dynamics of the environment.
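For reference (standard policy-gradient material, not specific to the paper), the average-reward objective with the $\mu$ distribution above, and its extension from immediate reward to long-term value via the policy gradient theorem:

$$ J(\theta) = \sum_s \mu_{\pi_\theta}(s) \sum_a \pi_\theta(a \mid s)\, R(s, a), \qquad \nabla_\theta J(\theta) = \mathbb E_{\pi_\theta}\!\left[ \nabla_\theta \log \pi_\theta(a \mid s)\, Q^{\pi_\theta}(s, a) \right]. $$

Only $\pi_\theta$ is differentiated here; the dynamics and $\mu$ enter only through the expectation, which is the point of the remark above.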
Reduce the variance to make the optimisation process more stable.
The expectation is taken over trajectories.
The "baseline" property allows us to subtract from the return any term that does not depend on the action $A$, without changing the expected gradient.
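Concretely, for any baseline $B(s)$ that does not depend on the action:

$$ \mathbb E_{\pi_\theta}\!\left[\nabla_\theta \log \pi_\theta(a \mid s)\, B(s)\right] = \sum_s \mu_{\pi_\theta}(s)\, B(s) \sum_a \nabla_\theta \pi_\theta(a \mid s) = \sum_s \mu_{\pi_\theta}(s)\, B(s)\, \nabla_\theta \underbrace{\sum_a \pi_\theta(a \mid s)}_{=1} = 0. $$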
Continuing with the baseline function:
This is on-policy TD learning.
(BTW, in some cases this is exactly the same update as neural Q-learning.)
Be careful with the updates, because you are changing the policy as you go.
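A standard form of this on-policy TD actor-critic update (critic parameters $w$, actor parameters $\theta$; the generic textbook version, not the paper's exact rule):

$$ \delta_t = r_{t+1} + \gamma V_w(s_{t+1}) - V_w(s_t), \qquad w \leftarrow w + \beta\, \delta_t\, \nabla_w V_w(s_t), \qquad \theta \leftarrow \theta + \alpha\, \delta_t\, \nabla_\theta \log \pi_\theta(a_t \mid s_t). $$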
Back to the paper
So the paper computes gradient steps of the following form.
For a fixed sequence $\{(s_i, a_i)\}$ of states and actions obtained from a rollout of a given policy, we denote the empirical return starting in state $s_i$ as $q_i := \sum_{j=i+1} \gamma^{j-i-1}R(s_j)$.
(OK, this full return gives high variance but low bias.)
The baseline $c$ can achieve close-to-optimal variance when it is set exactly equal to the state value function $V_{\pi}(s_i) = \mathbf E_{\pi}\, q_i$ for the target policy $\pi$ starting in state $s_i$.
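A small helper (my own illustration) that computes these empirical returns backwards along a rollout, using the recursion $q_i = R(s_{i+1}) + \gamma\, q_{i+1}$:

```python
def empirical_returns(rewards, gamma):
    """rewards[j] is R(s_j) along the rollout; returns q_i = sum_{j>i} gamma^{j-i-1} R(s_j)."""
    q = [0.0] * len(rewards)
    running = 0.0                                   # q for the last step: no rewards beyond the rollout
    for i in reversed(range(len(rewards) - 1)):
        running = rewards[i + 1] + gamma * running  # q_i = R(s_{i+1}) + gamma * q_{i+1}
        q[i] = running
    return q
```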
In modular policies, we only have one critic per task.
- Because sub-policies $\pi_b$ might participate in a number of composed policies $\Pi_{\tau}$, each associated with its own reward function $R_{\tau}$, individual sub-policies are not uniquely identified with value functions.
The gradient for sub-policy $\pi_{b}$:
For the critic, we approximate it using a network with parameters $\eta_{\tau}$.
We train it by minimising a squared-error criterion, with gradients given by:
(Note that this is a minimisation, while the previous reward objective is a maximisation.)
Note that this baseline function (i.e., the critic) is $c_{\tau}(s_i)$, so it depends on both the task and the state identity.
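As far as I can reconstruct them (check against the paper before relying on this), the two gradients above have the standard advantage actor-critic form, with the per-task critic $c_\tau$ as baseline:

$$ \nabla_{\theta_b} J \approx \sum_i \nabla_{\theta_b} \log \pi_b(a_i \mid s_i)\, \big(q_i - c_\tau(s_i)\big), \qquad \nabla_{\eta_\tau} \tfrac{1}{2}\sum_i \big(q_i - c_\tau(s_i)\big)^2 = -\sum_i \big(q_i - c_\tau(s_i)\big)\, \nabla_{\eta_\tau} c_\tau(s_i). $$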
The complete procedure for computing a single gradient step is given as follows.
(Well, we can add more tricks to stabilise the training process.)
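A rough sketch of what one such gradient step could look like, reusing empirical_returns from above; collect_rollout, critic.value, actor_opt.step and critic_opt.step are hypothetical placeholders, not the paper's code:

```python
# Illustrative single gradient step: roll out each sampled task, then update every
# sub-policy pi_b with its advantage-weighted gradient and the per-task critic c_tau
# with the squared-error gradient.
def gradient_step(tasks, env, subpolicies, critics, gamma, actor_opt, critic_opt):
    for task in tasks:                                        # a sampled batch of tasks
        # collect_rollout is assumed to record which sub-policy b acted at each step
        states, actions, symbols, rewards = collect_rollout(env, task, subpolicies)
        q = empirical_returns(rewards, gamma)
        critic = critics[task.name]                           # one critic c_tau per task
        for s, a, b, q_i in zip(states, actions, symbols, q):
            advantage = q_i - critic.value(s)                 # baseline c_tau(s_i)
            actor_opt.step(subpolicies[b], s, a, advantage)   # actor update for pi_b only
            critic_opt.step(critic, s, q_i)                   # squared-error critic update
```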
Curriculum learning
This is the part where we can use "Width".
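If I remember the paper's curriculum correctly (gradually increase the maximum sketch length and favour tasks the agent still fails on), a task sampler could look like the following sketch; the success-rate bookkeeping is my own assumption:

```python
import random

def sample_task(tasks, success_rate, max_sketch_len):
    """Curriculum sampler sketch: only tasks within the current sketch-length limit,
    weighted towards tasks with low estimated success rate."""
    eligible = [t for t in tasks if len(t.sketch) <= max_sketch_len]
    weights = [1.0 - success_rate.get(t.name, 0.0) + 1e-3 for t in eligible]
    return random.choices(eligible, weights=weights, k=1)[0]
```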