NOTES: for the IGLU environment, we can test

  1. Winner model
  2. Decision Transformer
  3. Modular RL agent

Background

We consider a multitask problem in a shared environment.

This environment is specified by a tuple

$(S, A, P, \gamma)$, with $S$ a set of states and $A$ a set of low-level actions (the action space is common to and shared across all tasks),

$P: S\times A\times S \rightarrow \mathbb R$ is a transition probability distribution,

$\gamma$ is a discount factor.

Each task $\tau \in \mathcal T$ is then specified by a pair $(R_{\tau}, \rho_\tau)$, with $R_\tau$ a task-specific reward function and $\rho_\tau$ an initial distribution over states.

We assume that tasks are annotated with sketches $K_\tau$; each sketch consists of a sequence $(b_{\tau 1},b_{\tau 2},b_{\tau 3},\dots)$ of high-level symbolic labels drawn from a fixed vocabulary $\mathcal B$.
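
To make the setup concrete, here is a minimal sketch of the data structures involved (the names `Task`, `reward_fn`, etc. are hypothetical, just to make the problem structure explicit; the vocabulary entries are only illustrative):

```python
from dataclasses import dataclass
from typing import Callable, List

@dataclass
class Task:
    """One task tau in the shared environment: (R_tau, rho_tau) plus its sketch K_tau."""
    name: str
    reward_fn: Callable[[object], float]          # R_tau: state -> reward
    sample_initial_state: Callable[[], object]    # rho_tau: draws an initial state
    sketch: List[str]                             # K_tau = (b_1, b_2, ...), symbols from B

# The fixed vocabulary B of high-level symbols, shared across all tasks (illustrative).
SKETCH_VOCAB = {"get wood", "use workbench", "get iron", "use factory"}
```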

Model

The model constructs, for each sketch symbol $b$, a corresponding subpolicy $\pi_b$.

This subpolicy $\pi_b$ is shared across all tasks whose sketches contain $b$.

At each time step $t$, a subpolicy may select either a low-level action $a \in A$ or a special STOP action.

This framework is agnostic to the implementation of subpolicies:

control is passed from $\pi_{b_{i}}$ to $\pi_{b_{i+1}}$ when the STOP signal is emitted.
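
A minimal sketch of this control flow, assuming a minimal `env` with `reset()` and `step(action) -> (next_state, reward, done)`, and subpolicies that map a state to either a low-level action or `STOP` (all names here are illustrative):

```python
STOP = "STOP"  # special action a subpolicy emits to hand control to the next one

def run_sketch(env, subpolicies, sketch, max_steps=1000):
    """Execute the subpolicies named in `sketch` one after another.

    `subpolicies` maps a symbol b to a callable pi_b(state) -> low-level action or STOP.
    Control passes from pi_{b_i} to pi_{b_{i+1}} exactly when STOP is emitted.
    Returns the trajectory as a list of (symbol, state, action, reward) tuples.
    """
    state = env.reset()
    trajectory, steps = [], 0
    for symbol in sketch:
        pi_b = subpolicies[symbol]
        while steps < max_steps:
            action = pi_b(state)
            if action == STOP:
                break                      # move on to the next subpolicy in the sketch
            next_state, reward, done = env.step(action)
            trajectory.append((symbol, state, action, reward))
            state, steps = next_state, steps + 1
            if done:
                return trajectory
    return trajectory
```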

image-20211226201043159

The learning problem is to optimise jointly over all $\theta_b$ to maximise the expected discounted reward across tasks.

image-20211226201319448
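
The image is not reproduced here; up to notation, the objective being maximised should be the joint expected discounted return over all tasks:

$$
\max_{\{\theta_b\}} \; \sum_{\tau \in \mathcal T} \mathbb E_{s_0 \sim \rho_\tau,\ \Pi_\tau} \Big[ \sum_{t} \gamma^{t} R_\tau(s_t) \Big],
$$

where $\Pi_\tau$ is the task policy obtained by chaining the subpolicies $\pi_{b_{\tau 1}}, \pi_{b_{\tau 2}}, \dots$ in the order given by $K_\tau$.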

Policy Optimisation

Actor-critic method

image-20211226205752184

image-20211226214759357

Policy-based RL supports stochastic policies, which is useful in some cases (e.g., when the optimal behaviour itself is stochastic, as in partially observed or adversarial settings).

image-20211227154539866

image-20211227155219600

So what objective do we want to optimise?

image-20211227160013980

average reward per time-step -> alternatively you may want to accumulate it over the episode, or just maximise the final return

Actually, the $\mu$ distribution is the stationary distribution of the Markov chain induced by the current policy, i.e. the fraction of time spent in each state.
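
For reference (my reconstruction of the standard slide notation), the three objective choices are the start value for episodic environments, and the average value or average reward per time-step for continuing environments:

$$
\begin{aligned}
J_1(\theta) &= V^{\pi_\theta}(s_1), \\
J_{avV}(\theta) &= \textstyle\sum_s \mu^{\pi_\theta}(s)\, V^{\pi_\theta}(s), \\
J_{avR}(\theta) &= \textstyle\sum_s \mu^{\pi_\theta}(s) \sum_a \pi_\theta(s,a)\, \mathcal R_s^a,
\end{aligned}
$$

where $\mu^{\pi_\theta}$ is the stationary distribution of the Markov chain induced by $\pi_\theta$.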

image-20211227183120536

image-20211227183254215

image-20211227191004452

image-20211227191237079

image-20211227191858533

image-20211228140246750

image-20211228145421057

All the slides above focus on the immediate reward, but we want to extend this to the cumulative reward (i.e., the value of states).
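
The extension is the policy gradient theorem: for any of the objectives above, replacing the immediate reward with the long-term action value gives

$$
\nabla_\theta J(\theta) = \mathbb E_{\pi_\theta}\big[\, \nabla_\theta \log \pi_\theta(s, a)\; Q^{\pi_\theta}(s, a) \,\big].
$$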

image-20211228145917522

image-20211228150043844

image-20211228150846753

In other words, the policy only affects which actions are chosen; it does not change the dynamics of the environment.

Reduce the variance to make the optimisation process more stable

image-20211228151743008

image-20211228154504687

The expectation is taken over trajectories sampled from the current policy.

The “baseline” property allows us to subtract from the return any term that does not depend on the action $A$, without biasing the gradient.
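
The reason this is legal: any state-dependent baseline $B(s)$ contributes zero to the expected gradient, since

$$
\mathbb E_{\pi_\theta}\big[\nabla_\theta \log \pi_\theta(s,a)\, B(s)\big]
= \sum_s \mu^{\pi_\theta}(s)\, B(s)\, \nabla_\theta \sum_a \pi_\theta(s,a)
= \sum_s \mu^{\pi_\theta}(s)\, B(s)\, \nabla_\theta 1 = 0.
$$

In particular, choosing $B(s) = V^{\pi_\theta}(s)$ replaces $Q^{\pi_\theta}(s,a)$ with the advantage $A^{\pi_\theta}(s,a) = Q^{\pi_\theta}(s,a) - V^{\pi_\theta}(s)$, which keeps the gradient unbiased while reducing variance.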

Continuing with the baseline function:

image-20211228155923638

This is on-policy TD learning.

image-20211228160756929

image-20211228161327147

(BTW, in some cases these are exactly the same updates as neural Q-learning.)
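
A minimal sketch of a one-step TD actor-critic update with linear function approximation and a softmax policy (NumPy only; every name here is illustrative, not from the slides or the paper). The TD error serves as a biased but low-variance advantage estimate:

```python
import numpy as np

def softmax_probs(theta, phi_s):
    """Action probabilities of a softmax policy with per-action weights theta[a]."""
    logits = theta @ phi_s                 # shape (n_actions,)
    logits = logits - logits.max()         # numerical stability
    e = np.exp(logits)
    return e / e.sum()

def actor_critic_step(theta, w, phi_s, a, r, phi_s_next, done,
                      gamma=0.99, alpha_actor=1e-2, alpha_critic=1e-1):
    """One on-policy TD(0) actor-critic update.

    theta: (n_actions, d) policy parameters; w: (d,) critic parameters;
    phi_s, phi_s_next: (d,) state features; a: chosen action index; r: reward.
    """
    v_s = w @ phi_s
    v_next = 0.0 if done else w @ phi_s_next
    td_error = r + gamma * v_next - v_s    # delta = r + gamma*V(s') - V(s)

    # Critic: move V(s) toward the TD target.
    w = w + alpha_critic * td_error * phi_s

    # Actor: for a softmax policy, grad log pi(a|s) is phi(s) on row a minus pi(.|s) outer phi(s).
    probs = softmax_probs(theta, phi_s)
    grad_log_pi = -np.outer(probs, phi_s)
    grad_log_pi[a] += phi_s
    theta = theta + alpha_actor * td_error * grad_log_pi

    return theta, w
```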

image-20211228162335709

image-20211228163337379

image-20211228163455391

Be careful with the updates, because you are changing the policy (and therefore the distribution of the data you collect).

image-20211228164025270

image-20211228164128192

image-20211228164657920

image-20211228164712513

image-20211228165310271


Back to the paper

So the paper computes gradient steps of the following form:

image-20211228215732135

For a fixed sequence $\{(s_i, a_i)\}$ of states and actions obtained from a rollout of a given policy, we denote the empirical return starting in state $s_i$ as $q_i := \sum_{j \ge i+1} \gamma^{j-i-1}R(s_j)$.
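
A small sanity-check implementation of that return, with `rewards[j]` standing for $R(s_j)$ (illustrative code, not the paper's):

```python
def empirical_returns(rewards, gamma):
    """q_i = sum_{j >= i+1} gamma^(j-i-1) * R(s_j), computed right-to-left.

    rewards[j] is R(s_j) for j = 0..T; returns [q_0, ..., q_{T-1}].
    """
    T = len(rewards) - 1
    q = [0.0] * (T + 1)
    for i in range(T - 1, -1, -1):
        q[i] = rewards[i + 1] + gamma * q[i + 1]
    return q[:T]
```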

(OK, this full Monte Carlo return gives high variance but low bias.)

The baseline $c$ achieves close-to-optimal variance when it is set exactly equal to the state value function $V_{\pi}(s_i) = \mathbf E_{\pi}\, q_i$ for the target policy $\pi$ starting in state $s_i$.

In the modular setting, we only have one critic per task (not one per subpolicy).

The gradient for subpolicy $\pi_{b}$

image-20211228221824581
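
The image is not reproduced here; my reconstruction of the gradient it shows is roughly

$$
\nabla_{\theta_b} J \;\approx\; \sum_{\tau} \sum_{i}
\nabla_{\theta_b} \log \pi_b(a_i \mid s_i)\,\big(q_i - c_\tau(s_i)\big),
$$

where the inner sum runs only over the time steps at which subpolicy $\pi_b$ was active, and the outer sum over the tasks whose sketches contain $b$.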

For the critic, we approximate it using a network with parameters $\eta_{\tau}$.

We train it by minimising a squared error criterion, with gradients given by

image-20211228222738134
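
For reference, minimising the squared error $\tfrac12\sum_i (q_i - c_\tau(s_i))^2$ in $\eta_\tau$ gives gradients of the form

$$
\nabla_{\eta_\tau}\, \tfrac12 \sum_i \big(q_i - c_\tau(s_i)\big)^2
\;=\; -\sum_i \big(q_i - c_\tau(s_i)\big)\, \nabla_{\eta_\tau} c_\tau(s_i).
$$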

(Note that this is a minimisation, while the previous reward objective is a maximisation.)

Note that this baseline function (i.e., the critic) is $c_{\tau}(s_i)$, so it depends on both the task and the state identity.

The complete procedure for computing a single gradient step is given as follows:

image-20211228224341468

(Well, we can add more tricks to stabilise the training process.)
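
Putting the pieces together, here is a hedged pseudocode-style sketch of one such gradient step; `rollout`, `grad_log_pi`, and `grad_critic` are placeholders passed in as functions, not the paper's actual code:

```python
def single_gradient_step(tasks, subpolicies, critics,
                         rollout, grad_log_pi, grad_critic, gamma=0.9):
    """Accumulate one batch of gradients in the spirit of the procedure above.

    rollout(task, subpolicies) -> list of (symbol b, state s, action a, reward r)
    grad_log_pi(pi_b, s, a)    -> gradient of log pi_b(a|s) w.r.t. theta_b
    grad_critic(c_tau, s)      -> gradient of c_tau(s) w.r.t. eta_tau
    Returns (policy gradients to ascend, critic descent directions).
    """
    pi_grads = {b: 0.0 for b in subpolicies}
    critic_grads = {task.name: 0.0 for task in tasks}

    for task in tasks:
        traj = rollout(task, subpolicies)
        # Discounted return-to-go; with r_i the reward received after taking a_i
        # in s_i, this corresponds to the q_i defined earlier.
        q, acc = [0.0] * len(traj), 0.0
        for i in range(len(traj) - 1, -1, -1):
            acc = traj[i][3] + gamma * acc
            q[i] = acc

        critic = critics[task.name]                          # c_tau(.)
        for i, (b, s, a, _) in enumerate(traj):
            adv = q[i] - critic(s)                            # q_i - c_tau(s_i)
            pi_grads[b] = pi_grads[b] + adv * grad_log_pi(subpolicies[b], s, a)
            # (q - c) * grad c is the descent direction on the squared error 0.5*(q - c)^2.
            critic_grads[task.name] = critic_grads[task.name] + adv * grad_critic(critic, s)

    return pi_grads, critic_grads
```

A caller would then apply something like `theta_b += lr_pi * pi_grads[b]` and `eta_tau += lr_c * critic_grads[tau]`.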

Curriculum learning

image-20211228224440839
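
As I understand it, the curriculum samples tasks the agent currently does poorly on more often, and only unlocks longer sketches gradually. A hedged sketch of that idea (illustrative, not the paper's exact scheme; assumes reward estimates lie in $[0,1]$):

```python
import random

def sample_task(tasks, reward_estimates, max_sketch_len, eps=0.05):
    """Curriculum-style task sampling (illustrative).

    Only tasks whose sketch length is <= max_sketch_len are eligible; among those,
    tasks with a low estimated reward are sampled more often, with probability
    proportional to (1 - estimated reward) + eps.
    """
    eligible = [t for t in tasks if len(t.sketch) <= max_sketch_len]
    weights = [(1.0 - reward_estimates.get(t.name, 0.0)) + eps for t in eligible]
    return random.choices(eligible, weights=weights, k=1)[0]
```

`max_sketch_len` would then be increased once the average estimated reward over the currently eligible tasks passes a threshold.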

This is the part where we can use “Width”.