[TOC]

  1. Title: Discovering Hierarchical Achievements in Reinforcement Learning via Contrastive Learning
  2. Author: Seungyong Moon et al.
  3. Publish Date: 2 Nov 2023
  4. Review Date: Tue, Apr 2, 2024
  5. url: https://arxiv.org/abs/2307.03486

Summary of paper

image-20240402210833949

Contribution

Some key terms

Model-based methods and explicit hierarchy modules in previous studies are not very effective

requirements for a modern agent

model-based methods

hierarchical methods

**Big picture of the methodology**

  1. the method periodically distills relevant information on achievements from episodes to the encoder via contrastive learning.
  2. maximize the similarity in the latent space between achievements from two different episodes, matching them via optimal transport so that matched achievements share the same semantics (see the sketch below).
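As a rough illustration of step 1, the sketch below shows a generic InfoNCE-style contrastive loss over achievement embeddings that have already been matched across two episodes (e.g., by the optimal-transport step). The function name, temperature, and tensor shapes are illustrative assumptions, not the paper's exact objective.

```python
import torch as th
import torch.nn.functional as F

def matched_contrastive_loss(z_src, z_tgt, temperature=0.1):
    """InfoNCE-style loss over matched achievement embeddings.

    z_src, z_tgt: (k, d) encoder outputs for achievements from two episodes,
    already matched one-to-one. Matched pairs are positives; all other
    cross-episode pairs act as negatives.
    """
    z_src = F.normalize(z_src, dim=-1)
    z_tgt = F.normalize(z_tgt, dim=-1)
    logits = z_src @ z_tgt.t() / temperature                  # (k, k) cosine similarities
    labels = th.arange(z_src.size(0), device=z_src.device)    # positives on the diagonal
    return F.cross_entropy(logits, labels)
```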

Problem Setting and Assumptions

image-20240404122402966

bootstrapped value function

Key observation

analyzing learned latent representations of the encoder

issue

FiLM layer

$$ \mathrm{FiLM}_\theta(\phi_\theta(s_t), a_t) = (1 + \eta_\theta(a_t)) \odot \phi_\theta(s_t) + \delta_\theta(a_t), $$

where $\eta_\theta$ and $\delta_\theta$ are two-layer MLPs, each with a hidden size of 1024
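A minimal PyTorch sketch of this action-conditioned FiLM layer is below; it assumes the action is already embedded as a flat vector, and everything except the 1024 hidden size (class name, dimensions, ReLU activation) is an illustrative choice rather than the paper's exact architecture.

```python
import torch.nn as nn

class ActionFiLM(nn.Module):
    """Sketch of FiLM conditioning: scale and shift the state features
    with action-conditioned parameters eta(a_t) and delta(a_t)."""

    def __init__(self, act_dim, feat_dim, hidden=1024):
        super().__init__()
        # two-layer MLPs with a hidden size of 1024, as described above
        self.eta = nn.Sequential(nn.Linear(act_dim, hidden), nn.ReLU(), nn.Linear(hidden, feat_dim))
        self.delta = nn.Sequential(nn.Linear(act_dim, hidden), nn.ReLU(), nn.Linear(hidden, feat_dim))

    def forward(self, phi_s, a):
        # (1 + eta(a)) ⊙ phi(s) + delta(a), applied elementwise
        return (1 + self.eta(a)) * phi_s + self.delta(a)
```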

Contrastive learning for achievement distillation

image-20240404224526000

  1. Entropic Regularization ($\alpha$ term):

    • The term $\alpha \sum_{i=1}^m \sum_{j=1}^n T_{ij} \log T_{ij}$ introduces entropic regularization, which encourages the transport plan to be smoother and more spread out. This discourages the solution from being too “spiky” (i.e., putting all probability mass into a single match), which can happen in cases of high dissimilarity or ambiguity. Entropic regularization makes the optimization problem easier to solve computationally and encourages solutions that are more robust to small variations in the data.
  2. Constraints:

    • The constraints ensure that the transport plan is valid. $T \mathbf{1} \leq \mathbf{1}$ and $T^\top \mathbf{1} \leq \mathbf{1}$ ensure that no more mass is transported from an achievement than is available, and no more mass is received by an achievement than is possible. The equality constraint $\mathbf{1}^\top T^\top \mathbf{1} = \min\{m, n\}$ ensures that the total transported mass equals the minimum sequence length, acknowledging that not all achievements can or should be matched.
```python
# Match source and target goals via entropic partial optimal transport
import numpy as np
import torch as th
from ot.partial import entropic_partial_wasserstein  # POT library

# states_s, states_t: L2-normalized achievement embeddings of shape (m, d) and (n, d)
a = np.ones(len(states_s))  # unit mass per source achievement
b = np.ones(len(states_t))  # unit mass per target achievement
M = 1 - th.einsum("ik,jk->ij", states_s, states_t).cpu().numpy()  # cosine distance matrix
T = entropic_partial_wasserstein(a, b, M, reg=0.05, numItermax=100)  # soft matching plan
T = th.from_numpy(T).float()
row_inds, col_inds = th.where(T > 0.5)  # keep only confident matches
```

It seems that entropic_partial_wasserstein fully implements the soft-matching objective above, so we need to check what entropic_partial_wasserstein actually does.
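Below is a rough sketch of the kind of computation behind it, assuming the iterative Bregman-projection scheme for entropically regularized partial optimal transport (Benamou et al., 2015). This is an illustration of the idea only; the function name and defaults are made up and it is not the POT library's actual implementation.

```python
import numpy as np

def entropic_partial_ot_sketch(a, b, M, reg=0.05, m=None, num_iters=100):
    """Illustrative entropic partial OT via alternating Bregman projections."""
    if m is None:
        m = min(a.sum(), b.sum())              # total mass to transport
    K = np.exp(-M / reg)                       # Gibbs kernel from the cost matrix
    K *= m / K.sum()                           # start with total mass m
    for _ in range(num_iters):
        # rows: no source sends more mass than it has (T 1 <= a)
        K = np.minimum(a / K.sum(axis=1), 1.0)[:, None] * K
        # columns: no target receives more mass than allowed (T^T 1 <= b)
        K = K * np.minimum(b / K.sum(axis=0), 1.0)[None, :]
        # total mass: rescale so that sum(T) = m
        K *= m / K.sum()
    return K
```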

using memory

next achievement prediction through vector alignment

image-20240404231634407

image-20240404231710197

Algorithm

image-20240404232546455

image-20240404232614308

Results

For instance, our method collects iron with a probability of over 3%, which is 20 times higher than DreamerV3. This achievement is extremely challenging because iron is scarce on the map and collecting it requires wood and stone tools.

other results

limitation

Limitation - Lack of Evaluation on Transferability: A key limitation is that the transferability of the method to an unsupervised setting has not been evaluated. Specifically, it is unclear how the approach would perform when an agent operates without any predefined rewards. In traditional RL, rewards guide learning by providing feedback on the desirability of actions taken in different states; the concern is whether the method would remain effective without such guidance.