[TOC]

  1. Title: DEIR: Efficient and Robust Exploration through Discriminative-Model-Based Episodic Intrinsic Rewards
  2. Author: Shanchuan Wan et al.
  3. Publish Year: 18 May 2023
  4. Review Date: Fri, Apr 12, 2024
  5. url: https://arxiv.org/abs/2304.10770

Summary of paper

image-20240412151513920

Motivation

Contribution

Some key terms

intrinsic rewards

exploration, what is it?

mutual information term

discriminative forward model

image-20240413113530654

RND
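
As a reference point, here is a minimal sketch of the core RND (Random Network Distillation) idea: a fixed, randomly initialized target network plus a trained predictor network, with the predictor's error used as the intrinsic reward. The layer sizes and names below are illustrative assumptions, not taken from the paper.

```python
import torch
import torch.nn as nn

class RNDModule(nn.Module):
    """Minimal RND sketch: intrinsic reward = predictor's error on a fixed random target."""
    def __init__(self, obs_dim: int, feat_dim: int = 64):
        super().__init__()
        # Fixed, randomly initialized target network (never trained).
        self.target = nn.Sequential(nn.Linear(obs_dim, 128), nn.ReLU(), nn.Linear(128, feat_dim))
        for p in self.target.parameters():
            p.requires_grad_(False)
        # Predictor network, trained to match the target's outputs on visited observations.
        self.predictor = nn.Sequential(nn.Linear(obs_dim, 128), nn.ReLU(), nn.Linear(128, feat_dim))

    def intrinsic_reward(self, obs: torch.Tensor) -> torch.Tensor:
        # Novel observations are poorly predicted and therefore receive larger rewards.
        with torch.no_grad():
            target_feat = self.target(obs)
        pred_feat = self.predictor(obs)
        return ((pred_feat - target_feat) ** 2).mean(dim=-1)

# The predictor is trained by minimizing the same error on observed states, e.g.:
# loss = rnd.intrinsic_reward(obs_batch).mean(); loss.backward(); optimizer.step()
```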

In probability notation, the symbols “;” and “,” are used to distinguish between different types of conditioning in a probabilistic model, specifically in the context of conditional probability distributions. Here’s what each symbol typically represents:

  1. "," (comma): The comma is used to denote joint probabilities or to list multiple conditions in a conditional probability. It essentially represents a logical AND. For example, ( P(A, B) ) refers to the probability of both ( A ) and ( B ) occurring. In a conditional setting, ( P(X | Y, Z) ) indicates the probability of ( X ) given that both ( Y ) and ( Z ) are true.

  2. ";" (semicolon): The semicolon is used less frequently and typically separates random variables from quantities treated as parameters or fixed conditions. For example, ( P(X ; Y | Z) ) can be read as the probability of ( X ), with ( Y ) treated as a parameter or fixed condition, given ( Z ). In mutual information notation such as ( I(X ; Y | Z) ), the semicolon instead separates the two variables whose dependence is being measured. This notation isn’t standard in all statistical texts but can be seen in fields like information theory or in specific research papers where distinctions between types of conditions are crucial.

P(X;Y|Z) vs P(X,Y|Z)

In general, the use of “;” in probability notation is much rarer and less standardized than “,”, and its meaning can depend heavily on the specific context or the author’s definitions. If you encounter this in a text or a paper, it’s advisable to look for an explanation or definition provided in that document to understand exactly how the author intends it to be interpreted.

Conditional Mutual Information

https://arxiv.org/abs/1906.01824

image-20240413161950606
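
For intuition, conditional mutual information $I(X; Y \mid Z)$ can be computed directly from a joint distribution table of discrete variables. A small NumPy sketch with a made-up joint distribution (purely illustrative, not related to the paper's data):

```python
import numpy as np

def conditional_mutual_information(p_xyz: np.ndarray) -> float:
    """I(X; Y | Z) = sum_{x,y,z} p(x,y,z) * log[ p(z) p(x,y,z) / (p(x,z) p(y,z)) ]."""
    p_z = p_xyz.sum(axis=(0, 1))   # p(z)
    p_xz = p_xyz.sum(axis=1)       # p(x, z)
    p_yz = p_xyz.sum(axis=0)       # p(y, z)
    cmi = 0.0
    for x in range(p_xyz.shape[0]):
        for y in range(p_xyz.shape[1]):
            for z in range(p_xyz.shape[2]):
                p = p_xyz[x, y, z]
                if p > 0:
                    cmi += p * np.log(p_z[z] * p / (p_xz[x, z] * p_yz[y, z]))
    return cmi

# Example: a random (normalized) joint distribution over binary X, Y, Z.
rng = np.random.default_rng(0)
p_xyz = rng.random((2, 2, 2))
p_xyz /= p_xyz.sum()
print(conditional_mutual_information(p_xyz))  # >= 0 up to floating-point error
```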

Episodic Intrinsic rewards

scaling the novelty

Mutual Information Formula:

image-20240526232546902

This tells us how much the action $a_t$ affects the change in observation, measured by the distance $D_{t+1,i} = \mathrm{dist}(o_{t+1}, o_i)$, beyond what could be expected from knowing the states $s_t$ and $s_i$ alone.
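
Written out generically, this is the standard identity for conditional mutual information as an expected KL divergence between the action-conditioned and action-free distributions of the observation change (symbols matched to these notes, not copied from the paper):

$$ I\left(D_{t+1,i}\,;\,a_t \mid s_t, s_i\right) = \mathbb{E}_{s_t, s_i, a_t}\!\left[ \mathrm{D_{KL}}\!\left( p(D_{t+1,i} \mid s_t, s_i, a_t) \,\big\|\, p(D_{t+1,i} \mid s_t, s_i) \right) \right] $$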

  1. Bretagnolle–Huber Inequality: $$ \mathrm{D_{KL}}(P \,\|\, Q) \geq - \log \left(1 - \mathrm{d^2_{TV}}(P, Q) \right), $$

    A mathematical tool used here to lower-bound the KL divergence (a measure of how one probability distribution diverges from another) in terms of the total variation distance; a small numerical sanity check appears after this list.

  2. Simplified Surrogate Function: $$ \mathrm{D_{KL}}(P \,\|\, Q) \geq \log{\left( \mathrm{dist}(s_t, s_i) \right)} + \frac{\mathrm{dist}(o_{t+1}, o_i)}{\mathrm{dist}(s_t, s_i)} + \mathrm{const}. $$

    This simplifies the KL divergence using the total variation distance and distances between states and observations, providing a way to estimate the mutual information in terms of observable quantities.

  3. Final Formulation: $$ J \ge \min_{i}{\frac{\mathrm{dist}^2(o_{t+1}, o_i)}{\mathrm{dist}(s_t, s_i)}}, $$

    This provides a lower bound for the original objective, focusing on the ratio of the squared distance between the new and past observations to the distance between the corresponding states. This formulation simplifies the calculation while maintaining effectiveness.
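
A quick numerical sanity check of the Bretagnolle–Huber bound from item 1, using two small discrete distributions (illustrative only, not from the paper):

```python
import numpy as np

def kl_divergence(p: np.ndarray, q: np.ndarray) -> float:
    """D_KL(P || Q) for discrete distributions with full support."""
    return float(np.sum(p * np.log(p / q)))

def tv_distance(p: np.ndarray, q: np.ndarray) -> float:
    """Total variation distance: d_TV(P, Q) = 0.5 * sum |p - q|."""
    return 0.5 * float(np.sum(np.abs(p - q)))

rng = np.random.default_rng(1)
p = rng.random(5); p /= p.sum()
q = rng.random(5); q /= q.sum()

kl = kl_divergence(p, q)
bound = -np.log(1.0 - tv_distance(p, q) ** 2)   # Bretagnolle–Huber lower bound
assert kl >= bound - 1e-12                      # the inequality holds
print(f"D_KL = {kl:.4f} >= -log(1 - d_TV^2) = {bound:.4f}")
```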

Assumption

Let the distribution $P(x) = p(x \mid s_t, s_i, a_t)$, and assume that in deterministic environments (including POMDPs), given $s_t$, $s_i$ and $a_t$, the quantities $s_{t+1}$, $o_{t+1}$ and $D_{t+1,i}$ are uniquely determined. $P$ is thus a unit impulse function whose only non-zero value is at $D_{t+1,i}$:

$$ P(x) = \begin{cases} 1, & x = D_{t+1,i} \\ 0, & \text{otherwise} \end{cases} $$

Under this assumption, the KL divergence reduces to $-\log(Q(D_{t+1,i}))$, where $Q(D_{t+1,i}) = p(\mathrm{dist}(o_{t+1}, o_i) \mid s_t, s_i, a_t)$ is the modeled distribution of the observation distance.
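
Making this reduction explicit (loosely treating the impulse $P$ as a point mass at $D_{t+1,i}$):

$$ \mathrm{D_{KL}}(P \,\|\, Q) = \sum_{x} P(x)\, \log\frac{P(x)}{Q(x)} = 1 \cdot \log\frac{1}{Q(D_{t+1,i})} = -\log\left(Q(D_{t+1,i})\right) $$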

$Q$ is modeled as an exponential distribution whose scale is the state distance $\mathrm{dist}(s_t, s_i)$, so taking the negative logarithm of its probability density function (PDF) simplifies the expression. Here is how it works:

$$ -\log\left(Q(D_{t+1,i})\right) = -\log\left( \frac{1}{\mathrm{dist}(s_t, s_i)} \, \exp\!\left( -\frac{\mathrm{dist}(o_{t+1}, o_i)}{\mathrm{dist}(s_t, s_i)} \right) \right) $$

This simplifies, using the properties of logarithms ($-\log(ab) = -\log a - \log b$), to:

$$ -\log\left(Q(D_{t+1,i})\right) = \log\left( \mathrm{dist}(s_t, s_i) \right) + \frac{\mathrm{dist}(o_{t+1}, o_i)}{\mathrm{dist}(s_t, s_i)} $$

The term $\log(\mathrm{dist}(s_t, s_i))$ reflects the natural logarithm of the distance between states, while the other term linearly scales the observation distance by the inverse of the state distance, effectively normalizing it.
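
A quick numerical check of this decomposition, assuming $Q$ is an exponential distribution with scale $\mathrm{dist}(s_t, s_i)$ as described above (the distance values are arbitrary examples):

```python
import numpy as np
from scipy.stats import expon

dist_s = 0.8   # dist(s_t, s_i), arbitrary example value
dist_o = 1.7   # dist(o_{t+1}, o_i), arbitrary example value

# -log Q(dist_o) under an exponential distribution with scale dist(s_t, s_i) ...
neg_log_q = -expon.logpdf(dist_o, scale=dist_s)
# ... equals log(dist(s_t, s_i)) + dist(o_{t+1}, o_i) / dist(s_t, s_i).
decomposed = np.log(dist_s) + dist_o / dist_s

assert np.isclose(neg_log_q, decomposed)
print(neg_log_q, decomposed)
```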

The remaining issue is how to represent the state $s_t$, which cannot be observed directly in POMDPs.

  1. Episodic Reward: The intrinsic reward $ r_t^{{\tiny I}} $ is calculated and applied within the confines of a single episode. This means the reward is based solely on the agent’s experiences and actions from the start of the episode to the current timestep $ t $.

  2. Observations and Trajectories:

    • Embeddings of Observations ($e_{obs_t}$): These are feature representations derived from the observations $ o_t $. Embeddings reduce complex observations into a form (like vectors) that captures essential information in a more manageable size.
    • Embeddings of Trajectories ($e_{traj_t}$): These are similar feature representations but for trajectories. A trajectory at time $ t $, $ \tau_t $, is a sequence or aggregation of the observations leading up to $ t $. The embedding $ e_{traj_t} $ serves as a practical approximation of the state $ s_t $, which is especially useful when $ s_t $ cannot be directly observed (common in Partially Observable Markov Decision Processes, or POMDPs); a minimal encoder sketch follows this list.
  3. Euclidean Distance: The reward formula uses the Euclidean distance between these embeddings to quantify differences between past and current observations and trajectories.
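
A minimal sketch of how such embeddings could be produced, assuming a small MLP observation encoder and a GRU that aggregates the observation history into a trajectory embedding. The architectures, sizes, and names here are my own assumptions for illustration, not the paper's exact design:

```python
import torch
import torch.nn as nn

class EpisodicEncoders(nn.Module):
    """Illustrative encoders: e_obs from a single observation, e_traj from the history."""
    def __init__(self, obs_dim: int, emb_dim: int = 32):
        super().__init__()
        self.obs_encoder = nn.Sequential(nn.Linear(obs_dim, 64), nn.ReLU(), nn.Linear(64, emb_dim))
        # GRU over observation embeddings; its last hidden state stands in for the state s_t.
        self.traj_encoder = nn.GRU(input_size=emb_dim, hidden_size=emb_dim, batch_first=True)

    def forward(self, obs_history: torch.Tensor):
        # obs_history: (batch, T, obs_dim) -- all observations of the episode so far.
        e_obs_seq = self.obs_encoder(obs_history)   # (batch, T, emb_dim)
        _, h_n = self.traj_encoder(e_obs_seq)       # h_n: (1, batch, emb_dim)
        e_obs_t = e_obs_seq[:, -1, :]               # embedding of the latest observation
        e_traj_t = h_n.squeeze(0)                   # trajectory embedding, a proxy for s_t
        return e_obs_t, e_traj_t
```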

Intrinsic Reward Formula

The intrinsic reward at time $ t $, denoted as $ r_t^{{\tiny I}} $, is defined by:

$$ r_t^{\tiny I} = \min_{\forall i \in \left[0, t\right)} \left\lbrace \frac{ \mathrm{dist}^2 (e_{obs_i}, e_{obs_{t+1}}) }{ \mathrm{dist}(e_{traj_i}, e_{traj_t}) + \epsilon} \right\rbrace $$

Components of the Formula
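
The components (the embeddings $e_{obs}$ and $e_{traj}$ stored for every step of the current episode, the minimum over all previous steps $i$, and the small constant $\epsilon$ that keeps the denominator positive) map directly onto a short NumPy sketch. The variable names below are mine, for illustration only:

```python
import numpy as np

def episodic_intrinsic_reward(
    obs_embs: np.ndarray,    # (t, d_obs): e_obs_i for all i in [0, t)
    traj_embs: np.ndarray,   # (t, d_traj): e_traj_i for all i in [0, t)
    e_obs_next: np.ndarray,  # (d_obs,): e_obs_{t+1}
    e_traj_t: np.ndarray,    # (d_traj,): e_traj_t
    eps: float = 1e-6,
) -> float:
    """r_t^I = min_i  dist^2(e_obs_i, e_obs_{t+1}) / (dist(e_traj_i, e_traj_t) + eps)."""
    obs_dists_sq = np.sum((obs_embs - e_obs_next) ** 2, axis=1)   # squared Euclidean distances
    traj_dists = np.linalg.norm(traj_embs - e_traj_t, axis=1)     # Euclidean distances
    return float(np.min(obs_dists_sq / (traj_dists + eps)))

# Example with random embeddings for a 10-step episode:
rng = np.random.default_rng(0)
r_t = episodic_intrinsic_reward(rng.random((10, 32)), rng.random((10, 32)),
                                rng.random(32), rng.random(32))
print(r_t)
```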

Learning a discriminative model

image-20240413171251633
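
A rough sketch of one way such a discriminative forward model can be trained: the model scores whether a candidate observation is the true next observation given the trajectory embedding and the action, and is trained as a binary classifier against fake candidates (here, embeddings shuffled within the batch). This is a generic formulation under my own assumptions, not necessarily the paper's exact training scheme.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DiscriminativeForwardModel(nn.Module):
    """Scores how likely `obs_emb` is the true next observation for (traj_emb, action)."""
    def __init__(self, emb_dim: int, action_dim: int):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(emb_dim + action_dim + emb_dim, 128), nn.ReLU(),
            nn.Linear(128, 1),
        )

    def forward(self, traj_emb, action_onehot, obs_emb):
        return self.net(torch.cat([traj_emb, action_onehot, obs_emb], dim=-1)).squeeze(-1)

def discriminator_loss(model, traj_emb, action_onehot, true_next_obs_emb):
    # Fake candidates: next-observation embeddings shuffled within the batch.
    fake_next_obs_emb = true_next_obs_emb[torch.randperm(true_next_obs_emb.size(0))]
    pos_logits = model(traj_emb, action_onehot, true_next_obs_emb)
    neg_logits = model(traj_emb, action_onehot, fake_next_obs_emb)
    pos_loss = F.binary_cross_entropy_with_logits(pos_logits, torch.ones_like(pos_logits))
    neg_loss = F.binary_cross_entropy_with_logits(neg_logits, torch.zeros_like(neg_logits))
    return pos_loss + neg_loss
```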

Results

The reported results are strong: DEIR quickly learns better policies than the baselines on both standard and advanced MiniGrid exploration tasks, and the ProcGen evaluations demonstrate its generalization capability and general applicability.

https://github.com/swan-utokyo/deir