[TOC]
- Title: Deep RL With Hierarchical Action Exploration for Dialogue Generation
- Author: Itsugun Cho et al.
- Publish Year: 22 Mar 2023
- Review Date: Thu, Mar 30, 2023
- url: https://arxiv.org/pdf/2303.13465v1.pdf
Summary of paper
Motivation
- Approximate dynamic programming applied to dialogue generation involves policy improvement with action sampling. However, such a practice is inefficient for reinforcement learning because eligible (high action-value) responses are very sparse, and the greedy policy sustained by random sampling is weak (see the sketch below).
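A minimal sketch of this sampling-based greedy policy improvement, assuming a hypothetical response generator and learned critic; `generator.sample` and `q_value` are illustrative names, not the paper's API:

```python
# A minimal sketch (not the paper's code) of greedy policy improvement by
# sampling candidate responses and taking the arg-max of a learned Q-value.
# `generator.sample` and `q_value` are hypothetical stand-ins for a response
# generator (e.g. a fine-tuned language model) and a learned critic.

def greedy_response(context, generator, q_value, num_samples=16):
    """Approximate the arg-max over an intractable response space by sampling."""
    candidates = [generator.sample(context) for _ in range(num_samples)]
    # When high-value responses are sparse, a small num_samples rarely
    # contains an eligible action, so the resulting greedy policy is weak.
    return max(candidates, key=lambda resp: q_value(context, resp))
```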
Contribution
- the paper shows, both theoretically and experimentally, that the performance of the dialogue policy is positively correlated with the sampling size.
- the authors introduce a novel dual-granularity Q-function to alleviate this limitation by exploring the most promising response category to intervene in the sampling.
Some key terms
limitation of the maximum likelihood estimation (MLE) objective for the probability distribution of responses
- However, this supervised technique is insufficient for learning long-term behaviour, since the corpus often contains suboptimal dialogues and MLE cannot model the future direction of the conversation.
- if we instead view open-domain dialogue as a control problem, frameworks such as reinforcement learning (RL) allow agents to automatically adjust their policy with respect to a pre-defined appraisal (reward) function via a trial-and-error process.
word generation based on the elevated abstraction category
- if we know which abstract category of actions obtains a higher Q-value, then generating responses from that category for the greedy policy makes training more efficient.
- the authors design a coarse-grained Q-function over category-represented responses, aiming to lock onto the optimal category, and a fine-grained Q-function over token-represented responses, striving to extract the optimal action.
- in this way, the infinite action space is divided into several blocks at a high level of abstraction, so the policy can traverse the entire action space and adapt on the fly (a minimal sketch follows below).
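A minimal sketch, under assumed names, of how the coarse- and fine-grained Q-functions might interact; `coarse_q`, `fine_q`, and `generator.sample_from_category` are illustrative assumptions, not the paper's implementation:

```python
# Sketch of dual-granularity selection: a coarse-grained Q-function ranks
# abstract response categories, then candidates are sampled only from the
# most promising category and ranked by a fine-grained, token-level
# Q-function. All names here are illustrative, not the paper's API.

def dual_granularity_response(context, categories, generator,
                              coarse_q, fine_q, num_samples=16):
    # Coarse level: score every abstract category and lock onto the best one.
    best_category = max(categories, key=lambda c: coarse_q(context, c))
    # Fine level: sample candidates from that category only, so sampling
    # effort is not wasted on low-value regions of the action space.
    candidates = [generator.sample_from_category(context, best_category)
                  for _ in range(num_samples)]
    return max(candidates, key=lambda resp: fine_q(context, resp))
```

Locking the category first shrinks the effective action space per step, which is the intervention in the sampling described in the contribution above.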
four reward functions
- the cosine similarity between the agent's response and a set of dull responses (e.g., "I don't know"), used to discourage such replies; expressions that lack emotional engagement may limit the development of the dialogue.
- the expression of surprise emotion (derived from the training dataset, i.e., the mood of the human user)
- the length of responses
- asking questions (encourages the agent to ask questions); a hedged sketch combining the four rewards follows below.
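A hedged sketch of how the four reward signals could be combined; the exact formulas and weights are not taken from the paper, and `embed`, `detect_surprise`, and the weight values are illustrative assumptions:

```python
# Illustrative combination of the four reward signals described above.
# The formulas and weights are assumptions, not reproduced from the paper.
import numpy as np

DULL_RESPONSES = ["I don't know.", "I'm not sure.", "I have no idea."]

def cosine(u, v):
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v) + 1e-8))

def reward(response, embed, detect_surprise, weights=(1.0, 1.0, 0.1, 1.0)):
    # 1) Penalise similarity to dull, disengaged responses.
    r_dull = -max(cosine(embed(response), embed(d)) for d in DULL_RESPONSES)
    # 2) Reward expressions of surprise emotion (detector is a stand-in).
    r_surprise = 1.0 if detect_surprise(response) else 0.0
    # 3) Reward longer responses (token count as a crude proxy).
    r_length = float(len(response.split()))
    # 4) Reward asking questions.
    r_question = 1.0 if "?" in response else 0.0
    w = weights
    return w[0] * r_dull + w[1] * r_surprise + w[2] * r_length + w[3] * r_question
```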
Potential future work
- The paper showed that a coarse-grained approach is more efficient.