Publications
- (preprint) The Dark Side of Rich Rewards: Understanding and Mitigating Noise in VLM Rewards. Sukai Huang, Nir Lipovetzky and Trevor Cohn. arXiv ePrint, 2024.
While Vision-Language Models (VLMs) are increasingly used to generate reward signals for training embodied agents to follow instructions, our research reveals that agents guided by VLM rewards often underperform compared to those employing only intrinsic (exploration-driven) rewards, contradicting expectations set by recent work. We hypothesize that false positive rewards -- instances where unintended trajectories are incorrectly rewarded -- are more detrimental than false negatives. Our analysis confirms this hypothesis, revealing that the widely used cosine similarity metric is prone to false positive reward estimates. To address this, we introduce BiMI (Binary Mutual Information), a novel reward function designed to mitigate noise. BiMI significantly enhances learning efficiency across diverse and challenging embodied navigation environments. Our findings offer a nuanced understanding of how different types of reward noise impact agent learning and highlight the importance of addressing multimodal reward signal noise when training embodied agents.
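A minimal sketch of the failure mode discussed above, assuming the VLM reward is computed as cosine similarity between a trajectory embedding and an instruction embedding. The thresholded variant is only an illustrative stand-in for the noise-mitigation idea; the actual BiMI objective, the function names, and the threshold value here are not from the paper.

```python
import numpy as np

def cosine_reward(traj_emb: np.ndarray, instr_emb: np.ndarray) -> float:
    """Dense VLM reward: cosine similarity between trajectory and instruction
    embeddings. Unintended but loosely related trajectories can still score
    moderately high, producing the false positive rewards described above."""
    return float(np.dot(traj_emb, instr_emb) /
                 (np.linalg.norm(traj_emb) * np.linalg.norm(instr_emb)))

def binarised_reward(traj_emb: np.ndarray, instr_emb: np.ndarray,
                     threshold: float = 0.9) -> float:
    """Illustrative noise-mitigating variant (assumption, not the BiMI
    formulation): only fire the reward when the match is confident, trading
    dense shaping for fewer false positives."""
    return 1.0 if cosine_reward(traj_emb, instr_emb) >= threshold else 0.0
```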
- (preprint) Planning in the Dark: LLM-Symbolic Planning Pipeline without Experts. Sukai Huang, Nir Lipovetzky and Trevor Cohn. arXiv ePrint, 2024.
Large Language Models (LLMs) have shown promise in solving natural language-described planning tasks, but their direct use often leads to inconsistent reasoning and hallucination. While hybrid LLM-symbolic planning pipelines have emerged as a more robust alternative, they typically require extensive expert intervention to refine and validate generated action schemas. This not only limits scalability but also introduces the potential for biased interpretation, as a single expert's reading of an ambiguous natural language description might not align with the user's actual intent. To address this, we propose a novel approach that constructs an action schema library containing multiple candidates, accounting for the diverse possible interpretations of natural language descriptions. We further introduce a semantic validation and ranking module that automatically filters and ranks the generated schemas and plans without an expert in the loop. Our experiments show that this pipeline maintains superior planning performance over the direct LLM planning approach. These findings demonstrate the feasibility of a fully automated end-to-end LLM-symbolic planner that requires no expert intervention, opening up AI planning to a broader audience with fewer prerequisites in domain expertise.
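A rough sketch of the expert-free pipeline shape described above: sample several candidate action schemas so that different readings of an ambiguous description are all represented, then filter and rank them automatically. The helpers `generate_schema` and `semantic_score` are hypothetical placeholders for the LLM call and the paper's semantic validation module, not the authors' actual API.

```python
from typing import Callable, List, Tuple

def build_schema_library(nl_description: str,
                         generate_schema: Callable[[str, int], str],
                         n_candidates: int = 5) -> List[str]:
    """Query the LLM several times (e.g. with different seeds or temperatures)
    so that distinct interpretations of the description each yield a candidate
    action schema for the library."""
    return [generate_schema(nl_description, seed) for seed in range(n_candidates)]

def rank_candidates(candidates: List[str],
                    semantic_score: Callable[[str], float]) -> List[Tuple[float, str]]:
    """Drop candidates that fail the semantic validity check (score <= 0),
    then rank the rest by score instead of asking an expert to pick one."""
    scored = [(semantic_score(c), c) for c in candidates]
    return sorted(((s, c) for s, c in scored if s > 0.0), reverse=True)
```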
- (preprint) A Reminder of its Brittleness: Language Reward Shaping May Hinder Learning for Instruction Following Agents. Sukai Huang, Nir Lipovetzky and Trevor Cohn. arXiv ePrint, 2023.
Teaching agents to follow complex written instructions has been an important yet elusive goal. One technique for improving learning efficiency is language reward shaping (LRS), which is used in reinforcement learning (RL) to reward actions that represent progress towards a sparse reward. We argue that the apparent success of LRS is brittle, and that prior positive findings can be attributed to weak RL baselines. Specifically, we identified suboptimal LRS designs that reward partially matched trajectories, and we characterised a novel type of reward perturbation, based on the concept of loosening task constraints, that captures this issue. We provided theoretical and empirical evidence that agents trained using LRS rewards converge more slowly than pure RL agents.
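An illustrative sketch of the brittleness discussed above, assuming trajectories and instructions are represented as sequences of step labels: a shaping reward that credits any partial match effectively loosens the task constraints, whereas the pure-RL baseline only pays out on true completion. Function names and the matching rule are hypothetical, not the paper's exact designs.

```python
from typing import List

def lrs_partial_match_reward(trajectory: List[str],
                             instruction_steps: List[str]) -> float:
    """Suboptimal LRS design: reward is proportional to how many instruction
    steps appear anywhere in the trajectory, regardless of order or of whether
    the task constraints are actually satisfied, so partially matched
    trajectories still collect shaping reward."""
    hits = sum(step in trajectory for step in instruction_steps)
    return hits / len(instruction_steps)

def sparse_task_reward(task_completed: bool) -> float:
    """The pure-RL baseline signal: reward only on genuine task completion."""
    return 1.0 if task_completed else 0.0
```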
- (honours thesis) Angry Birds Level Generation Using Walkthrough Descriptions. Sukai Huang. For the degree of Bachelor of Advanced Computing (Honours) at The Australian National University.
Angry Birds is a well-known environment for agents to learn physical reasoning. However, deep reinforcement learning agents often underperform due to the lack of a training set of game levels. To address this issue, procedural level generation is used to synthesise new Angry Birds game levels. The current rule-based Angry Birds procedural level generator, however, is incapable of producing game levels that help agents learn physical reasoning, as it cannot guarantee the level of physical reasoning required to solve the generated levels. Hence, in a new approach, we use walkthrough descriptions to generate Angry Birds game levels and train a Generative Adversarial Network (GAN) based procedural level generator by imitating high-quality handcrafted levels. Unlike the conventional imitation approach, the proposed approach is able to control the style of the generated game levels and to enhance the diversity of the game level dataset by manipulating the input walkthrough descriptions. Both qualitative and quantitative evaluations demonstrate that the game levels generated with this method demand a high level of physical reasoning to solve, just like the handcrafted game levels. In addition, we developed a new Angry Birds walkthrough dataset called AbVat, a valuable dataset capable of facilitating a variety of meaningful research tasks in the domain of spatial-temporal understanding and reasoning.
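A toy sketch of the text-conditioned generation idea above: the generator consumes an embedding of the walkthrough description alongside the usual noise vector, so editing the description steers the style of the generated level. The layer sizes, the flat level encoding, and the class name are illustrative assumptions, not the architecture used in the thesis.

```python
import torch
import torch.nn as nn

class ConditionalLevelGenerator(nn.Module):
    """Toy text-conditioned GAN generator: concatenates a noise vector with a
    walkthrough-description embedding and maps the result to a flat level
    encoding. Manipulating the description embedding changes the output style."""

    def __init__(self, noise_dim: int = 64, text_dim: int = 128, level_cells: int = 256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(noise_dim + text_dim, 256),
            nn.ReLU(),
            nn.Linear(256, level_cells),
            nn.Tanh(),
        )

    def forward(self, noise: torch.Tensor, text_emb: torch.Tensor) -> torch.Tensor:
        # Condition the generator on the description by concatenating inputs.
        return self.net(torch.cat([noise, text_emb], dim=-1))
```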