[TOC]
- Title: Reinforcement Learning With Constrained Uncertain Reward Function Through Particle Filtering
- Author: Oguzhan Dogru et al.
- Publish Year: July 2022
- Review Date: Sat, Dec 24, 2022
Summary of paper
Motivation
- this study considers a type of uncertainty that is caused by the sensors utilised for the reward function. When the noise is Gaussian and the system is linear, a Kalman filter is the classical estimator; particle filtering relaxes these assumptions.
Contribution
- this work uses a “particle filtering” technique to estimate the true reward function from perturbed, discretely sampled reward points (a minimal sketch is given below).
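To make the idea concrete, below is a minimal bootstrap particle filter sketch for denoising a scalar reward signal. It assumes a random-walk model for the true reward and additive observation noise; the function name, noise models, and parameters are illustrative assumptions for this note, not the paper’s exact formulation.

```python
import numpy as np

def particle_filter_reward(observed_rewards, n_particles=1000,
                           process_std=0.05, obs_std=0.5, seed=0):
    """Estimate a latent 'true' reward sequence from noisy observations.

    Bootstrap particle filter with a random-walk state model:
        r_t = r_{t-1} + process noise,   y_t = r_t + observation noise.
    All names and noise models here are illustrative assumptions.
    """
    rng = np.random.default_rng(seed)
    # Initialise particles around the first observation.
    particles = observed_rewards[0] + obs_std * rng.standard_normal(n_particles)
    weights = np.full(n_particles, 1.0 / n_particles)
    estimates = []

    for y in observed_rewards:
        # Predict: propagate particles through the random-walk model.
        particles = particles + process_std * rng.standard_normal(n_particles)
        # Update: weight particles by the Gaussian observation likelihood.
        weights *= np.exp(-0.5 * ((y - particles) / obs_std) ** 2)
        weights += 1e-300                      # avoid numerical underflow
        weights /= weights.sum()
        # Estimate the true reward as the weighted particle mean.
        estimates.append(np.sum(weights * particles))
        # Systematic resampling when the effective sample size drops.
        if 1.0 / np.sum(weights ** 2) < n_particles / 2:
            positions = (rng.random() + np.arange(n_particles)) / n_particles
            indices = np.searchsorted(np.cumsum(weights), positions)
            particles = particles[indices]
            weights = np.full(n_particles, 1.0 / n_particles)

    return np.array(estimates)

# Example: recover a slowly varying reward corrupted by heavy-tailed noise.
if __name__ == "__main__":
    rng = np.random.default_rng(1)
    true_r = np.sin(np.linspace(0, 3 * np.pi, 200))
    noisy_r = true_r + 0.5 * rng.standard_t(df=3, size=200)
    filtered_r = particle_filter_reward(noisy_r)
    print("MAE noisy:   ", np.mean(np.abs(noisy_r - true_r)))
    print("MAE filtered:", np.mean(np.abs(filtered_r - true_r)))
```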
Some key terms
Good things about the paper (one paragraph)
Major comments
Citation
- complex control problems today are often solved by data-driven, black-box methods, because SOTA techniques have made it possible to process high-dimensional data in real time. Despite the practicality of these techniques, this data-driven era has its own challenge: data reliability.
- ref: Dogru, Oguzhan, Ranjith Chiplunkar, and Biao Huang. “Reinforcement learning with constrained uncertain reward function through particle filtering.” IEEE Transactions on Industrial Electronics 69.7 (2021): 7491-7499.
- learning may be viewed as solving an optimisation problem with an associated reward function.
- ref: Dogru, Oguzhan, Ranjith Chiplunkar, and Biao Huang. “Reinforcement learning with constrained uncertain reward function through particle filtering.” IEEE Transactions on Industrial Electronics 69.7 (2021): 7491-7499.
- Although uncertainty in the reward has been reported to degrade model/controller performance, no empirical analysis has been conducted on the RL algorithm’s tolerance for reward perturbations.
- ref: J. Wang, Y. Liu, and B. Li, “Reinforcement learning with perturbed rewards,” in Proc. AAAI Conf. Artif. Intell., 2020, vol. 34, pp. 6202–6209
Limitation of the experiment setting
- This particle filtering technique is not applicable to sparse-reward settings. Moreover, the noise-filtering step requires additional simulation steps to generate an estimate of the real reward, which makes RL training even less sample efficient.