- Session 7: Reinforcement Learning -- Day 3 (Nov.19), poster session: 11:30-14:00, talks: 15:55-17:10 (5th floor Hall 1)
- Poster number: Tue37
Zichuan Lin (Tsinghua University); Li Zhao (Microsoft Research); Jiang Bian (Microsoft Research); Tao Qin (Microsoft Research Asia); Guangwen Yang (Tsinghua University)
Recent years have witnessed significant progress in solving challenging problems across various domains using deep reinforcement learning (RL). Despite this success, weak robustness has emerged as a major obstacle to applying existing RL algorithms to real-world problems. In this paper, we propose unified policy optimization (UPO), a sample-efficient shared-policy framework that allows a policy to update itself by considering the different gradients generated by different policy gradient (PG) methods. Specifically, we propose two algorithms, UPO-MAB and UPO-ES, which leverage these gradients by adopting ideas from multi-armed bandits (MAB) and evolution strategies (ES), respectively, with the aim of finding the gradient direction that yields greater performance gain at lower additional data cost. Extensive experiments show that our approach achieves stronger robustness and better performance than the baselines.
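
To make the bandit idea concrete, here is a minimal illustrative sketch. It is not the authors' UPO-MAB algorithm: the abstract does not specify the bandit rule, the reward definition, or the PG methods involved, so everything below is an assumption. The sketch uses a toy quadratic objective in place of an RL return, three hypothetical gradient estimators (`accurate`, `noisy`, `biased`) as stand-ins for different PG methods, UCB1 as the bandit rule, and the squashed one-step improvement as the bandit reward.

```python
import math
import random


def objective(theta):
    # Toy "performance" to maximize (assumption); optimum at theta = (1, -2).
    return -((theta[0] - 1.0) ** 2 + (theta[1] + 2.0) ** 2)


def true_grad(theta):
    # Exact gradient of the toy objective.
    return [-2.0 * (theta[0] - 1.0), -2.0 * (theta[1] + 2.0)]


def make_estimators(rng):
    # Hypothetical stand-ins for different PG methods (not from the paper).
    def accurate(theta):
        return [g + rng.gauss(0.0, 0.1) for g in true_grad(theta)]

    def noisy(theta):
        return [g + rng.gauss(0.0, 3.0) for g in true_grad(theta)]

    def biased(theta):
        return [0.2 * g + 1.0 for g in true_grad(theta)]

    return [accurate, noisy, biased]


def ucb1_select(counts, means, t, c=1.0):
    # Play every arm once, then pick the arm with the highest UCB1 score.
    for arm, n in enumerate(counts):
        if n == 0:
            return arm
    scores = [
        means[a] + c * math.sqrt(2.0 * math.log(t) / counts[a])
        for a in range(len(counts))
    ]
    return max(range(len(counts)), key=scores.__getitem__)


def run_bandit_updates(steps=200, lr=0.05, seed=0):
    rng = random.Random(seed)
    estimators = make_estimators(rng)
    theta = [5.0, 5.0]  # shared parameters updated by whichever arm is chosen
    counts = [0] * len(estimators)
    means = [0.0] * len(estimators)
    start = objective(theta)
    for t in range(1, steps + 1):
        arm = ucb1_select(counts, means, t)
        grad = estimators[arm](theta)
        candidate = [p + lr * g for p, g in zip(theta, grad)]
        # Bandit reward (assumption): one-step improvement, squashed to (-1, 1).
        reward = math.tanh(objective(candidate) - objective(theta))
        counts[arm] += 1
        means[arm] += (reward - means[arm]) / counts[arm]
        theta = candidate  # the shared parameters always take the chosen step
    return theta, counts, start, objective(theta)


if __name__ == "__main__":
    theta, counts, start, end = run_bandit_updates()
    print("pulls per estimator:", counts)
    print("objective: %.3f -> %.3f" % (start, end))
```

In a real RL setting the one-step improvement would itself have to be estimated from rollouts, which is where the extra data cost mentioned in the abstract would arise; this sketch sidesteps that by evaluating the toy objective exactly.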