Geometric Value Iteration: Dynamic Error-Aware KL Regularization for Reinforcement Learning

Toshinori Kitamura (NAIST)*; Lingwei Zhu (NAIST); Takamitsu Matsubara (NAIST)

Abstract

The recent boom in the entropy-regularized reinforcement learning literature reveals that Kullback-Leibler (KL) regularization benefits Reinforcement Learning (RL) algorithms by canceling out errors under mild assumptions. However, existing analyses focus on fixed regularization with a constant weighting coefficient and do not consider the case where the coefficient is allowed to change dynamically. In this paper, we study the dynamic-coefficient scheme and present the first asymptotic error bound for it. Based on this bound, we propose an effective scheme that tunes the coefficient according to the magnitude of the error, in favor of more robust learning. Building on this development, we propose a novel algorithm, Geometric Value Iteration (GVI), which features a dynamic, error-aware KL coefficient design aimed at mitigating the impact of errors on performance. Our experiments demonstrate that GVI effectively exploits the tradeoff between learning speed and robustness, compared with the uniform error averaging induced by a constant KL coefficient. The combination of GVI and deep networks shows stable learning behavior even in the absence of a target network, a setting in which standard value iteration algorithms such as DQN oscillate greatly or even fail to converge.
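
To make the setting concrete, the sketch below runs tabular KL-regularized value iteration with an iteration-dependent coefficient lambda_k. It is a minimal illustration of the general scheme the abstract refers to, not the paper's exact GVI update; the function name kl_regularized_vi, the callable lam_schedule, and the inputs P and r are our own notation for exposition, and the error-aware rule for choosing lambda_k (the paper's contribution) is left abstract behind the schedule argument.

    import numpy as np

    def kl_regularized_vi(P, r, lam_schedule, gamma=0.9, n_iter=300):
        """Tabular KL-regularized value iteration with a per-iteration
        KL coefficient lam_k (illustrative sketch, not the paper's exact GVI).

        P: (S, A, S) transition probabilities, r: (S, A) reward table,
        lam_schedule: callable mapping iteration k to lam_k > 0.
        """
        S, A, _ = P.shape
        Q = np.zeros((S, A))
        pi = np.full((S, A), 1.0 / A)          # reference policy pi_k (uniform start)
        for k in range(n_iter):
            lam = lam_schedule(k)
            # KL-greedy step: pi_{k+1}(a|s) is proportional to pi_k(a|s) * exp(Q(s,a) / lam)
            m = Q.max(axis=1, keepdims=True)   # subtract max for numerical stability
            w = pi * np.exp((Q - m) / lam)
            # Soft state value: V(s) = lam * log sum_a pi_k(a|s) exp(Q(s,a) / lam)
            V = (m + lam * np.log(w.sum(axis=1, keepdims=True))).squeeze(1)
            pi = w / w.sum(axis=1, keepdims=True)
            # KL-regularized Bellman update
            Q = r + gamma * (P @ V)
        return Q, pi

    # Usage: a constant schedule recovers fixed-coefficient KL regularization,
    # whose error-averaging behavior the existing analyses cover; an error-aware
    # schedule would instead make lam_k depend on an estimate of the current
    # error magnitude, as studied in the paper.
    # Q, pi = kl_regularized_vi(P, r, lam_schedule=lambda k: 1.0)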