Accepted Paper: Model-Based Reinforcement Learning Exploiting State-Action Equivalence

Authors

Mahsa Asadi (Inria); Mohammad Sadegh Talebi (Inria); Hippolyte Bourel (ENS Rennes); Odalric Maillard (Inria)

Abstract

Leveraging an equivalence property on the state-space of a Markov Decision Process (MDP) has been investigated in several studies. This paper studies equivalence structure in the reinforcement learning (RL) setup, where transition distributions are no longer assumed to be known. We present a notion of similarity between transition probabilities of various state-action pairs of an MDP, which naturally defines an equivalence structure in the state-action space. We present equivalence-aware confidence bounds for the case where the learner knows the underlying structure in advance. These bounds are provably tighter than their corresponding equivalence-oblivious confidence bounds. For the more challenging case of unknown equivalence structure, we present an algorithm called \ClusteringAlgo\ that seeks to find an (approximate) equivalence structure, and we define confidence bounds using this approximate equivalence. To illustrate the efficacy of the presented confidence bounds, we present \CUCRL\ as a natural modification of \UCRL\ for RL in undiscounted MDPs. In the case of known equivalence structure, we show that \CUCRL\ improves over \UCRL\ in terms of \emph{regret} by a factor $\sqrt{SA/C}$, in any communicating MDP with $S$ states, $A$ actions, and $C$ classes, which corresponds to a massive improvement when $C\ll SA$. To the best of our knowledge, this is the first work providing regret bounds for RL when an equivalence structure of the MDP is efficiently exploited. In the case of unknown equivalence structure, we show through numerical experiments that \CUCRL\ combined with \ClusteringAlgo\ outperforms \UCRL\ in ergodic MDPs.
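
The gain from equivalence-aware confidence bounds can be illustrated with a minimal sketch: pooling observations across state-action pairs that belong to the same class increases the effective sample count, which shrinks the width of the confidence set around the empirical transition distribution. The snippet below is illustrative only; it uses a simplified Weissman/Hoeffding-style $L_1$ width, and the exact constants and the profile-mapping construction used in the paper may differ.

```python
import numpy as np

def l1_confidence_width(n_samples, n_states, delta=0.05):
    """Illustrative Weissman/Hoeffding-style L1 confidence width for an
    empirical transition distribution built from n_samples observations.
    (Simplified form; the paper's exact bound may use different constants.)"""
    return np.sqrt(2 * n_states * np.log(2 / delta) / max(n_samples, 1))

# Hypothetical setting: 6 equivalent state-action pairs, each visited 50 times,
# in an MDP with 10 states.
n_states, visits_per_pair, class_size = 10, 50, 6

# Equivalence-oblivious: each state-action pair uses only its own samples.
oblivious_width = l1_confidence_width(visits_per_pair, n_states)

# Equivalence-aware: pairs in the same class share their samples, so the
# effective count is the sum over the class, tightening the bound by ~sqrt(class_size).
aware_width = l1_confidence_width(visits_per_pair * class_size, n_states)

print(f"equivalence-oblivious width: {oblivious_width:.3f}")
print(f"equivalence-aware width:     {aware_width:.3f}")
```

Aggregated over all $SA$ pairs grouped into $C$ classes, this per-pair tightening is what drives the $\sqrt{SA/C}$ regret improvement stated in the abstract.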