Wed, Mar 31, 2021 @ 03:00 PM - 04:00 PM
Thomas Lord Department of Computer Science
Safe Reinforcement Learning via Offline Learning
Reinforcement Learning (RL) is a general learning paradigm for solving sequential decision-making problems. Such problems are often modeled as Markov Decision Processes (MDPs) or Partially Observable Markov Decision Processes (POMDPs). Reinforcement learning aims to learn policies that maximize the expected accumulated reward under unknown dynamics or transition probabilities. Deep reinforcement learning (DRL) refers to using deep neural networks as general function approximators when applying RL algorithms.
Despite the recent success of RL algorithms in robotics and games (e.g. AlphaGo), they pose particular challenges when applied to real-world settings.
First, they often require substantial exploration to achieve reasonable performance; such exploration is either too expensive (e.g. gathering data in the real world takes time) or forbidden due to safety constraints.
This limits RL algorithms to scenarios where an accurate simulator is available.
In this proposal, we focus on developing reinforcement learning algorithms that can ensure safety during the training phase and the deployment phase. We argue that by leveraging offline learning from a static dataset collected by existing safe policies, safety can be guaranteed.
However, standard off-policy RL algorithms are prone to overestimating the values of out-of-distribution (OOD) actions. This may cause the learned policies to visit unexplored and unsafe states during deployment. To mitigate this issue, we first show mathematically that constraining the learned policies to the support set of the offline datasets keeps the state distribution of the learned policy within that support set as well; hence safety is guaranteed.
To constrain the learned policies within the support set, we propose i) distribution matching, and ii) model-based detection of generalization to OOD actions.
We improve on existing state-of-the-art behavior-regularization approaches and propose BRAC+: Improved Behavior Regularized Actor Critic. It makes two key improvements: an analytical upper bound on the KL divergence, used as the behavior regularizer to reduce the variance associated with sample-based estimates, and a gradient-penalized Q update that avoids out-of-distribution (OOD) actions caused by the unbounded gradient of the Q value with respect to those actions.
Distribution matching is too conservative when the dataset is diverse enough that the outcomes of OOD actions can be correctly predicted. For this case, we propose learning the inverse dynamics model as a variational auto-encoder alongside the forward dynamics model, and detecting generalization to OOD actions by the agreement between the two models.
Our approach will be evaluated on several benchmarks as well as a simulated building HVAC control testbed. We will gauge the success of our work by i) whether the safety criteria are met, and ii) the performance improvement over the existing safe policies used to collect the dataset.
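As an illustration only, the agreement check between the forward and inverse dynamics models mentioned in the abstract could be sketched as follows. This is one possible reading, not the method from the proposal: the toy dynamics, the nearest-neighbor stand-ins for the learned models (the abstract uses a variational auto-encoder for the inverse model), the threshold, and all function names are assumptions. The forward model predicts the next state for a candidate action, the inverse model recovers an action from that predicted transition, and a large gap between the candidate and recovered actions flags the action as OOD.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical offline dataset: 1-D states, actions limited to [-0.5, 0.5].
n = 2000
states = rng.uniform(-1.0, 1.0, size=n)
actions = rng.uniform(-0.5, 0.5, size=n)
next_states = states + np.sin(actions)  # toy dynamics, assumed for illustration

def nearest(keys, query):
    """Index of the row in `keys` closest to `query` (Euclidean distance)."""
    return int(np.argmin(np.sum((keys - query) ** 2, axis=1)))

# Nearest-neighbor lookups stand in for learned models (an assumption).
forward_keys = np.stack([states, actions], axis=1)
inverse_keys = np.stack([states, next_states], axis=1)

def forward_model(s, a):
    """Predict the next state for (s, a) from the closest dataset sample."""
    return next_states[nearest(forward_keys, np.array([s, a]))]

def inverse_model(s, s_next):
    """Recover the action for (s, s_next) from the closest dataset sample."""
    return actions[nearest(inverse_keys, np.array([s, s_next]))]

def disagreement(s, a):
    """Gap between the candidate action and the action the inverse model
    recovers from the forward model's predicted transition."""
    s_next = forward_model(s, a)
    a_rec = inverse_model(s, s_next)
    return abs(a_rec - a)

def is_ood(s, a, threshold=0.3):
    """Flag the action as OOD when the two models disagree too much."""
    return disagreement(s, a) > threshold

if __name__ == "__main__":
    print("in-support action:", disagreement(0.0, 0.2), is_ood(0.0, 0.2))
    print("out-of-support action:", disagreement(0.0, 3.0), is_ood(0.0, 3.0))
```

In this toy setup the check relies on both stand-in models being unable to extrapolate beyond the data: for an action far outside the dataset's range, the recovered action stays inside that range, so the gap is large. Models that extrapolate consistently with each other would agree even on OOD actions, so the choice of model class matters for this kind of test.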
WebCast Link: https://usc.zoom.us/j/2488070010
Audiences: Everyone Is Invited
Contact: Lizsl De Leon