The Berkeley Artificial Intelligence Research Blog (The BAIR Blog)
https://bairblog.github.io/

Physically Realistic Attacks on Deep Reinforcement Learning
<meta name="twitter:title" content="Physically Realistic Attacks on Deep Reinforcement Learning" />
<meta name="twitter:card" content="summary_large_image" />
<meta name="twitter:image" content="https://bair.berkeley.edu/static/blog/attacks/04_cycle.png" />
<div class="videoWrapper">
<iframe src="https://www.youtube.com/embed/XPFQ9TBvtCE?rel=0" frameborder="0" allowfullscreen=""></iframe>
</div>
<p><br /></p>
<p>Deep reinforcement learning (RL) has achieved superhuman performance in
problems ranging from <a href="https://deepmind.com/blog/article/safety-first-ai-autonomous-data-centre-cooling-and-industrial-control">data center cooling</a> to <a href="https://openai.com/blog/openai-five/">video games</a>. RL policies
may soon be widely deployed, with research underway in <a href="http://proceedings.mlr.press/v78/dosovitskiy17a/dosovitskiy17a.pdf">autonomous driving</a>,
<a href="https://arxiv.org/abs/1706.05125">negotiation</a> and <a href="https://www.ft.com/content/16b8ffb6-7161-11e7-aca6-c6bd07df1a3c">automated trading</a>. Many potential applications are
safety-critical: automated trading failures caused <a href="https://www.sec.gov/litigation/admin/2013/34-70694.pdf">Knight Capital to lose
USD 460M</a>, while faulty autonomous vehicles have resulted in <a href="https://www.ntsb.gov/investigations/AccidentReports/Pages/HWY16FH018-preliminary.aspx">loss</a> of
<a href="https://en.wikipedia.org/wiki/Death_of_Elaine_Herzberg">life</a>.</p>
<p>Consequently, it is critical that RL policies are robust: both to naturally
occurring distribution shift, and to malicious attacks by adversaries.
Unfortunately, we find that RL policies that perform at a high level in normal
situations can harbor serious vulnerabilities that an adversary can
exploit.</p>
<!--more-->
<p style="text-align:center;">
<img src="https://bair.berkeley.edu/static/blog/attacks/00_prior_work.png" width="" /><br />
</p>
<p><a href="https://arxiv.org/abs/1702.02284">Prior</a> <a href="https://arxiv.org/abs/1705.06452">work</a> has shown deep RL policies are vulnerable to small
adversarial perturbations to their observations, similar to <a href="https://arxiv.org/abs/1312.6199">adversarial
examples</a> in image classifiers. This threat model assumes the adversary can
directly modify the victim’s sensory observation. Such low-level access is
rarely possible. For example, an autonomous vehicle’s camera image can be
influenced by other drivers, but only to a limited extent. Other drivers cannot
add noise to arbitrary pixels, or make a building disappear.</p>
<p style="text-align:center;">
<img src="https://bair.berkeley.edu/static/blog/attacks/01_multi_agent.png" width="500" /><br />
</p>
<p>By contrast, we model the victim and adversary as agents in a shared
environment. The adversary can take a similar set of actions to the victim.
These actions may indirectly change the observations the victim sees, but only
in a physically realistic fashion.</p>
<p>Note that if the victim policy were to play a <a href="https://en.wikipedia.org/wiki/Nash_equilibrium">Nash equilibrium</a>, it would
not be exploitable by an adversary. We therefore focus on attacking victim
policies trained via <a href="http://proceedings.mlr.press/v37/heinrich15.pdf">self-play</a>, a popular method that approximates Nash
equilibria. While self-play is known not to always converge, it has
produced highly capable AI systems: <a href="https://deepmind.com/research/case-studies/alphago-the-story-so-far">AlphaGo</a> has beaten world Go
champions, and <a href="https://openai.com/blog/openai-five/">OpenAI Five</a> has beaten a professional Dota 2 team.</p>
<p>We find it is still possible to attack victim policies in this more realistic
multi-agent threat model. Specifically, we exploit state-of-the-art policies
trained by <a href="https://openai.com/blog/competitive-self-play/">Bansal et al.</a> from OpenAI in zero-sum games between simulated
Humanoid robots. We train our <em>adversarial policies</em> against a fixed victim
policy, using less than 3% as many timesteps as the victim was trained for. In
other respects, the adversary is trained like the self-play opponents: we use the
same RL algorithm, <a href="https://openai.com/blog/openai-baselines-ppo/">Proximal Policy Optimization</a>, and the same sparse
reward. Surprisingly, the adversarial policies reliably beat most victims,
<em>despite not standing up and instead flailing on the ground</em>.</p>
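<p>As a rough sketch (our reconstruction, not the authors' released code), the attack reduces to ordinary single-agent RL once the victim is frozen: embedding the fixed victim policy inside the environment turns the two-player game into a standard MDP for the adversary, which can then be trained with PPO or any other RL algorithm. The environment and policies below are toy stand-ins.</p>

```python
import random

class TwoPlayerEnv:
    """Toy stand-in for a zero-sum two-player environment."""
    def reset(self):
        self.t = 0
        return (0.0, 0.0)                      # (victim_obs, adversary_obs)

    def step(self, victim_action, adversary_action):
        self.t += 1
        done = self.t >= 10
        # Sparse zero-sum reward, granted only at episode end.
        adv_reward = 1.0 if (done and adversary_action > victim_action) else 0.0
        return (0.0, 0.0), adv_reward, done

class SingleAgentWrapper:
    """Embed a frozen victim so the adversary faces a standard MDP."""
    def __init__(self, env, victim_policy):
        self.env, self.victim = env, victim_policy

    def reset(self):
        self.victim_obs, adv_obs = self.env.reset()
        return adv_obs

    def step(self, adversary_action):
        victim_action = self.victim(self.victim_obs)   # victim never updated
        (self.victim_obs, adv_obs), reward, done = self.env.step(
            victim_action, adversary_action)
        return adv_obs, reward, done

# Roll out one episode with a random adversary; a real attack would plug this
# wrapped env into PPO and train for ~3% of the victim's training timesteps.
env = SingleAgentWrapper(TwoPlayerEnv(), victim_policy=lambda obs: 0.5)
obs, done, episode_return = env.reset(), False, 0.0
while not done:
    obs, reward, done = env.step(random.random())
    episode_return += reward
```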
<p style="text-align:center;">
<img src="https://bair.berkeley.edu/static/blog/attacks/02_envs.png" width="" /><br />
</p>
<p>In the video at the top of the post, we show victims in three different
environments playing normal self-play opponents and adversarial policies. The
<em>Kick and Defend</em> environment is a penalty shootout between a victim kicker and
goalie opponent. <em>You Shall Not Pass</em> has a victim runner trying to cross the
finish line, and an opponent blocker trying to prevent them. <em>Sumo Humans</em> has
two agents competing on a round arena to knock out their opponent.</p>
<p>In <em>Kick and Defend</em> and <em>You Shall Not Pass</em>, the adversarial policy never
stands up nor touches the victim. Instead, it positions its body so as
to cause the victim’s policy to take poor actions. This style of attack is
impossible in <em>Sumo Humans</em>, where the adversarial policy would immediately
lose if it fell over. Instead, the adversarial policy learns to kneel in the
center in a stable position, which proves surprisingly effective.</p>
<p style="text-align:center;">
<img src="https://bair.berkeley.edu/static/blog/attacks/03_masked.png" width="" /><br />
</p>
<p>To better understand how the adversarial policies exploit their victims, we
created “masked” versions of victim policies. The masked victim always observes
a static value for the opponent position, corresponding to a typical initial
state. This doctored observation is then passed to the original victim
policy.</p>
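<p>The masking procedure can be sketched as a thin wrapper (our reconstruction; the observation layout and helper names are hypothetical): the slice of the observation encoding the opponent's position is overwritten with a fixed value from a typical initial state before reaching the unmodified victim policy.</p>

```python
def make_masked_policy(victim_policy, opponent_slice, initial_opponent_obs):
    """Wrap a policy so it always sees a static opponent position."""
    def masked_policy(obs):
        obs = list(obs)
        obs[opponent_slice] = initial_opponent_obs   # freeze opponent features
        return victim_policy(obs)
    return masked_policy

# Toy layout: observation = [own_x, own_y, opp_x, opp_y]; mask indices 2:4.
victim = lambda obs: sum(obs)                        # stand-in for the policy
masked = make_masked_policy(victim, slice(2, 4), [9.0, 9.0])
out = masked([1.0, 2.0, -5.0, -5.0])                 # opponent features -> 9.0
```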
<div class="videoWrapper">
<iframe src="https://www.youtube.com/embed/RFXdb8YmARA?rel=0" frameborder="0" allowfullscreen=""></iframe>
</div>
<p><br /></p>
<p>One would expect performance to degrade when the policy cannot see its
opponent, and indeed the masked victims win less often against normal
opponents. However, they are <em>far</em> more robust to adversarial policies. This
result shows that the adversarial policies win by taking actions to induce
<em>natural</em> observations that are adversarial to the victim, and not by
physically interfering with the victim.</p>
<p style="text-align:center;">
<img src="https://bair.berkeley.edu/static/blog/attacks/04_cycle.png" width="500" /><br />
</p>
<p>Furthermore, these results show there is a cyclic relationship between the
policies. There is no overall strongest policy: the best policy depends on the
other player’s policy, like in <a href="https://en.wikipedia.org/wiki/Rock_paper_scissors">rock-paper-scissors</a>. Technically this is
known as <a href="https://en.wikipedia.org/wiki/Nontransitive_game">non-transitivity</a>: policy A beats B which beats C, yet C beats A.
This is surprising since these environments’ real-world analogs are
(approximately) transitive: professional human soccer players and sumo
wrestlers can <a href="https://www.youtube.com/watch?v=s5f8hjzxmkA">reliably beat amateurs</a>. Self-play <a href="http://proceedings.mlr.press/v97/balduzzi19a.html">assumes
transitivity</a> and so this may be why the self-play policies are vulnerable
to attack.</p>
<div class="videoWrapper">
<iframe src="https://www.youtube.com/embed/hfwKeyhVufU?rel=0" frameborder="0" allowfullscreen=""></iframe>
</div>
<p><br /></p>
<p>Of course in general we don’t want to completely blind the victim, since this
hurts performance against normal opponents. Instead, we propose adversarial
training: fine-tuning the victim policy against the adversary that has been
trained against it. Specifically, we fine-tune for 20 million timesteps, the
same amount of experience the adversary is trained with. Half of the episodes
are against an adversary, and the other half against a normal opponent. We find
the fine-tuned victim policy is robust to the adversary it was trained against,
and suffers only a small performance drop against a normal opponent.</p>
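<p>The fine-tuning schedule amounts to sampling the opponent anew each episode. A minimal sketch, assuming the 50/50 split described above (<code>sample_opponent</code> is a hypothetical helper):</p>

```python
import random

def sample_opponent(adversary, normal_opponent, p_adversary=0.5, rng=random):
    """Pick this episode's opponent; half the episodes face the adversary."""
    return adversary if rng.random() < p_adversary else normal_opponent

rng = random.Random(0)
counts = {"adversary": 0, "normal": 0}
for _ in range(10000):
    counts[sample_opponent("adversary", "normal", rng=rng)] += 1
```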
<p>However, one might wonder if this fine-tuned victim is robust to our <em>attack
method</em>, or just the adversary it was fine-tuned against. Repeating the attack
method finds a new adversarial policy:</p>
<div class="videoWrapper">
<iframe src="https://www.youtube.com/embed/sY9uUZqXsl4?rel=0" frameborder="0" allowfullscreen=""></iframe>
</div>
<p><br /></p>
<p>Notably, the new adversary trips the victim up rather than just flailing
around. This suggests our new policies are meaningfully more robust (although
there may of course be failure modes we haven’t discovered).</p>
<p>The existence of adversarial policies has significant implications for the
training, understanding and evaluation of RL policies. First, adversarial
policies highlight the need to move beyond self-play. Promising approaches
include iteratively applying the adversarial training defense above, and
<a href="https://arxiv.org/pdf/1807.01281.pdf">population-based training</a> which naturally trains against a broader range
of opponents.</p>
<p>Second, this attack shows that RL policies can be vulnerable to adversarial
observations that are on the manifold of naturally occurring data. By contrast,
most prior work on adversarial examples has produced physically unrealistic
perturbed images.</p>
<p>Finally, these results highlight the limitations of current evaluation
methodologies. The victim policies have strong average-case performance against
a range of both normal opponents and random policies. Yet their worst-case
performance against adversarial policies is extremely poor. Moreover, it would
be difficult to find this worst-case by hand: the adversarial policies do not
seem like challenging opponents to human eyes. We would recommend testing
safety-critical policies by adversarial attack, constructively lower bounding
the policies’ exploitability.</p>
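<p>Concretely, such an evaluation reports worst-case performance against trained adversaries alongside the usual average-case numbers. A sketch with hypothetical helper names (<code>play_episode</code> returns 1 on a victim win; the toy "policies" here are just numbers):</p>

```python
def win_rate(victim, opponent, play_episode, n_episodes=100):
    """Empirical win rate of the victim over n_episodes."""
    return sum(play_episode(victim, opponent)
               for _ in range(n_episodes)) / n_episodes

def evaluate(victim, normal_opponents, adversaries, play_episode):
    # Average-case: mean win rate over standard opponents.
    avg_case = sum(win_rate(victim, o, play_episode)
                   for o in normal_opponents) / len(normal_opponents)
    # Worst-case: minimum win rate over adversaries trained to exploit us.
    worst_case = min(win_rate(victim, a, play_episode) for a in adversaries)
    return {"average_case": avg_case, "worst_case": worst_case}

beats = lambda victim, opponent: 1 if victim > opponent else 0
report = evaluate(0.5, normal_opponents=[0.1, 0.2], adversaries=[0.9],
                  play_episode=beats)
```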
<p>To find out more, check out <a href="https://arxiv.org/abs/1905.10615">our paper</a> or visit the <a href="https://adversarialpolicies.github.io/">project website</a>
for more example videos.</p>
Fri, 27 Mar 2020 02:00:00 -0700
https://bairblog.github.io/2020/03/27/attacks/

Does On-Policy Data Collection Fix Errors in Off-Policy Reinforcement Learning?
<meta name="twitter:title" content="Does On-Policy Data Collection Fix Errors in Off-Policy Reinforcement Learning?" />
<meta name="twitter:card" content="summary_large_image" />
<meta name="twitter:image" content="https://bair.berkeley.edu/static/blog/discor/discor.png" />
<p>Reinforcement learning has seen a great deal of success in solving complex decision making problems ranging from <a href="https://arxiv.org/abs/1806.10293">robotics</a> to <a href="https://deepmind.com/blog/article/AlphaStar-Grandmaster-level-in-StarCraft-II-using-multi-agent-reinforcement-learning">games</a> to <a href="http://www.wi-frankfurt.de/publikationenNeu/AReinforcementLearningApproach.pdf">supply chain management</a> to <a href="https://arxiv.org/pdf/1810.12027.pdf">recommender systems</a>. Despite their success, deep reinforcement learning algorithms can be exceptionally difficult to use, due to unstable training, sensitivity to hyperparameters, and generally unpredictable and poorly understood convergence properties. Multiple explanations, and corresponding solutions, have been proposed for improving the stability of such methods, and we have seen good progress over the last few years on these algorithms. In this blog post, we will dive deep into analyzing a central and underexplored reason behind some of the problems with the class of deep RL algorithms based on dynamic programming, which encompass the popular <a href="https://web.stanford.edu/class/psych209/Readings/MnihEtAlHassibis15NatureControlDeepRL.pdf">DQN</a> and soft actor-critic (<a href="https://arxiv.org/abs/1812.05905">SAC</a>) algorithms – the detrimental connection between data distributions and learned models.</p>
<!--
<p style="text-align:center;">
<img src="https://paper-attachments.dropbox.com/s_C0A5DF53824B57146C6C7BFA4F136835C682554FAEA4D3AC837E9CAA53C2DDCA_1583955854042_SupvsRL.svg" width="">
<br />
<i>
Figure 1: Distributions can impact the generalization properties of supervised
learning algorithms due to shift between train and test distributions. In RL,
besides generalization, distributions also affects other elements in the
learning process such as the actual updates performed, exploration and has a
significant impact on learning progress even in the absence of explicit
distribution shift.
</i>
</p>
-->
<!--more-->
<p>Before diving deep into a description of this problem, let us quickly recap
some of the main concepts in dynamic programming. Algorithms that apply dynamic
programming in conjunction with function approximation are generally referred
to as approximate dynamic programming (ADP) methods. ADP algorithms include
some of the most popular, state-of-the-art RL methods such as variants of deep
Q-networks (DQN) and soft actor-critic (SAC) algorithms. ADP methods based on
Q-learning train action-value functions, <script type="math/tex">Q(s, a)</script>, via a Bellman backup. In
practice, this corresponds to training a parametric function, <script type="math/tex">Q_\theta(s,
a)</script>, by minimizing the mean squared difference to a backup estimate of the
Q-function, defined as:</p>
<script type="math/tex; mode=display">\mathcal{B}^*Q(s, a) = r(s, a) + \gamma \mathbb{E}_{s'|s, a} [\max_{a'} \bar{Q}(s', a')],</script>
<p>where <script type="math/tex">\bar{Q}</script> denotes a previous instance of the original Q-function,
<script type="math/tex">Q_\theta</script>, and is commonly referred to as a target network. This update is
summarized in the equation below.</p>
<script type="math/tex; mode=display">\theta \leftarrow \arg \min_\theta \mathbb{E}_{s, a \sim \mathcal{D}} \left[(Q_\theta(s, a)- (r(s, a) + \gamma \mathbb{E}_{s'|s, a} [\max_{a'}\bar{Q}(s', a')]))^2 \right]</script>
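<p>For a tabular Q-function the arg-min above has a closed form – each entry simply moves to its backup target – so the iteration can be sketched in a few lines (a toy MDP with made-up rewards, not an example from the paper):</p>

```python
gamma = 0.9
states, actions = [0, 1], [0, 1]

def bellman_target(r, s_next, Q_bar):
    # B*Q(s, a) = r(s, a) + gamma * max_a' Q_bar(s', a')
    return r + gamma * max(Q_bar[(s_next, a)] for a in actions)

# Toy deterministic MDP: action a always moves to state a; reward 1 in state 1.
Q = {(s, a): 0.0 for s in states for a in actions}
for _ in range(200):
    Q_bar = dict(Q)                 # frozen copy plays the "target network"
    for s in states:
        for a in actions:
            r = 1.0 if s == 1 else 0.0
            Q[(s, a)] = bellman_target(r, s_next=a, Q_bar=Q_bar)
```

After enough iterations this converges to the optimal values, e.g. staying in the rewarding state is worth <code>1 / (1 - gamma) = 10</code>.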
<p>An analogous update is also used for
<a href="https://papers.nips.cc/paper/1786-actor-critic-algorithms.pdf">actor-critic</a>
methods that also maintain an explicitly parametrized policy,
<script type="math/tex">\pi_\phi(a|s)</script>, alongside a Q-function. Such an update typically replaces
<script type="math/tex">\max_{a'}</script> with an expectation under the policy, <script type="math/tex">\mathbb{E}_{a' \sim
\pi_\phi}</script>. We shall use the <script type="math/tex">\max_{a'}</script> version for consistency throughout;
however, the actor-critic version follows analogously. These ADP methods aim to
learn the optimal value function, <script type="math/tex">Q^*</script>, by applying the Bellman backup
iteratively until convergence.</p>
<!--
***
-->
<p>A central factor that affects the performance of ADP algorithms is the choice
of the training data-distribution, <script type="math/tex">\mathcal{D}</script>, as shown in the equation
above. The choice of <script type="math/tex">\mathcal{D}</script> is an integral component of the backup,
and it affects solutions obtained via ADP methods, especially since function
approximation is involved. Unlike tabular settings, function approximation
causes the learned Q function to depend on the choice of data distribution
<script type="math/tex">\mathcal{D}</script>, thereby affecting the dynamics of the learning process. We
show that on-policy exploration induces distributions <script type="math/tex">\mathcal{D}</script> such that
training Q-functions under <script type="math/tex">\mathcal{D}</script> may fail to correct systematic
errors in the Q-function, even if Bellman error is minimized as much as
possible – a phenomenon that we refer to as an absence of <strong><em>corrective
feedback</em></strong>.</p>
<h1 id="corrective-feedback-and-why-it-is-absent-in-adp">Corrective Feedback and Why it is Absent in ADP</h1>
<p>What is corrective feedback formally? How do we determine if it is present or
absent in ADP methods? In order to build intuition, we first present a simple
contextual bandit (one step RL) example, where the Q-function is trained to
match <script type="math/tex">Q^*</script> via supervised updates, without bootstrapping. This enjoys
corrective feedback, and we then contrast it with ADP methods, which do not. In
this example, the goal is to learn the optimal value function <script type="math/tex">Q^*(s, a)</script>,
which is equal to the reward <script type="math/tex">r(s, a)</script>. At iteration <script type="math/tex">k</script>, the algorithm
minimizes the estimation error of the Q-function:</p>
<script type="math/tex; mode=display">\mathcal{L}(Q) = \mathbb{E}_{s \sim \beta(s), a \sim \pi_k(a|s)}[|Q_k(s, a) - Q^*(s, a)|].</script>
<p>Using an <script type="math/tex">\varepsilon</script>-greedy or Boltzmann policy for exploration, denoted by <script type="math/tex">\pi_k</script>, gives rise
to a <em>hard negative mining</em> phenomenon – the policy chooses precisely those
actions that correspond to possibly over-estimated Q-values for each state
<script type="math/tex">s</script> and observes the corresponding value, <script type="math/tex">r(s, a)</script> (i.e., <script type="math/tex">Q^*(s, a)</script>), as a
result. Then, minimizing <script type="math/tex">\mathcal{L}(Q)</script> on samples collected this way
corrects errors in the Q-function, as <script type="math/tex">Q_k(s, a)</script> is pushed closer to
<script type="math/tex">Q^*(s, a)</script> for actions <script type="math/tex">a</script> with incorrectly high Q-values, correcting
precisely the Q-values which may cause sub-optimal performance. This
constructive interaction between online data collection and error correction –
where the induced online data distribution <em>corrects</em> errors in the value
function – is what we refer to as <strong>corrective feedback</strong>.</p>
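<p>A toy numeric illustration of this feedback loop (our example, not the paper's): the exploration policy keeps selecting the over-estimated arm, observes its true value, and the supervised regression toward <script type="math/tex">Q^*</script> corrects exactly that over-estimate, so the total error shrinks.</p>

```python
import random

Q_star = {0: 1.0, 1: 0.5}          # true one-state, two-arm values
Q = {0: 0.0, 1: 5.0}               # arm 1 starts badly over-estimated
rng, lr, eps = random.Random(0), 0.5, 0.5

def total_error():
    return sum(abs(Q[a] - Q_star[a]) for a in Q)

error_before = total_error()
for _ in range(200):
    greedy = max(Q, key=Q.get)             # hard negative mining: pick the
    a = greedy if rng.random() > eps else rng.choice([0, 1])  # inflated arm
    Q[a] += lr * (Q_star[a] - Q[a])        # supervised regression toward Q*
error_after = total_error()
```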
<p>In contrast, we will demonstrate that ADP methods that rely on previous
Q-functions to generate targets for training the current Q-function may not
benefit from corrective feedback. This difference between bandits and ADP
arises because the target values are computed by applying a Bellman backup to
the previous Q-function, <script type="math/tex">\bar{Q}</script> (the target value), rather than the optimal
<script type="math/tex">Q^*</script>, so errors in <script type="math/tex">\bar{Q}</script> at the next states can result in incorrect
Q-value targets at the current state. No matter how often the current
transition is observed, or how accurately Bellman errors are minimized, the
error in the Q-value with respect to the optimal Q-function, <script type="math/tex">|Q - Q^*|</script>, at
this state is not reduced. Furthermore, in order to obtain correct target
values, we need to ensure that values at state-action pairs occurring at the
tail ends of the data distribution <script type="math/tex">\mathcal{D}</script>, which are the primary causes
of errors in Q-values at other states, are correct. However, as we will show
via a simple didactic example, this correction process may be extremely slow
and may not occur, mainly because of undesirable generalization effects of the
function approximator.</p>
<p>Let’s consider a didactic example of a tree-structured deterministic MDP with 7
states and 2 actions, <script type="math/tex">a_1</script> and <script type="math/tex">a_2</script>, at each state.</p>
<p style="text-align:center;">
<img src="https://paper-attachments.dropbox.com/s_B1B66C162F9CE3C53CD9315CF7237DAB864CB3AAEED950E42789F193B67C60EB_1584253706874_on_policy_figure_aliasing.png" width="" />
<br />
<i>
Figure 1: Run of an ADP algorithm with on-policy data collection. Boxed nodes
and circled nodes denote groups of states aliased by function approximation --
values of these nodes are affected due to parameter sharing and function
approximation.
</i>
</p>
<p>A run of an ADP algorithm that chooses the current on-policy state-action
marginal as <script type="math/tex">\mathcal{D}</script> on this tree MDP is shown in Figure 1. Thus,
the Bellman error at a state is minimized in proportion to the frequency of
occurrence of that state in the policy state-action marginal. Since the leaf
node states are the least frequent in this on-policy marginal distribution (due
to the discounting), the Bellman backup is unable to correct errors in Q-values
at such leaf nodes, due to their low frequency and aliasing with other states
arising due to function approximation. Using incorrect Q-values at the leaf
nodes to generate targets for other nodes in the tree just gives rise to
incorrect values, even if Bellman error is fully minimized at those states.
Thus, most of the Bellman updates do not actually bring Q-values at the states
of the MDP closer to <script type="math/tex">Q^*</script>, since the primary cause of incorrect target
values isn’t corrected.</p>
<p>This observation is surprising, since it demonstrates how the choice of an
online distribution, coupled with function approximation, can lead to learning
incorrect Q-values. On the other hand, a scheme that chooses to update states
level by level progressively (Figure 2), ensuring that target values used at
any iteration of learning are correct, very easily learns correct Q-values in
this example.</p>
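<p>The oracle scheme of Figure 2 can be sketched on a toy deterministic binary tree (our reconstruction, with made-up leaf rewards): sweeping from the deepest level up to the root makes every backup target exact, so a single pass recovers the correct Q-values.</p>

```python
gamma = 0.9
# Binary tree, 7 states: children of s are 2s+1 (action 0) and 2s+2 (action 1).
leaf_reward = {3: 0.0, 4: 1.0, 5: 0.0, 6: 2.0}    # made-up terminal rewards

V = {s: leaf_reward.get(s, 0.0) for s in range(7)}
Q = {}
for s in [2, 1, 0]:                     # internal nodes, deepest level first
    for a in (0, 1):
        # Target is exact: the child's value was finalized on an earlier step.
        Q[(s, a)] = gamma * V[2 * s + 1 + a]
    V[s] = max(Q[(s, 0)], Q[(s, 1)])
```

One sweep suffices here precisely because every target used is already correct, in contrast with the on-policy ordering of Figure 1.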
<!--
![Figure 2: Run of an ADP algorithm with an oracle distribution, that updates states level-by level, progressing through the tree from the leaves to the root. Even in the presence of function approximation, selecting the right set of nodes for updates gives rise to correct Q-values.](https://paper-attachments.dropbox.com/s_B1B66C162F9CE3C53CD9315CF7237DAB864CB3AAEED950E42789F193B67C60EB_1584341988034_discor_func_approx_final.png)
-->
<p style="text-align:center;">
<img src="https://paper-attachments.dropbox.com/s_B1B66C162F9CE3C53CD9315CF7237DAB864CB3AAEED950E42789F193B67C60EB_1584341988034_discor_func_approx_final.png" width="" />
<br />
<i>
Figure 2: Run of an ADP algorithm with an oracle distribution, that updates
states level-by level, progressing through the tree from the leaves to the
root. Even in the presence of function approximation, selecting the right set
of nodes for updates gives rise to correct Q-values.
</i>
</p>
<h1 id="consequences-of-absent-corrective-feedback">Consequences of Absent Corrective Feedback</h1>
<p>Now, one might ask if an absence of corrective feedback occurs in practice,
beyond a simple didactic example and whether it hurts in practical problems.
Since visualizing the dynamics of the learning process is hard in practical
problems as we did for the didactic example, we instead devise a metric that
quantifies our intuition for corrective feedback. This metric, which we call
<em>value error</em>, is given by:
<script type="math/tex; mode=display">\mathcal{E}_k = \mathbb{E}_{d^{\pi_k}} [|Q_k - Q^*|]</script>
<p>Increasing values of <script type="math/tex">\mathcal{E}_k</script> imply that the algorithm is pushing
Q-values farther away from <script type="math/tex">Q^*</script>; if this happens over a number of
iterations, corrective feedback is absent. On the other hand,
decreasing values of <script type="math/tex">\mathcal{E}_k</script> imply that the algorithm is
continuously improving its estimate of <script type="math/tex">Q</script>, moving it towards <script type="math/tex">Q^*</script> with
each iteration, indicating the presence of corrective feedback.</p>
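<p>Computing this metric is straightforward in the tabular case (a sketch with toy values): it is the expectation of <script type="math/tex">|Q_k - Q^*|</script> under the current policy's state-action visitation distribution.</p>

```python
def value_error(d_pi, Q, Q_star):
    # E_k = E_{(s,a) ~ d^{pi_k}} [ |Q_k(s,a) - Q*(s,a)| ]
    return sum(d_pi[sa] * abs(Q[sa] - Q_star[sa]) for sa in d_pi)

d_pi   = {("s0", "a0"): 0.7, ("s0", "a1"): 0.3}   # visitation distribution
Q      = {("s0", "a0"): 1.0, ("s0", "a1"): 0.0}
Q_star = {("s0", "a0"): 2.0, ("s0", "a1"): 1.0}
err = value_error(d_pi, Q, Q_star)                # 0.7*1 + 0.3*1
```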
<p>Observe in Figure 3 that ADP methods can suffer from prolonged periods where
this global measure of error in the Q-function, <script type="math/tex">\mathcal{E}_k</script>, is
increasing or fluctuating, and the corresponding returns degrade or stagnate,
implying an absence of corrective feedback.</p>
<!--
![Figure 3: Consequences of absent corrective feedback, including (a) sub-optimal convergence, (b) instability in learning and (c) inability to learn with sparse rewards.](https://paper-attachments.dropbox.com/s_B1B66C162F9CE3C53CD9315CF7237DAB864CB3AAEED950E42789F193B67C60EB_1583775913964_Screen+Shot+2020-03-09+at+10.44.58+AM.png)
-->
<p style="text-align:center;">
<img src="https://paper-attachments.dropbox.com/s_B1B66C162F9CE3C53CD9315CF7237DAB864CB3AAEED950E42789F193B67C60EB_1583775913964_Screen+Shot+2020-03-09+at+10.44.58+AM.png" width="" />
<br />
<i>
Figure 3: Consequences of absent corrective feedback, including (a) sub-optimal
convergence, (b) instability in learning and (c) inability to learn with sparse
rewards.
</i>
</p>
<p>In particular, we describe three different consequences of an absence of
corrective feedback:</p>
<ol>
<li>
<p><strong>Convergence to suboptimal Q-functions.</strong> We find that on-policy sampling
can cause ADP to converge to a suboptimal solution, even in the absence of
sampling error. Figure 3(a) shows that the value error <script type="math/tex">\mathcal{E}_k</script>
rapidly decreases initially, and eventually converges to a value significantly
greater than 0, from which the learning process never recovers.</p>
</li>
<li>
<p><strong>Instability in the learning process.</strong> We observe that ADP with replay
buffers can be unstable. For instance, the algorithm is prone to degradation
even if the latest policy obtains returns that are very close to the optimal
return in Figure 3(b).</p>
</li>
<li>
<p><strong>Inability to learn with low signal-to-noise ratio.</strong> Absence of corrective
feedback can also prevent ADP algorithms from learning quickly in scenarios
with low signal-to-noise ratio, such as tasks with sparse/noisy rewards as
shown in Figure 3(c). Note that this is not an exploration issue, since all
transitions in the MDP are provided to the algorithm in this experiment.</p>
</li>
</ol>
<h1 id="inducing-maximal-corrective-feedback-via-distribution-correction">Inducing Maximal Corrective Feedback via Distribution Correction</h1>
<p>Now that we have defined corrective feedback and gone over some detrimental
consequences an absence of it can have on the learning process of an ADP
algorithm, what might be some ways to fix this problem? To recap, an absence of
corrective feedback occurs when ADP algorithms naively use the on-policy or
replay buffer distributions for training Q-functions. One way to prevent this
problem is to compute an “optimal” data distribution that provides maximal
corrective feedback, and then train Q-functions using this distribution. This way
we can ensure that the ADP algorithm always enjoys corrective feedback, and hence
makes steady learning progress. The strategy we use in our work is to compute
this optimal distribution and then perform a <strong>weighted Bellman update</strong> that
re-weights the data distribution in the replay buffer to this optimal
distribution (in practice, a tractable approximation is required, as we will
see) via importance sampling based techniques.</p>
<p>We will not go into the full details of our derivation in this article;
however, we state the optimization problem used to obtain a form for this
optimal distribution, and encourage readers interested in the theory to check
out Section 4 in our paper. In this optimization problem, our goal is to minimize
a measure of corrective feedback, given by the <em>value error</em> <script type="math/tex">\mathcal{E}_k</script>,
with respect to the distribution <script type="math/tex">p_k</script> used for Bellman error minimization,
at every iteration <script type="math/tex">k</script>. This gives rise to the following problem:</p>
<script type="math/tex; mode=display">\min _{p_{k}} \; \mathbb{E}_{d^{\pi_{k}}}[|Q_{k}-Q^{*}|]</script>
<script type="math/tex; mode=display">\text { s.t. }\;\; Q_{k}=\arg \min _{Q} \mathbb{E}_{p_{k}}\left[\left(Q-\mathcal{B}^{*} Q_{k-1}\right)^{2}\right]</script>
<p>We show in our paper that the solution of this optimization problem, that we
refer to as the optimal distribution, <script type="math/tex">p_k^*</script>, is given by:</p>
<script type="math/tex; mode=display">p_{k}^*(s, a) \propto \exp \left(-\left|Q_{k}-Q^{*}\right|(s, a)\right) \frac{\left|Q_{k}-\mathcal{B}^{*} Q_{k-1}\right|(s, a)}{\lambda^{*}}</script>
<p>By simplifying this expression, we obtain a practically viable expression for
weights, <script type="math/tex">w_k</script>, at any iteration <script type="math/tex">k</script> that can be used to re-weight the data
distribution:</p>
<script type="math/tex; mode=display">w_{k}(s, a) \propto \exp \left(-\frac{\gamma \mathbb{E}_{s'|s, a} \mathbb{E}_{a' \sim \pi_\phi(\cdot|s')} \Delta_{k-1}(s', a')}{\tau}\right)</script>
<p>where <script type="math/tex">\Delta_k</script> is the accumulated Bellman error over iterations, and it
satisfies a convenient recursion making it amenable to practical
implementations,</p>
<script type="math/tex; mode=display">\Delta_{k}(s, a) =\left|Q_{k}-\mathcal{B}^{*} Q_{k-1}\right|(s, a) +\gamma \mathbb{E}_{s'|s, a} \mathbb{E}_{a' \sim \pi_\phi(\cdot|s')} \Delta_{k-1}(s', a')</script>
<p>and <script type="math/tex">\pi_\phi</script> is the Boltzmann or <script type="math/tex">\varepsilon</script>-greedy policy
corresponding to the current Q-function.</p>
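<p>In tabular form, the weight computation and the recursion for <script type="math/tex">\Delta_k</script> can be sketched as follows (our reconstruction; the transition map and error values are toy inputs, and <code>next_sa</code> stands in for the expectation over <script type="math/tex">(s', a')</script> under the current policy):</p>

```python
import math

gamma, tau = 0.9, 1.0

def update_delta(delta_prev, bellman_err, next_sa):
    # Delta_k(s,a) = |Q_k - B*Q_{k-1}|(s,a) + gamma * Delta_{k-1}(s',a')
    return {sa: bellman_err[sa] + gamma * delta_prev[next_sa[sa]]
            for sa in bellman_err}

def discor_weights(delta_prev, next_sa):
    # w_k(s,a) proportional to exp(-gamma * Delta_{k-1}(s',a') / tau)
    w = {sa: math.exp(-gamma * delta_prev[next_sa[sa]] / tau) for sa in next_sa}
    z = sum(w.values())
    return {sa: v / z for sa, v in w.items()}

# Transition "A" leads into a state-action pair with large accumulated error,
# so it is down-weighted relative to "C", whose target is trustworthy.
next_sa = {"A": "B", "C": "A"}
delta = {"A": 0.0, "B": 2.0}
weights = discor_weights(delta, next_sa)
```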
<p>What does this expression for <script type="math/tex">w_k</script> intuitively correspond to? Observe that
the term appearing in the exponent in the expression for <script type="math/tex">w_k</script> corresponds to
the accumulated Bellman error in the target values. Our choice of <script type="math/tex">w_k</script>,
thus, basically down-weights transitions with highly incorrect target values.
This technique falls into a broader class of <strong>abstention</strong> based techniques
that are common in supervised learning settings with noisy labels, where
down-weighting datapoints (transitions here) with errorful labels (target
values here) can boost generalization and correctness properties of the learned
model.</p>
<!--
![Figure 4: Schematic of the DisCor algorithm. Transitions with errorful target values are downweighted.](https://paper-attachments.dropbox.com/s_B1B66C162F9CE3C53CD9315CF7237DAB864CB3AAEED950E42789F193B67C60EB_1584342258383_discor_schematic.png)
-->
<p style="text-align:center;">
<img src="https://paper-attachments.dropbox.com/s_B1B66C162F9CE3C53CD9315CF7237DAB864CB3AAEED950E42789F193B67C60EB_1584342258383_discor_schematic.png" width="" />
<br />
<i>
Figure 4: Schematic of the DisCor algorithm. Transitions with errorful target
values are downweighted.
</i>
</p>
<p>Why does our choice of <script type="math/tex">\Delta_k</script>, i.e., the sum of accumulated Bellman
errors, suffice? This is because <script type="math/tex">\Delta_k</script> accounts for how error is
propagated in ADP methods. Bellman errors, <script type="math/tex">|Q_k - \mathcal{B}^*Q_{k-1}|</script> are
propagated under the current policy <script type="math/tex">\pi_{k-1}</script>, and then discounted when
computing target values for updates in ADP. <script type="math/tex">\Delta_k</script> captures exactly this,
and therefore, using this estimate in our weights suffices.</p>
<p>Our practical algorithm, which we refer to as <strong>D</strong>is<strong>C</strong>or (Distribution
Correction), is identical to conventional ADP methods like Q-learning, with the
exception that it performs a weighted Bellman backup: it assigns a weight
<script type="math/tex">w_k(s,a)</script> to each transition <script type="math/tex">(s, a, r, s')</script> and performs a Bellman backup
weighted by these weights, as shown below.</p>
<script type="math/tex; mode=display">Q_k \leftarrow \arg \min_Q \frac{1}{N} \sum_{i=1}^N w_k(s_i, a_i) \cdot \left(Q(s_i, a_i) - [r(s_i, a_i) + \gamma Q_{k-1}(s'_i, a'_i)]\right)^2</script>
<p>We depict the general principle in the schematic diagram shown in Figure 4.</p>
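<p>As a concrete illustration, here is a minimal tabular sketch of such a weighted backup. The per-transition weights are supplied as inputs; DisCor itself derives them from an estimate of the accumulated Bellman error, which this toy snippet does not compute.</p>

```python
import numpy as np

def weighted_bellman_update(Q, transitions, weights, gamma=0.99, lr=0.5):
    """One weighted Bellman backup on a tabular Q-table.

    Q: (num_states, num_actions) array, updated in place.
    transitions: iterable of (s, a, r, s_next) tuples.
    weights: per-transition weights w_k(s, a); DisCor would derive these
             from accumulated Bellman error, here they are just given.
    """
    for (s, a, r, s_next), w in zip(transitions, weights):
        target = r + gamma * Q[s_next].max()      # r(s,a) + gamma * max_a' Q_{k-1}(s', a')
        Q[s, a] += lr * w * (target - Q[s, a])    # gradient step on weighted squared error
    return Q
```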
<h1 id="how-does-discor-perform-in-practice">How does DisCor perform in practice?</h1>
<p>We finally present some results that demonstrate the efficacy of our method,
DisCor, in practical scenarios. Since DisCor only modifies the chosen
distribution for the Bellman update, it can be applied on top of any standard
ADP algorithm including soft actor-critic (SAC) or deep Q-network (DQN). Our
paper presents results for a number of tasks spanning a wide variety of
settings including robotic manipulation tasks, multi-task reinforcement
learning tasks, learning with stochastic and noisy rewards, and Atari games.
In this blog post, we present two of these results from robotic manipulation
and multi-task RL.</p>
<ol>
<li>
<p><strong>Robotic manipulation tasks.</strong> On six challenging benchmark tasks from the
<a href="https://meta-world.github.io/">Meta-World</a> suite, we observe that DisCor,
when combined with SAC, greatly outperforms prior state-of-the-art RL
algorithms, including SAC itself and prioritized experience replay
(<a href="https://arxiv.org/abs/1511.05952">PER</a>), a method that
prioritizes transitions with high Bellman error during training. DisCor
also usually starts learning earlier than the methods we compare to, and it
outperforms vanilla SAC by about <strong>50%</strong> on average in terms of
success rate on these tasks.</p>
<p style="text-align:center;">
<img src="https://paper-attachments.dropbox.com/s_B1B66C162F9CE3C53CD9315CF7237DAB864CB3AAEED950E42789F193B67C60EB_1583776003015_Screen+Shot+2020-03-09+at+10.45.54+AM.png" width="" />
<br />
</p>
</li>
<li>
<p><strong>Multi-task reinforcement learning.</strong> We also present results on
the Multi-task 10 (MT10) and Multi-task 50 (MT50) benchmarks from the
Meta-World suite. The goal here is to learn a single policy that can solve a
number of (10 or 50, respectively) different manipulation tasks that share
common structure. DisCor outperforms the state-of-the-art SAC
algorithm on both of these benchmarks by a wide margin (e.g., by <strong>50%</strong>
success rate on MT10). Unlike SAC, whose performance tends to
plateau over the course of learning, DisCor continues to make
steady progress until it converges.</p>
<p style="text-align:center;">
<img src="https://paper-attachments.dropbox.com/s_B1B66C162F9CE3C53CD9315CF7237DAB864CB3AAEED950E42789F193B67C60EB_1584342410332_Screen+Shot+2020-03-16+at+12.06.22+AM.png" width="" />
<br />
</p>
</li>
</ol>
<p>In our paper, we also perform evaluations on other domains such as Atari games
and OpenAI Gym benchmarks, and we encourage readers to check those out. We
also analyze the method on tabular domains to understand its
different aspects.</p>
<h1 id="perspectives-future-work-and-open-problems">Perspectives, Future Work and Open Problems</h1>
<p>Some of <a href="https://arxiv.org/abs/1906.00949">our</a> and
<a href="https://arxiv.org/abs/1712.06924">other</a> prior work has highlighted the impact
of the data distribution on the performance of ADP algorithms. We observed in
another <a href="https://arxiv.org/abs/1902.10250">prior work</a> that, contrary to the
intuitive belief in the efficacy of online Q-learning with on-policy data
collection, Q-learning with a uniform distribution over states and actions
seemed to perform best. Obtaining a uniform distribution over state-action
tuples during training is not possible in RL unless all states and actions are
observed at least once, which may not be the case in a number of scenarios. We
might also ask whether the uniform distribution is even the best
choice in an RL setting. The form of the optimal distribution
derived in Section 4 of our paper is a potentially better choice, since it is
customized to the MDP under consideration.</p>
<p>Furthermore, in the domain of purely offline reinforcement learning, studied in
<a href="https://arxiv.org/abs/1906.00949">our</a> prior work and some other works, such
as <a href="https://arxiv.org/abs/1712.06924">this</a> and
<a href="https://arxiv.org/abs/1907.04543">this</a>, we observe that the data distribution
is again a central feature, where backing up out-of-distribution actions and
the inability to try these actions out in the environment to obtain answers to
counterfactual queries, can cause error accumulation and backups to diverge.
However, in this work, we demonstrate a somewhat counterintuitive finding: even
with on-policy data collection, where the algorithm can, in principle, evaluate
all forms of counterfactual queries, it may not make steady
learning progress, due to an undesirable interaction between the data
distribution and the generalization effects of the function approximator.</p>
<p style="text-align:center;">
<img src="https://paper-attachments.dropbox.com/s_B1B66C162F9CE3C53CD9315CF7237DAB864CB3AAEED950E42789F193B67C60EB_1583852356782_batchgraph.png" width="" />
<br />
</p>
<h2 id="what-might-be-a-few-promising-directions-to-pursue-in-future-work">What might be a few promising directions to pursue in future work?</h2>
<p><strong>Formal analysis of learning dynamics:</strong> While our study is an initial foray
into the role that data distributions play in the learning dynamics of ADP
algorithms, it motivates a significantly deeper direction of future study. We
need to understand how the deep neural network function
approximators underlying these ADP methods actually behave, in order to
get them to enjoy corrective feedback.</p>
<p><strong>Re-weighting to supplement exploration in RL problems:</strong> Our work shows the
promise of re-weighting techniques as a practically simple alternative to
altering entire exploration strategies. We believe that re-weighting is
a very promising general tool for developing RL algorithms.
In an online RL setting, re-weighting can remove some of the burden
from exploration algorithms, and can thus potentially help us employ complex
exploration strategies in RL algorithms.</p>
<p>More generally, we would like to make a case for analyzing the effects of the data
distribution more deeply in the context of deep RL algorithms. It is well known
that narrow distributions can lead to brittle solutions in supervised learning
that also fail to generalize. What is the corresponding analogue in
reinforcement learning? Distributional-robustness techniques have been
used in supervised learning to guarantee a uniformly convergent learning
process, but it remains unclear how to apply them in an RL setting with function
approximation. Part of the reason is that the theory of RL often
derives from tabular settings, where distributions do not hamper the learning
process to the extent they do with function approximation. However, as we
showed in this work, choosing the right distribution can lead to significant
gains in deep RL methods, and we therefore believe this issue should be
studied in more detail.</p>
<hr />
<p>This blog post is based on our recent paper:</p>
<ul>
<li><strong>DisCor: Corrective Feedback in Reinforcement Learning via Distribution Correction</strong> <br />
Aviral Kumar, Abhishek Gupta, Sergey Levine <br />
<a href="https://arxiv.org/abs/2003.07305">arXiv</a></li>
</ul>
<p>We thank Sergey Levine and Marvin Zhang for their valuable feedback on this blog post.</p>
Mon, 16 Mar 2020 02:00:00 -0700
https://bairblog.github.io/2020/03/16/discor/
https://bairblog.github.io/2020/03/16/discor/
BADGR:<br>The Berkeley Autonomous Driving Ground Robot<!--
Be careful that these three lines are at the top, and that the title and image change for each blog post!
-->
<meta name="twitter:title" content="BADGR: The Berkeley Autonomous Driving Ground Robot" />
<meta name="twitter:card" content="summary_image" />
<meta name="twitter:image" content="https://bair.berkeley.edu/static/blog/badgr/image_09.jpg" />
<p style="text-align:center;">
<img src="https://bair.berkeley.edu/static/blog/badgr/image_00.jpg" width="400" />
<img src="https://bair.berkeley.edu/static/blog/badgr/image_01.jpg" width="400" />
<br />
<i>
</i>
</p>
<p>Look at the images above. If I asked you to bring me a picnic blanket in the
grassy field, would you be able to? Of course. If I asked you to bring over a
cart full of food for a party, would you push the cart along the paved path or
on the grass? Obviously the paved path.</p>
<!--more-->
<p style="text-align:center;">
<img src="https://bair.berkeley.edu/static/blog/badgr/image_02.gif" width="400" />
<img src="https://bair.berkeley.edu/static/blog/badgr/image_03.gif" width="400" />
<br />
<i>
Prior navigation approaches based purely on geometric reasoning incorrectly
think that tall grass is an obstacle (left) and don’t understand the difference
between a smooth paved path and bumpy grass (right).
</i>
</p>
<p>While the answers to these questions may seem obvious, today’s mobile robots
would likely fail at these tasks: they would think the tall grass is the same
as a concrete wall, and wouldn’t know the difference between a smooth path and
bumpy grass. This is because most mobile robots think purely in terms of
geometry; they detect where obstacles are, and plan paths around these
perceived obstacles in order to reach the goal. This purely geometric view of
the world is insufficient for many navigation problems. Geometry is simply not
enough.</p>
<p>Can we enable robots to reason about navigational affordances directly from
images? We developed a robot that can autonomously learn about physical
attributes of the environment through its own experience in the real world,
without any simulation or human supervision. We call our robot learning system
BADGR: the Berkeley Autonomous Driving Ground Robot.</p>
<p>BADGR works by:</p>
<ol>
<li>autonomously collecting data</li>
<li>automatically labelling the data with self-supervision</li>
<li>training an image-based neural network predictive model</li>
<li>using the predictive model to plan into the future and execute actions that will lead the robot to accomplish the desired navigational task</li>
</ol>
<h3 id="1-data-collection">(1) Data Collection</h3>
<p style="text-align:center;">
<img src="https://bair.berkeley.edu/static/blog/badgr/image_04.gif" width="" />
<br />
<i>
BADGR autonomously collecting data in off-road (left) and urban (right)
environments.
</i>
</p>
<p>BADGR needs a large amount of diverse data in order to successfully learn how
to navigate. The robot collects data using a simple time-correlated <a href="https://en.wikipedia.org/wiki/Random_walk">random
walk</a> controller. As the robot collects data, if it experiences a collision
or gets stuck, the robot executes a simple reset controller and then continues
collecting data.</p>
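<p>A minimal sketch of what a time-correlated random walk controller can look like: each action is a low-pass filter of random noise, so consecutive actions are similar rather than i.i.d. The smoothing constant and action range are illustrative choices, not the values used on the actual robot.</p>

```python
import random

def time_correlated_random_walk(num_steps, smoothing=0.9, seed=0):
    """Generate a smooth (time-correlated) sequence of scalar actions,
    e.g., steering commands in [-1, 1].

    Each new action is a convex combination of the previous action and
    fresh uniform noise, so the sequence wanders smoothly instead of
    jumping around like an i.i.d. random policy would.
    """
    rng = random.Random(seed)
    action, actions = 0.0, []
    for _ in range(num_steps):
        noise = rng.uniform(-1.0, 1.0)
        action = smoothing * action + (1.0 - smoothing) * noise
        actions.append(action)
    return actions
```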
<h3 id="2-self-supervised-data-labelling">(2) Self-Supervised Data Labelling</h3>
<p>BADGR then goes through the data and calculates labels for specific
navigational events, such as the robot’s position and if the robot collided or
is driving over bumpy terrain, and adds these event labels back into the
dataset. These events are labelled by having a person write a short snippet of
code that maps the raw sensor data to the corresponding label. As an example,
the code snippet for determining if the robot is on bumpy terrain looks at the
<a href="https://en.wikipedia.org/wiki/Inertial_measurement_unit">IMU</a> sensor and labels the terrain as bumpy if the angular velocity
magnitudes are large.</p>
<p>We describe this labelling mechanism as self-supervised because although a
person has to manually write this code snippet, the code snippet can be used to
label all existing and future data without any additional human effort.</p>
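<p>A toy version of such a labelling snippet for the bumpy-terrain event, assuming the IMU angular velocities are logged as a (time, 3) array; the threshold is an illustrative value, not the one used by BADGR.</p>

```python
import numpy as np

def label_bumpy(angular_velocities, threshold=1.0):
    """Self-supervised event labels from raw IMU data.

    angular_velocities: (T, 3) array of IMU angular velocity readings.
    Returns a boolean array marking timesteps where the angular velocity
    magnitude exceeds `threshold` (rad/s, an illustrative value).
    """
    magnitudes = np.linalg.norm(angular_velocities, axis=1)
    return magnitudes > threshold
```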
<h3 id="3-neural-network-predictive-model">(3) Neural Network Predictive Model</h3>
<p style="text-align:center;">
<img src="https://bair.berkeley.edu/static/blog/badgr/image_05.png" width="" />
<br />
<i>
The neural network predictive model at the core of BADGR.
</i>
</p>
<p>BADGR then uses the data to train a deep neural network predictive model. The
neural network takes as input the current camera image and a future sequence of
planned actions, and outputs predictions of the future relevant events (such as
if the robot will collide or drive over bumpy terrain). The neural network
predictive model is trained to predict these future events as accurately as
possible.</p>
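<p>A structural sketch of this model in plain NumPy: the image is encoded once, the resulting latent state is unrolled forward under the planned actions, and each step emits per-event probabilities. The weights here are random placeholders and the architecture is a stand-in for BADGR's actual network.</p>

```python
import numpy as np

def init_params(feat_dim, action_dim, num_events, seed=0):
    """Random placeholder weights (not a trained model)."""
    rng = np.random.default_rng(seed)
    W_img = rng.normal(scale=0.1, size=(feat_dim, feat_dim))
    W_act = rng.normal(scale=0.1, size=(action_dim, feat_dim))
    W_out = rng.normal(scale=0.1, size=(feat_dim, num_events))
    return W_img, W_act, W_out

def predict_events(image_feat, action_seq, params):
    """Map image features and a planned action sequence to per-timestep
    event probabilities (e.g., collision, bumpy terrain).

    image_feat: (F,) vector, a stand-in for a CNN encoding of the image.
    action_seq: (horizon, A) planned actions.
    """
    W_img, W_act, W_out = params
    h = np.tanh(image_feat @ W_img)                       # encode the image once
    probs = []
    for a in action_seq:                                  # unroll over planned actions
        h = np.tanh(h + a @ W_act)                        # action-conditioned latent update
        probs.append(1.0 / (1.0 + np.exp(-(h @ W_out))))  # sigmoid per event
    return np.array(probs)                                # (horizon, num_events)
```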
<h3 id="4-planning-and-navigating">(4) Planning and Navigating</h3>
<p style="text-align:center;">
<img src="https://bair.berkeley.edu/static/blog/badgr/image_06.gif" width="" />
<br />
<i>
BADGR predicting which actions lead to bumpy terrain (left) or collisions
(right).
</i>
</p>
<p>When deploying BADGR, the user first defines a reward function that encodes the
specific task they want the robot to accomplish. For example, the reward
function could encourage driving towards a goal while discouraging collisions
or driving over bumpy terrain. BADGR then uses the trained predictive model,
current image observation, and reward function to plan a sequence of actions
that maximize reward. The robot executes the first action in this plan, and
BADGR continues to alternate between planning and executing until the task is
complete.</p>
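<p>This plan-and-execute loop can be sketched as a simple random-shooting planner: sample candidate action sequences, score each with the predictive model and the user's reward function, and execute only the first action of the best sequence before replanning. The <code>predict_events</code> and <code>reward_fn</code> callables are illustrative assumptions; BADGR's actual action optimizer differs in detail.</p>

```python
import numpy as np

def plan_action(predict_events, image_feat, reward_fn,
                horizon=5, num_candidates=128, action_dim=2, seed=0):
    """Pick the first action of the best sampled action sequence.

    predict_events: callable (image_feat, action_seq) -> event predictions
    reward_fn:      callable (action_seq, events) -> scalar reward
    Both callables are hypothetical stand-ins for the trained model and
    the user-defined task reward.
    """
    rng = np.random.default_rng(seed)
    best_first_action, best_reward = None, -np.inf
    for _ in range(num_candidates):
        action_seq = rng.uniform(-1.0, 1.0, size=(horizon, action_dim))
        events = predict_events(image_feat, action_seq)   # predicted collisions, bumpiness, ...
        reward = reward_fn(action_seq, events)            # score the whole sequence
        if reward > best_reward:
            best_reward, best_first_action = reward, action_seq[0]
    return best_first_action                              # execute this, then replan
```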
<hr />
<p>In our experiments, we studied how BADGR can learn about physical attributes of
the environment at <a href="https://rfs-env.berkeley.edu/home">a large off-site facility near UC Berkeley</a>. We compared
our approach to a geometry-based policy that uses <a href="https://en.wikipedia.org/wiki/Lidar">LIDAR</a> to plan
collision-free paths. (But note that BADGR only uses the onboard camera.)</p>
<p style="text-align:center;">
<img src="https://bair.berkeley.edu/static/blog/badgr/image_07.gif" width="" />
<br />
<i>
BADGR successfully reaches the goal while avoiding collisions and bumpy
terrain, while the geometry-based policy is unable to avoid bumpy terrain.
</i>
</p>
<p>We first considered the task of reaching a goal GPS location while avoiding
collisions and bumpy terrain in an urban environment. Although the
geometry-based policy always succeeded in reaching the goal, it failed to avoid
the bumpy grass. BADGR also always succeeded in reaching the goal, and
succeeded in avoiding bumpy terrain by driving on the paved paths. Note that we
never told the robot to drive on paths; BADGR automatically learned from the
onboard camera images that driving on concrete paths is smoother than driving
on the grass.</p>
<p style="text-align:center;">
<img src="https://bair.berkeley.edu/static/blog/badgr/image_08.gif" width="" />
<br />
<i>
BADGR successfully reaches the goal while avoiding collisions, while the
geometry-based policy is unable to make progress because it falsely believes
the grass is an untraversable obstacle.
</i>
</p>
<p>We also considered the task of reaching a goal GPS location while avoiding both
collisions and getting stuck in an off-road environment. The geometry-based
policy nearly never crashed or became stuck on grass, but sometimes refused to
move because it was surrounded by grass which it incorrectly labelled as
untraversable obstacles. BADGR almost always succeeded in reaching the goal by
avoiding collisions and getting stuck, while not falsely predicting that all
grass was an obstacle. This is because BADGR learned from experience that most
grass is in fact traversable.</p>
<p style="text-align:center;">
<img src="https://bair.berkeley.edu/static/blog/badgr/image_09.jpg" width="" />
<br />
<i>
BADGR improving as it gathers more data.
</i>
</p>
<p>In addition to being able to learn about physical attributes of the
environment, a key aspect of BADGR is its ability to continually self-supervise
and improve the model as it gathers more and more data. To demonstrate this
capability, we ran a controlled study in which BADGR gathers and trains on data
from one area, moves to a new target area, fails at navigating in this area,
but then eventually succeeds in the target area after gathering and training on
additional data from that area.</p>
<p>This experiment not only demonstrates that BADGR can improve as it gathers more
data, but also that previously gathered experience can actually accelerate
learning when BADGR encounters a new environment. And as BADGR autonomously
gathers data in more and more environments, it should take less and less time
to successfully learn to navigate in each new environment.</p>
<p style="text-align:center;">
<img src="https://bair.berkeley.edu/static/blog/badgr/image_10.gif" height="150" />
<img src="https://bair.berkeley.edu/static/blog/badgr/image_11.gif" height="150" />
<img src="https://bair.berkeley.edu/static/blog/badgr/image_12.gif" height="150" />
<br />
<i>
BADGR navigating in novel environments.
</i>
</p>
<p>We also evaluated how well BADGR navigates in novel environments—ranging from
a forest to urban buildings—not seen in the training data. This result
demonstrates that BADGR can generalize to novel environments if it gathers and
trains on a sufficiently large and diverse dataset.</p>
<hr />
<p>The key insight behind BADGR is that by autonomously learning from experience
directly in the real world, BADGR can learn about navigational affordances,
improve as it gathers more data, and generalize to unseen environments.
Although we believe BADGR is a promising step towards a fully automated,
self-improving navigation system, there are a number of open problems which
remain: how can the robot safely gather data in new environments? adapt online
as new data streams in? cope with non-static environments, such as humans
walking around? We believe that solving these and other challenges is crucial
for enabling robot learning platforms to learn and act in the real world.</p>
<p>This post is based on the following paper:</p>
<ul>
<li><a href="https://people.eecs.berkeley.edu/~gregoryk/">Gregory Kahn</a>, <a href="https://people.eecs.berkeley.edu/~pabbeel/">Pieter Abbeel</a>, <a href="https://people.eecs.berkeley.edu/~svlevine/">Sergey Levine</a><br />
<strong><a href="https://arxiv.org/pdf/2002.05700.pdf">BADGR: An Autonomous Self-Supervised Learning-Based Navigation System</a></strong><br />
<a href="https://sites.google.com/view/badgr">Website</a><br />
<a href="https://www.youtube.com/watch?v=EMV0zEXbcc4&feature=youtu.be">Video</a><br />
<a href="https://github.com/gkahn13/badgr">Code</a><br /></li>
</ul>
<p>I would like to thank Sergey Levine for feedback while writing this blog post.</p>
Thu, 12 Mar 2020 02:00:00 -0700
https://bairblog.github.io/2020/03/12/badgr/
https://bairblog.github.io/2020/03/12/badgr/
Speeding Up Transformer Training and Inference By <i>Increasing</i> Model Size
<meta name="twitter:title" content="Speeding Up Transformer Training By Increasing Model Size" />
<meta name="twitter:card" content="summary_image" />
<meta name="twitter:image" content="https://bair.berkeley.edu/static/blog/compress/Flowchart.png" />
<h1 id="model-training-can-be-slow">Model Training Can Be Slow</h1>
<p>In deep learning, using more compute (e.g., increasing model size, dataset
size, or training steps) often leads to higher accuracy. This is especially
true given the recent success of unsupervised pretraining methods like
<a href="https://arxiv.org/abs/1810.04805">BERT</a>, which can scale up training to very large models and datasets.
Unfortunately, large-scale training is very computationally expensive,
especially without the hardware resources of large industry research labs.
Thus, the goal in practice is usually to get high accuracy without exceeding
one’s hardware budget and training time.</p>
<p>For most training budgets, very large models appear impractical. Instead, the
go-to strategy for maximizing training efficiency is to use models with small
hidden sizes or few layers because these models run faster and use less memory.</p>
<!--more-->
<h1 id="larger-models-train-faster">Larger Models Train Faster</h1>
<p>However, in our <a href="https://arxiv.org/abs/2002.11794">recent paper</a>, we show that this common practice of
reducing model size is actually the opposite of the best compute-efficient
training strategy. Instead, when training <a href="https://arxiv.org/abs/1706.03762">Transformer</a> models on a budget,
you want to drastically <em>increase model size but stop training very early</em>. In
other words, we rethink the implicit assumption that models must be trained
<em>until convergence</em> by demonstrating the opportunity to increase model size
while sacrificing convergence.</p>
<p style="text-align:center;">
<img src="https://bair.berkeley.edu/static/blog/compress/Flowchart.png" />
<br />
<i>
</i>
</p>
<p>This phenomenon occurs because larger models converge to lower test error in
fewer gradient updates than smaller models. Moreover, this faster
convergence outpaces the extra computational cost of using larger models.
Consequently, when considering wall-clock training time, larger models achieve
higher accuracy faster.</p>
<p>We demonstrate this trend in the two training curves below. On the left, we
plot the validation error for pretraining <a href="https://arxiv.org/abs/1907.11692">RoBERTa</a>, a variant of BERT. The
deeper RoBERTa models achieve lower <a href="https://en.wikipedia.org/wiki/Perplexity">perplexity</a> than the shallower models
for a given wall clock time (our paper shows the same is true for wider
models). This trend also holds for machine translation. On the right, we plot
the validation BLEU score (higher is better) when training an English-French
Transformer machine translation model. The deeper and wider models achieve
higher BLEU score than smaller models given the same training time.</p>
<p style="text-align:center;">
<img src="https://bair.berkeley.edu/static/blog/compress/roberta_different_depths_wall_clock.png" width="400" />
<img src="https://bair.berkeley.edu/static/blog/compress/machine_translation_wall_clock.png" width="400" />
<br />
<i>
</i>
</p>
<p>Interestingly, for pretraining RoBERTa, increasing model width and/or depth
leads to faster training. For machine translation, wider models outperform
deeper models. We thus recommend trying increased width before increased depth.</p>
<p>We also recommend <em>increasing model size, not batch size</em>. Concretely, we
confirm that once the batch size is near a <a href="https://arxiv.org/abs/1812.06162">critical range</a>, increasing it
further provides only marginal improvements in wall-clock training time.
Thus, when under resource constraints, we recommend using a batch size inside
this critical region and then increasing model size.</p>
<h1 id="but-what-about-test-time">But What About Test Time?</h1>
<p>Although larger models are more <em>training-efficient</em>, they also increase the
computational and memory requirements of <em>inference</em>. This is problematic
because the total cost of inference is much larger than the cost of training
for most real-world applications. However, for RoBERTa, we show that this
trade-off can be reconciled with model compression. In particular, larger
models are more robust to model compression techniques than small models. Thus,
one can get the best of both worlds by <em>training very large models and then
heavily compressing them</em>.</p>
<p>We use the compression methods of quantization and pruning. Quantization stores
model weights in low precision formats; pruning sets certain neural network
weights to zero. Both methods can reduce the inference latency and memory
requirements of storing model weights.</p>
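<p>To make the two methods concrete, here is a minimal NumPy sketch of magnitude pruning and simulated (quantize-dequantize) uniform quantization on a single weight array; a real deployment would use a framework's compression tooling instead.</p>

```python
import numpy as np

def magnitude_prune(weights, sparsity):
    """Set the smallest-magnitude fraction `sparsity` of weights to zero."""
    k = int(sparsity * weights.size)
    if k == 0:
        return weights.copy()
    threshold = np.sort(np.abs(weights), axis=None)[k - 1]
    return np.where(np.abs(weights) <= threshold, 0.0, weights)

def fake_quantize(weights, num_bits=8):
    """Simulate uniform symmetric quantization: round each weight to one
    of 2**num_bits - 1 levels, then map back to floats."""
    scale = np.abs(weights).max() / (2 ** (num_bits - 1) - 1)
    return np.round(weights / scale) * scale
```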
<p>We first pretrain RoBERTa models of different sizes for the same <em>total
wall-clock time</em>. We then finetune these models on a downstream text
classification task (MNLI) and apply either pruning or quantization. We find
that the best models for a given test-time budget are the models that are
trained very large and then heavily compressed.</p>
<p>For example, consider the pruning results for the deepest model (orange curve
in the left Figure below). Without pruning the model, it reaches high accuracy
but uses about 200 million parameters (and thus lots of memory and compute).
However, this model can be heavily pruned (the points moving to the left along
the curve) without considerably hurting accuracy. This is in stark contrast to
the smaller models, e.g., the 6 layer model shown in pink, whose accuracy
heavily degrades after pruning. A similar trend occurs for quantization (right
Figure below). Overall, the best models for most test-time budgets (pick a point on
the x-axis) are the very large but heavily compressed ones.</p>
<p style="text-align:center;">
<img src="https://bair.berkeley.edu/static/blog/compress/wc_match_pruning_depth_only_accuracy_memory_plot.png" width="400" />
<img src="https://bair.berkeley.edu/static/blog/compress/wc_match_quantization_accuracy_memory_plot.png" width="400" />
<br />
<i>
</i>
</p>
<h1 id="conclusion">Conclusion</h1>
<p>We have shown that increasing Transformer model size can improve the efficiency
of training and inference, i.e., one should <em>Train Large, Then Compress</em>. This
finding leads to many other interesting questions such as <em>why</em> larger models
converge faster and compress better. In <a href="https://arxiv.org/abs/2002.11794">our paper</a>, we present initial
investigations into this phenomenon; however, future work is still required.
Moreover, our findings are currently specific to NLP—we would like to explore
how these conclusions generalize to other domains like computer vision.</p>
<p>Contact Eric Wallace on <a href="https://twitter.com/Eric_Wallace_">Twitter</a>. Thanks to Zhuohan Li, Kevin Lin, and
Sheng Shen for their feedback on this post.</p>
<p>See our paper “<a href="https://arxiv.org/abs/2002.11794">Train Large, Then Compress: Rethinking Model Size for Efficient
Training and Inference of Transformers</a>” by Zhuohan Li*, Eric Wallace*,
Sheng Shen*, Kevin Lin*, Kurt Keutzer, Dan Klein, and Joseph E. Gonzalez.</p>
Thu, 05 Mar 2020 01:00:00 -0800
https://bairblog.github.io/2020/03/05/compress/
https://bairblog.github.io/2020/03/05/compress/
Large Scale Training at BAIR with Ray Tune
<meta name="twitter:title" content="Large Scale Training at BAIR" />
<meta name="twitter:card" content="summary_image" />
<meta name="twitter:image" content="https://bair.berkeley.edu/static/blog/tune/tune-arch-simple.png" />
<p>In this blog post, we share our experiences in developing two critical software
libraries that many <a href="https://bair.berkeley.edu/">BAIR</a> researchers use to execute large-scale AI
experiments: <a href="https://ray.readthedocs.io/en/latest/tune.html">Ray Tune</a> and the <a href="https://ray.readthedocs.io/en/latest/autoscaling.html">Ray Cluster Launcher</a>, both of which now
back many popular open-source AI libraries.</p>
<p style="text-align:center;">
<img src="https://bair.berkeley.edu/static/blog/tune/naive-tuning.png" width="800" />
<br />
<i>
</i>
</p>
<p>As AI research becomes more compute-intensive, many AI researchers have become
squeezed for time and resources. Many researchers now rely on cloud providers
like Amazon Web Services or Google Cloud Platform to access the huge amounts
of computational resources necessary for training large models.</p>
<!--more-->
<h1 id="understanding-research-infrastructure">Understanding research infrastructure</h1>
<p>To put things into perspective, let’s first take a look at a standard machine
learning workflow in industry (Figure 1).</p>
<p style="text-align:center;">
<img src="https://bair.berkeley.edu/static/blog/tune/tune_blog_1.png" width="500" />
<br />
<i>
Figure 1 represents the typical ML model development workflow in
<b>industry</b>.
</i>
</p>
<p>The typical <em>research</em> workflow is actually a tight loop between steps 2 and 3, generalizing to something like
Figure 2.</p>
<p style="text-align:center;">
<img src="https://bair.berkeley.edu/static/blog/tune/tune_blog_2.png" width="500" />
<br />
<i>
Figure 2 represents the typical ML model development workflow
for <b>research</b>. The research workflow is typically a subsection of the
industry workflow.
</i>
</p>
<p>The research workflow is very much an iterative process and is often bottlenecked by
the <em>experiment execution</em> step (Figure 2, B). Typically, an “experiment”
consists of multiple training jobs, or “trials”, where each trial is a job that
trains a single model. Each trial might train a model using a different set of
configuration parameters (hyperparameters) or a different seed.</p>
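<p>In its simplest form, such an experiment can be sketched as a loop over a hyperparameter grid and seeds; <code>train_fn</code> here is a hypothetical stand-in for whatever trains a single model and reports its metric:</p>

```python
import itertools

def run_experiment(train_fn, param_grid, seeds):
    """An 'experiment' as a collection of trials: one training job per
    (hyperparameter configuration, seed) pair.

    train_fn: callable (params_dict, seed) -> metric for one trial.
    param_grid: list of hyperparameter dicts.
    """
    results = {}
    for params, seed in itertools.product(param_grid, seeds):
        key = (tuple(sorted(params.items())), seed)   # hashable trial identifier
        results[key] = train_fn(params, seed)         # run one trial
    return results
```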
<p>At Berkeley, we saw that AI researchers moving to the cloud spent a lot of time
writing their own experiment execution tools that wrangled cloud-provider APIs
for starting instances, setting up dependencies, and launching experiments.</p>
<p>Unfortunately, despite the vast amounts of time poured into developing these
tools, these ad-hoc solutions are typically limited in functionality:</p>
<ul>
<li>
<p><strong>Simplistic Architecture</strong>: Each trial is typically launched on a separate
node without any centralized control logic. This makes it difficult for
researchers to implement optimization techniques such as <a href="https://deepmind.com/blog/article/population-based-training-neural-networks">Population-based
Training</a> or <a href="https://github.com/fmfn/BayesianOptimization">Bayesian Optimization</a> that require coordination between
different runs.</p>
</li>
<li>
<p><strong>Lack of Fault Handling</strong>: The results of training jobs are lost forever if
an instance fails. Researchers often manually manage the failover by tracking
live experiments on a spreadsheet, but this is both time consuming and
error-prone.</p>
</li>
<li>
<p><strong>No Spot Instance Discount</strong>: The lack of fault-tolerance capabilities also
means forgoing the spot instance discounts (<a href="https://aws.amazon.com/ec2/spot/">up to 90%</a>) that cloud
providers offer.</p>
</li>
</ul>
<p>To sum up, instrumenting and managing distributed experiments on cloud
resources is both burdensome and difficult to get right. As such, easy-to-use
frameworks that bridge the gap between execution and research can greatly
accelerate the research process. Members from a couple of labs across BAIR
collaborated to build two complementary tools for AI experimentation in the
cloud:</p>
<p><a href="https://ray.readthedocs.io/en/latest/tune.html">Ray Tune</a>: a fault-tolerant framework for training and hyperparameter
tuning. Specifically, Ray Tune (or “Tune” for short):</p>
<ol>
<li>Coordinates among parallel jobs to enable <strong>parallel hyperparameter optimization</strong>.</li>
<li><strong>Automatically checkpoints and resumes</strong> training jobs in case of machine failures.</li>
<li>Offers many state-of-the-art hyperparameter search algorithms such as
<a href="https://ray.readthedocs.io/en/latest/tune-schedulers.html">Population-based Training and HyperBand</a>.</li>
</ol>
<p><a href="https://ray.readthedocs.io/en/latest/autoscaling.html">Ray Cluster Launcher</a>: a utility for managing resource provisioning and
cluster configurations across AWS, GCP, and Kubernetes.</p>
<h1 id="ray-tune">Ray Tune</h1>
<p>Tune was built to address the shortcomings of these ad-hoc experiment execution
tools. This was done by leveraging the <strong>Ray Actor API</strong> and adding <strong>failure
handling</strong>.</p>
<h2 id="actor-based-training">Actor-based Training</h2>
<p>Many techniques for hyperparameter optimization
require a framework that monitors the metrics of all concurrent training jobs
and controls the training execution. To address this, Tune uses a master-worker
architecture to centralize decision-making and communicates with its
distributed workers using the Ray Actor API.</p>
<p><em>What is the Ray Actor API?</em> Ray <a href="https://ray.readthedocs.io/en/latest/actors.html">provides an API</a> to create an “actor” from a Python
class. This enables classes and objects to be used in parallel and distributed
settings.</p>
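<p>To make the actor pattern concrete, the sketch below mimics it with
standard-library threads: an actor owns private state and processes method
calls one at a time from a mailbox, handing futures back to the caller. This is
an illustration of the concept only, not Ray's actual API or implementation
(Ray runs actors in separate worker processes via <code>@ray.remote</code>).</p>

```python
import threading
import queue
from concurrent.futures import Future

# Conceptual sketch of the actor pattern (not Ray's implementation):
# an actor owns private state and serially processes method calls
# arriving in its mailbox, returning futures to the caller.
class Actor:
    def __init__(self, cls, *args):
        self._obj = cls(*args)
        self._mailbox = queue.Queue()
        threading.Thread(target=self._run, daemon=True).start()

    def _run(self):
        while True:
            fut, name, args = self._mailbox.get()
            fut.set_result(getattr(self._obj, name)(*args))

    def call(self, name, *args):
        fut = Future()
        self._mailbox.put((fut, name, args))
        return fut  # caller can block with fut.result()

class Counter:
    def __init__(self):
        self.value = 0

    def increment(self):
        self.value += 1
        return self.value

counter = Actor(Counter)
futures = [counter.call("increment") for _ in range(3)]
results = [f.result() for f in futures]
print(results)  # state is updated serially by the actor: [1, 2, 3]
```

<p>Because all calls are funneled through one mailbox, the actor's state is
never touched concurrently, which is what lets a master process safely drive
many training workers.</p>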
<p style="text-align:center;">
<img src="https://bair.berkeley.edu/static/blog/tune/tune-arch-simple.png" width="500" />
<br />
<i>
</i>
</p>
<p>Tune uses a <em><a href="https://ray.readthedocs.io/en/latest/tune-usage.html#tune-training-api">Trainable</a></em> class interface to define an actor class
specifically for training models. This interface exposes methods such as
<code class="language-plaintext highlighter-rouge">_train</code>, <code class="language-plaintext highlighter-rouge">_stop</code>, <code class="language-plaintext highlighter-rouge">_save</code>, and <code class="language-plaintext highlighter-rouge">_restore</code>, which allow Tune to monitor
intermediate training metrics and kill low-performing trials.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">class</span> <span class="nc">NewTrainable</span><span class="p">(</span><span class="n">tune</span><span class="o">.</span><span class="n">Trainable</span><span class="p">):</span>
<span class="k">def</span> <span class="nf">_setup</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">config</span><span class="p">):</span>
<span class="o">...</span>
<span class="k">def</span> <span class="nf">_train</span><span class="p">(</span><span class="bp">self</span><span class="p">):</span>
<span class="s">"""Run 1 step of training (e.g., one epoch).
Returns:
A dict of training metrics.
"""</span>
<span class="o">...</span>
<span class="k">def</span> <span class="nf">_save</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">checkpoint_dir</span><span class="p">):</span>
<span class="o">...</span>
<span class="k">def</span> <span class="nf">_restore</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">checkpoint_path</span><span class="p">):</span>
<span class="o">...</span>
</code></pre></div></div>
<p>More importantly, by leveraging the Actor API, we can implement parallel
hyperparameter optimization methods in Tune such as <a href="https://ray.readthedocs.io/en/latest/tune-schedulers.html#asynchronous-hyperband">HyperBand</a> and Parallel
Bayesian Optimization, which were not possible with the ad-hoc experiment
execution tools researchers previously used.</p>
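<p>As a rough illustration of what such a scheduler does with these hooks,
here is a toy sketch of successive halving, the core idea behind HyperBand:
step all trials, keep the best half by the reported metric, and repeat. This
is an assumed simplification for illustration, not Tune's implementation, and
the <code>train_step</code> objective is invented.</p>

```python
# Illustrative sketch (not Tune's implementation) of successive halving:
# train all trials for a rung, keep the top half by metric, repeat.
def successive_halving(configs, train_step, steps_per_rung=1):
    trials = {cfg: 0.0 for cfg in configs}
    while len(trials) > 1:
        for cfg in trials:
            for _ in range(steps_per_rung):
                trials[cfg] = train_step(cfg, trials[cfg])
        # Keep the top half of trials by metric; stop the rest early.
        survivors = sorted(trials, key=trials.get, reverse=True)[: len(trials) // 2]
        trials = {cfg: trials[cfg] for cfg in survivors}
    return next(iter(trials))

# Toy objective (hypothetical): accuracy improves faster for lr near 0.01.
def train_step(lr, acc):
    return acc + 1.0 / (1.0 + abs(lr - 0.01) * 100)

best = successive_halving([0.001, 0.01, 0.1], train_step)
print(best)  # 0.01
```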
<h2 id="fault-tolerance">Fault Tolerance</h2>
<p>Cloud providers often offer “preemptible instances” (i.e.,
spot instances) at a significant discount. The steep discounts allow
researchers to lower their cloud computing costs significantly. However, the
downside is that the cloud provider can terminate or stop your machine at any
time, causing you to lose training progress.</p>
<p>To enable spot instance usage, we built Tune to automatically checkpoint and
resume training jobs across different machines in the cluster so that
experiments would be resilient to preemption and cluster resizing.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1"># Tune will resume training jobs from the last checkpoint
# even if machines are removed.
</span>
<span class="n">analysis</span> <span class="o">=</span> <span class="n">tune</span><span class="o">.</span><span class="n">run</span><span class="p">(</span>
<span class="n">NewTrainable</span><span class="p">,</span>
<span class="n">checkpoint_freq</span><span class="o">=</span><span class="mi">5</span><span class="p">,</span> <span class="c1"># Checkpoint every 5 epochs
</span> <span class="n">config</span><span class="o">=</span><span class="p">{</span><span class="s">"lr"</span><span class="p">:</span> <span class="n">tune</span><span class="o">.</span><span class="n">grid_search</span><span class="p">([</span><span class="mf">0.001</span><span class="p">,</span> <span class="mf">0.01</span><span class="p">,</span> <span class="mf">0.1</span><span class="p">])},</span>
<span class="p">)</span>
</code></pre></div></div>
<h3 id="how-does-it-work">How does it work?</h3>
<p style="text-align:center;">
<img src="https://bair.berkeley.edu/static/blog/tune/tune-ft-graphic.jpg" width="700" />
<br />
</p>
<p>If a node is lost while a training job (trial) is still executing on that
node and a checkpoint of the trial exists, Tune will wait until resources
become available and then begin executing the trial again.</p>
<p>If the trial is placed on a different node, Tune will automatically push the
previous checkpoint file to that node and restore the state, allowing the trial
to resume from the latest checkpoint <em>even after failure</em>.</p>
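<p>The recovery logic described above can be sketched in a few lines: save
state every few steps, and after a preemption restore the latest checkpoint
and continue rather than restarting from scratch. This is a deliberately
simplified sketch of the idea, not Tune's actual code.</p>

```python
import json
import os
import tempfile

# Simplified sketch of checkpoint-based recovery (not Tune's implementation):
# checkpoint every `checkpoint_freq` steps; on restart, resume from the
# latest checkpoint instead of step 0.
def run(total_steps, ckpt_path, checkpoint_freq=5, preempt_at=None):
    step = 0
    if os.path.exists(ckpt_path):  # resume if a checkpoint exists
        with open(ckpt_path) as f:
            step = json.load(f)["step"]
    while step < total_steps:
        step += 1  # one unit of training work
        if step % checkpoint_freq == 0:
            with open(ckpt_path, "w") as f:
                json.dump({"step": step}, f)
        if preempt_at is not None and step == preempt_at:
            return step  # machine lost mid-training
    return step

ckpt = os.path.join(tempfile.mkdtemp(), "ckpt.json")
run(20, ckpt, preempt_at=13)  # preempted at step 13; last checkpoint was step 10
resumed = run(20, ckpt)       # resumes from step 10, redoing only 3 steps
print(resumed)  # 20
```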
<h1 id="ray-cluster-launcher">Ray Cluster Launcher</h1>
<p>Above, we described the pains of wrangling cloud-provider APIs to automate the
cluster setup process. However, even with a tool for spinning up clusters,
researchers still have to go through a tedious workflow to run experiments:</p>
<p style="text-align:center;">
<img src="https://bair.berkeley.edu/static/blog/tune/training-experiment-steps.png" width="700" />
<br />
<i>
</i>
</p>
<p>To simplify this, we built the Ray Cluster Launcher, a tool <a href="https://ray.readthedocs.io/en/latest/autoscaling.html">for provisioning
and autoscaling resources</a> and starting a Ray cluster on AWS EC2, GCP, and
Kubernetes. We then abstracted all of the above steps for running an experiment
into <em>just a short configuration file and a single command</em>:</p>
<div class="language-yaml highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1"># CLUSTER.yaml</span>
<span class="na">cluster_name</span><span class="pi">:</span> <span class="s">tune-default</span>
<span class="na">provider</span><span class="pi">:</span> <span class="pi">{</span><span class="nv">type</span><span class="pi">:</span> <span class="nv">aws</span><span class="pi">,</span> <span class="nv">region</span><span class="pi">:</span> <span class="nv">us-west-2</span><span class="pi">}</span>
<span class="na">auth</span><span class="pi">:</span> <span class="pi">{</span><span class="nv">ssh_user</span><span class="pi">:</span> <span class="nv">ubuntu</span><span class="pi">}</span>
<span class="na">min_workers</span><span class="pi">:</span> <span class="m">0</span>
<span class="na">max_workers</span><span class="pi">:</span> <span class="m">2</span>
<span class="c1"># Deep Learning AMI (Ubuntu) Version 21.0</span>
<span class="na">head_node</span><span class="pi">:</span> <span class="pi">{</span>
<span class="nv">InstanceType</span><span class="pi">:</span> <span class="nv">c4.2xlarge</span><span class="pi">,</span>
<span class="nv">ImageId</span><span class="pi">:</span> <span class="nv">ami-0b294f219d14e6a82</span><span class="pi">}</span>
<span class="na">worker_nodes</span><span class="pi">:</span> <span class="pi">{</span>
<span class="nv">InstanceType</span><span class="pi">:</span> <span class="nv">c4.2xlarge</span><span class="pi">,</span>
<span class="nv">ImageId</span><span class="pi">:</span> <span class="nv">ami-0b294f219d14e6a82</span><span class="pi">}</span>
<span class="na">setup_commands</span><span class="pi">:</span> <span class="c1"># Set up each node.</span>
<span class="pi">-</span> <span class="s">pip install ray numpy pandas</span>
<span class="na">file_mounts</span><span class="pi">:</span> <span class="pi">{</span>
<span class="s1">'</span><span class="s">/home/ubuntu/files'</span><span class="pi">:</span><span class="s1">'</span><span class="s">my_files/'</span><span class="pi">,</span>
<span class="pi">}</span>
</code></pre></div></div>
<p>The below command starts a cluster, uploads and runs a script for distributed
hyperparameter tuning, then shuts down the cluster.</p>
<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nv">$ </span>ray submit CLUSTER.yaml <span class="nt">--start</span> <span class="nt">--stop</span> tune_experiment.py <span class="se">\</span>
<span class="nt">--args</span><span class="o">=</span><span class="s2">"--address=auto"</span>
</code></pre></div></div>
<p>Researchers are now using both Ray Tune and the Ray Cluster Launcher to launch
hundreds of parallel jobs at once across dozens of GPU machines. <a href="https://ray.readthedocs.io/en/latest/tune-distributed.html">The
Ray Tune documentation page for distributed experiments</a> shows how you
can do this too.</p>
<h1 id="putting-things-together">Putting things together</h1>
<p>Over the last year, we’ve been working with different groups across BAIR to
make it easier for researchers to leverage the cloud. We had to make Ray Tune
and the Ray Cluster Launcher general enough to support a large body of research
codebases, while keeping the barrier to entry low enough that anyone could try
them out in a couple of minutes.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1"># An example Ray Tune script for PyTorch.
</span><span class="kn">import</span> <span class="nn">os</span>
<span class="kn">import</span> <span class="nn">torch</span>
<span class="kn">import</span> <span class="nn">torch.optim</span> <span class="k">as</span> <span class="n">optim</span>
<span class="kn">from</span> <span class="nn">ray</span> <span class="kn">import</span> <span class="n">tune</span>
<span class="kn">from</span> <span class="nn">ray.tune.examples.mnist_pytorch</span> <span class="kn">import</span> <span class="p">(</span>
<span class="n">get_data_loaders</span><span class="p">,</span> <span class="n">ConvNet</span><span class="p">,</span> <span class="n">train</span><span class="p">,</span> <span class="n">test</span><span class="p">)</span>
<span class="k">class</span> <span class="nc">TrainMNIST</span><span class="p">(</span><span class="n">tune</span><span class="o">.</span><span class="n">Trainable</span><span class="p">):</span>
<span class="k">def</span> <span class="nf">_setup</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">config</span><span class="p">):</span>
<span class="bp">self</span><span class="o">.</span><span class="n">train_loader</span><span class="p">,</span> <span class="bp">self</span><span class="o">.</span><span class="n">test_loader</span> <span class="o">=</span> <span class="n">get_data_loaders</span><span class="p">()</span>
<span class="bp">self</span><span class="o">.</span><span class="n">model</span> <span class="o">=</span> <span class="n">ConvNet</span><span class="p">()</span>
<span class="bp">self</span><span class="o">.</span><span class="n">optimizer</span> <span class="o">=</span> <span class="n">optim</span><span class="o">.</span><span class="n">SGD</span><span class="p">(</span>
<span class="bp">self</span><span class="o">.</span><span class="n">model</span><span class="o">.</span><span class="n">parameters</span><span class="p">(),</span> <span class="n">lr</span><span class="o">=</span><span class="n">config</span><span class="o">.</span><span class="n">get</span><span class="p">(</span><span class="s">"lr"</span><span class="p">,</span> <span class="mf">0.01</span><span class="p">))</span>
<span class="k">def</span> <span class="nf">_train</span><span class="p">(</span><span class="bp">self</span><span class="p">):</span>
<span class="n">train</span><span class="p">(</span><span class="bp">self</span><span class="o">.</span><span class="n">model</span><span class="p">,</span> <span class="bp">self</span><span class="o">.</span><span class="n">optimizer</span><span class="p">,</span> <span class="bp">self</span><span class="o">.</span><span class="n">train_loader</span><span class="p">)</span>
<span class="n">acc</span> <span class="o">=</span> <span class="n">test</span><span class="p">(</span><span class="bp">self</span><span class="o">.</span><span class="n">model</span><span class="p">,</span> <span class="bp">self</span><span class="o">.</span><span class="n">test_loader</span><span class="p">)</span>
<span class="k">return</span> <span class="p">{</span><span class="s">"mean_accuracy"</span><span class="p">:</span> <span class="n">acc</span><span class="p">}</span>
<span class="k">def</span> <span class="nf">_save</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">checkpoint_dir</span><span class="p">):</span>
<span class="n">checkpoint_path</span> <span class="o">=</span> <span class="n">os</span><span class="o">.</span><span class="n">path</span><span class="o">.</span><span class="n">join</span><span class="p">(</span><span class="n">checkpoint_dir</span><span class="p">,</span> <span class="s">"model.pth"</span><span class="p">)</span>
<span class="n">torch</span><span class="o">.</span><span class="n">save</span><span class="p">(</span><span class="bp">self</span><span class="o">.</span><span class="n">model</span><span class="o">.</span><span class="n">state_dict</span><span class="p">(),</span> <span class="n">checkpoint_path</span><span class="p">)</span>
<span class="k">return</span> <span class="n">checkpoint_path</span>
<span class="k">def</span> <span class="nf">_restore</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">checkpoint_path</span><span class="p">):</span>
<span class="bp">self</span><span class="o">.</span><span class="n">model</span><span class="o">.</span><span class="n">load_state_dict</span><span class="p">(</span><span class="n">torch</span><span class="o">.</span><span class="n">load</span><span class="p">(</span><span class="n">checkpoint_path</span><span class="p">))</span>
<span class="n">analysis</span> <span class="o">=</span> <span class="n">tune</span><span class="o">.</span><span class="n">run</span><span class="p">(</span>
<span class="n">TrainMNIST</span><span class="p">,</span>
<span class="n">stop</span><span class="o">=</span><span class="p">{</span><span class="s">"training_iteration"</span><span class="p">:</span> <span class="mi">50</span><span class="p">},</span>
<span class="n">config</span><span class="o">=</span><span class="p">{</span><span class="s">"lr"</span><span class="p">:</span> <span class="n">tune</span><span class="o">.</span><span class="n">grid_search</span><span class="p">([</span><span class="mf">0.001</span><span class="p">,</span> <span class="mf">0.01</span><span class="p">,</span> <span class="mf">0.1</span><span class="p">])})</span>
<span class="k">print</span><span class="p">(</span><span class="s">"Best hyperparameters: "</span><span class="p">,</span> <span class="n">analysis</span><span class="o">.</span><span class="n">get_best_config</span><span class="p">(</span>
<span class="n">metric</span><span class="o">=</span><span class="s">"mean_accuracy"</span><span class="p">))</span>
<span class="c1"># Get a dataframe for analyzing trial results.
</span><span class="n">df</span> <span class="o">=</span> <span class="n">analysis</span><span class="o">.</span><span class="n">dataframe</span><span class="p">()</span>
</code></pre></div></div>
<p>Tune has grown to become a popular open-source project for hyperparameter
tuning. It is used in many other popular research projects, ranging from
<a href="https://github.com/arcelien/pba">Population-based Data Augmentation</a>, to
<a href="https://github.com/allenai/allentune">Hyperparameter Tuning for AllenNLP</a>, to <a href="https://medium.com/riselab/scalable-automl-for-time-series-prediction-using-ray-and-analytics-zoo-b79a6fd08139">AutoML in AnalyticsZoo</a>.</p>
<p>Numerous open-source research projects from BAIR now rely on the combination of
Ray Tune and the Ray Cluster Launcher to orchestrate and execute distributed
experiments, including <a href="https://github.com/rail-berkeley/softlearning">rail-berkeley/softlearning</a>,
<a href="https://github.com/HumanCompatibleAI/adversarial-policies">HumanCompatibleAI/adversarial-policies</a>, and <a href="https://github.com/flow-project/flow">flow-project/flow</a>.</p>
<p>So go ahead and <a href="https://ray.readthedocs.io/en/latest/tune-distributed.html">try Ray Tune and Cluster Launcher yourself</a>!</p>
<h1 id="links">Links</h1>
<ul>
<li>Previous BAIR blog posts about the Ray project:
<ul>
<li><a href="https://bair.berkeley.edu/blog/2018/01/09/ray/">Ray: A Distributed System for AI</a></li>
<li><a href="https://bair.berkeley.edu/blog/2018/12/12/rllib/">Scaling Multi-Agent Reinforcement Learning</a></li>
<li><a href="https://bair.berkeley.edu/blog/2019/10/14/functional-rl/">Functional RL with Keras and Tensorflow Eager</a></li>
</ul>
</li>
<li><a href="https://ray.io/">Ray website</a></li>
<li><a href="https://github.com/ray-project/ray">Ray GitHub page</a></li>
<li>Ray documentation:
<ul>
<li><a href="https://ray.readthedocs.io/en/latest/tune.html">Tune</a></li>
<li><a href="https://ray.readthedocs.io/en/latest/autoscaling.html">Ray Cluster Launcher</a></li>
<li><a href="https://ray.readthedocs.io/en/latest/">Ray-project landing page</a></li>
<li><a href="https://ray.readthedocs.io/en/latest/installation.html">Installation</a></li>
</ul>
</li>
</ul>
Thu, 16 Jan 2020 01:00:00 -0800
https://bairblog.github.io/2020/01/16/tune/
https://bairblog.github.io/2020/01/16/tune/Emergent Behavior by Minimizing Chaos<meta name="twitter:title" content="SMiRL: Surprise Minimizing RL in Dynamic Environments" />
<meta name="twitter:card" content="summary_image" />
<meta name="twitter:image" content="https://bair.berkeley.edu/static/blog/smirl/SMiRL_Outline.png" />
<p>All living organisms carve out environmental niches within which they can
maintain relative predictability amidst the ever-increasing entropy around them
<a href="http://www.ler.esalq.usp.br/aulas/lce1302/life_as_a_manifestation.pdf">(1)</a>,
<a href="https://www.fil.ion.ucl.ac.uk/~karl/The%20free-energy%20principle%20-%20a%20rough%20guide%20to%20the%20brain.pdf">(2)</a>.
Humans, for example, go to great lengths to shield themselves from surprise —
we band together in millions to build cities with homes, supplying water, food,
gas, and electricity to control the deterioration of our bodies and living
spaces amidst heat and cold, wind and storm. The need to discover and maintain
such surprise-free equilibria has driven great resourcefulness and skill in
organisms across very diverse natural habitats. Motivated by this, we ask:
could the motive of preserving order amidst chaos guide the automatic
acquisition of useful behaviors in artificial agents?</p>
<!--more-->
<p>How might an agent in an environment acquire complex behaviors and skills with
no external supervision? This central problem in artificial intelligence has
evoked several candidate solutions, largely focusing on novelty-seeking
behaviors
<a href="http://people.idsia.ch/~juergen/curioussingapore/curioussingapore.html">(3)</a>,
<a href="https://arxiv.org/abs/1606.01868">(4)</a>,
<a href="https://pathak22.github.io/noreward-rl/">(5)</a>. In simulated worlds,
such as video games, novelty-seeking intrinsic motivation can lead to
interesting and meaningful behavior. However, these environments may be
fundamentally lacking compared to the real world. In the real world, natural
forces and other agents offer bountiful novelty. Instead, the challenge in
natural environments is allostasis: discovering behaviors that enable agents to
maintain an equilibrium (homeostasis), for example to preserve their bodies,
their homes, and avoid predators and hunger. In the example below, an agent
experiences random events due to the changing weather. If the agent learns to
build a shelter, in this case a house, it will reduce the observed effects of
the weather.</p>
<p style="text-align:center;">
<img src="https://bair.berkeley.edu/static/blog/smirl/robotsurprise_stacked.png" width="600" />
<br />
</p>
<p>We formalize homeostasis as an objective for reinforcement learning based on
surprise minimization (SMiRL). In entropic and dynamic environments with
undesirable forms of novelty, minimizing surprise (i.e., minimizing novelty)
causes agents to naturally seek an equilibrium that can be stably maintained.</p>
<p style="text-align:center;">
<img width="600" src="https://bair.berkeley.edu/static/blog/smirl/SMiRL_Outline.png" />
<br />
</p>
<p>Here we show an illustration of the agent interaction loop using SMiRL. When
the agent observes a state $\mathbf{s}$, it computes the probability of this new state
given the belief the agent has $r_{t} \leftarrow p_{\theta_{t-1}}(\textbf{s})$.
This belief models the states the agent is most familiar with – i.e., the
distribution of states it has seen in the past. Experiencing states that are
more familiar will result in higher reward. After the agent experiences a new
state, it updates its belief $p_{\theta_{t-1}}(\textbf{s})$ over states to
include the most recent experience. Then, the goal of the action policy
$\pi(a|\textbf{s}, \theta_{t})$ is to choose actions that will result in the
agent consistently experiencing familiar states. Crucially, the agent
understands that its beliefs will change in the future. This means that it has
two mechanisms by which to maximize this reward: taking actions to visit
familiar states, and taking actions to visit states that will <em>change its
beliefs</em> such that future states are more familiar. It is this latter mechanism
that results in complex emergent behavior. Below, we visualize a policy trained
to play the game of Tetris. On the left the blocks the agent chooses are shown
and on the right is a visualization of $p_{\theta_{t}}(\textbf{s})$. We can see
how as the episode progresses the belief over possible locations to place
blocks tends to favor only the bottom row. This encourages the agent to
eliminate blocks to prevent the board from filling up.</p>
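<p>The reward computation described above can be sketched with a simple running
Gaussian belief over a scalar state feature. This toy density model is an
assumption for illustration; the actual work fits richer models (e.g.,
independent Gaussians over state features or VAEs) over high-dimensional
states.</p>

```python
import math

# Minimal sketch of the SMiRL reward: r_t = log p_{theta_{t-1}}(s), where
# the belief is a Gaussian fit to all previously visited states.
class SurpriseMinimizer:
    def __init__(self):
        self.states = []

    def reward(self, s):
        # Log-likelihood of s under the belief fit to past states.
        if not self.states:
            return 0.0  # no belief yet (arbitrary convention for this sketch)
        n = len(self.states)
        mu = sum(self.states) / n
        var = sum((x - mu) ** 2 for x in self.states) / n + 1e-6
        return -0.5 * (math.log(2 * math.pi * var) + (s - mu) ** 2 / var)

    def update(self, s):
        self.states.append(s)  # fold the new state into the belief

smirl = SurpriseMinimizer()
for s in [0.0, 0.1, -0.1, 0.05]:
    r = smirl.reward(s)  # reward uses the belief from *before* this state
    smirl.update(s)

# A familiar state scores a higher reward than a surprising one.
print(smirl.reward(0.0) > smirl.reward(5.0))  # True
```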
<p style="text-align:center;">
<img height="200" src="https://bair.berkeley.edu/static/blog/smirl/tetris_ps.gif" />
<img height="220" src="https://bair.berkeley.edu/static/blog/smirl/minigrid-maze-random-count.gif" />
<br />
<i>
Left: Tetris. Right: HauntedHouse.
</i>
</p>
<h4 id="emergent-behavior">Emergent behavior</h4>
<p>The SMiRL agent demonstrates meaningful emergent behaviors in a number of
different environments. In the Tetris environment, the agent is able to learn
proactive behaviors to eliminate rows and properly play the game. The agent
also learns emergent game playing behavior in the VizDoom environment,
acquiring an effective policy for dodging the fireballs thrown by the enemies.
In both of these environments, stochastic and chaotic events force the SMiRL
agent to take a coordinated course of action to avoid unusual states, such as
full Tetris boards or fireball explosions.</p>
<p style="text-align:center;">
<img width="45%" src="https://bair.berkeley.edu/static/blog/smirl/Doom_trained_enough_result.gif" />
<img width="45%" src="https://bair.berkeley.edu/static/blog/smirl/vizdoom_dtl.gif" />
<br />
<i>
Left: Doom Hold The Line. Right: Doom Defend The Line.
</i>
</p>
<h5 id="biped">Biped</h5>
<p>In the Cliff environment, the agent learns a policy that greatly reduces the
probability of falling off the cliff by bracing against the ground and
stabilizing itself at the edge, as shown in the figure below. In the <em>Treadmill</em>
environment, SMiRL learns a more complex locomotion behavior, jumping forward
to increase the time it stays on the treadmill, also shown below.</p>
<p style="text-align:center;">
<video width="320" height="240" style="margin: 10px;" autoplay="">
<source src="https://bair.berkeley.edu/static/blog/smirl/cliff_surpise_VAE_6_v3_rewardViz.mp4" type="video/mp4" /> <source src="movie.ogg" type="video/ogg" /> Your browser does not support the video tag.
</video>
<video width="320" height="240" style="margin: 10px;" autoplay="">
<source src="https://bair.berkeley.edu/static/blog/smirl/treadmill_surpise_VAE_6_v3_rewardViz.mp4" type="video/mp4" /> <source src="movie.ogg" type="video/ogg" /> Your browser does not support the video tag.
</video>
<br />
<i>
Left: Cliff. Right: Treadmill.
</i>
</p>
<h4 id="comparison-to-intrinsic-motivation">Comparison to Intrinsic motivation:</h4>
<p>Intrinsic motivation is the idea that behavior is driven by internal reward
signals that are task independent. Below, we show plots of the
environment-specific rewards over time on Tetris, VizDoomTakeCover, and the
humanoid domains. In order to compare SMiRL to more standard intrinsic
motivation methods, which seek out states that maximize surprise or novelty, we
also evaluated ICM <a href="https://pathak22.github.io/noreward-rl/">(5)</a> and
RND <a href="https://arxiv.org/abs/1810.12894">(6)</a>. We include an oracle agent
that directly optimizes the task reward. On Tetris, after training for $2000$
epochs, SMiRL achieves near perfect play, on par with the oracle reward
optimizing agent, with no deaths. ICM seeks novelty by creating more and more
distinct patterns of blocks rather than clearing them, leading to deteriorating
game scores over time. On VizDoomTakeCover, SMiRL effectively learns to dodge
fireballs thrown by the adversaries.</p>
<p style="text-align:center;">
<img width="90%" src="https://bair.berkeley.edu/static/blog/smirl/video_game_comparisons_2.png" />
<br />
</p>
<p>The baseline comparisons for the Cliff and Treadmill environments have a
similar outcome. The novelty-seeking behavior of ICM leads to irregular
behavior: the agent jumps off the Cliff and rolls around on the Treadmill,
maximizing the variety (and quantity) of falls.</p>
<h4 id="smirl--curiosity">SMiRL + Curiosity:</h4>
<p style="text-align:center;">
<img width="90%" src="https://bair.berkeley.edu/static/blog/smirl/Capture_biped_results.png" />
<br />
</p>
<p>While on the surface SMiRL minimizes surprise and curiosity approaches like
ICM maximize novelty, the two are in fact not mutually exclusive. In
particular, while ICM maximizes novelty with respect to a learned transition
model, SMiRL minimizes surprise with respect to a learned state distribution.
We can combine ICM and SMiRL to achieve even better results on the Treadmill
environment.</p>
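<p>One simple way to picture the combination is a weighted sum of the two
intrinsic signals: the familiarity reward from SMiRL's state distribution plus
a novelty bonus from ICM's transition-model prediction error. The additive
form and the weight <code>alpha</code> here are assumptions for illustration,
not the paper's exact formulation.</p>

```python
# Hedged sketch of combining the two intrinsic signals (illustrative only):
# log_p_state: log-likelihood of the state under the agent's belief (SMiRL)
# icm_bonus:   prediction error of the learned transition model (ICM novelty)
def combined_reward(log_p_state, icm_bonus, alpha=0.5):
    return log_p_state + alpha * icm_bonus

print(combined_reward(-1.0, 2.0))  # -1.0 + 0.5 * 2.0 = 0.0
```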
<p style="text-align:center;">
<video width="320" height="240" style="margin: 10px;" autoplay="">
<source src="https://bair.berkeley.edu/static/blog/smirl/treadmill_surpise_ICM_v3_rewardViz.mp4" type="video/mp4" /> <source src="movie.ogg" type="video/ogg" /> Your browser does not support the video tag.
</video>
<video width="320" height="240" style="margin: 10px;" autoplay="">
<source src="https://bair.berkeley.edu/static/blog/smirl/pedistal_surpise_v3_rewardViz.mp4" type="video/mp4" /> <source src="movie.ogg" type="video/ogg" /> Your browser does not support the video tag.
</video>
<br />
<i>
Left: Treadmill+ICM. Right: Pedestal.
</i>
</p>
<h4 id="insights">Insights:</h4>
<p>The key insight utilized by our method is that, in contrast to simple simulated
domains, realistic environments exhibit dynamic phenomena that gradually
increase entropy over time. An agent that resists this growth in entropy must
take active and coordinated actions, thus learning increasingly complex
behaviors. This is different from commonly proposed intrinsic exploration
methods based on novelty, which instead seek to visit novel states and increase
entropy. SMiRL holds promise for a new kind of unsupervised RL method that
produces behaviors that are closely tied to the prevailing disruptive forces,
adversaries, and other sources of entropy in the environment.</p>
<ul>
<li>Glen Berseth, Daniel Geng, Coline Devin, Chelsea Finn, Dinesh Jayaraman, Sergey Levine. <br />
<a href="https://arxiv.org/abs/1912.05510">SMiRL: Surprise Minimizing RL in Dynamic Environments</a> <br />
<a href="https://sites.google.com/view/surpriseminimization">Project Website</a></li>
</ul>
Wed, 18 Dec 2019 01:00:00 -0800
https://bairblog.github.io/2019/12/18/smirl/
https://bairblog.github.io/2019/12/18/smirl/What is My Data Worth?<meta name="twitter:title" content="What is My Data Worth?" />
<meta name="twitter:card" content="summary_image" />
<meta name="twitter:image" content="https://bair.berkeley.edu/static/blog/data-worth/1.png" />
<p>People give massive amounts of their personal data to companies every day,
and these data are used to generate tremendous business value. Some
<a href="https://www.gsb.stanford.edu/insights/how-much-your-private-data-worth-who-should-own-it">economists</a>
and
<a href="https://www.cnbc.com/2019/10/17/andrew-yang-facebook-amazon-google-should-pay-for-users-data.html">politicians</a>
argue that people should be paid for their contributions—but the
million-dollar question is: by how much?</p>
<p>This article discusses methods proposed in our recent
<a href="https://arxiv.org/pdf/1902.10275.pdf">AISTATS</a> and
<a href="https://arxiv.org/pdf/1908.08619.pdf">VLDB</a> papers that attempt to answer this
question in the machine learning context. This is joint work with David Dao,
Boxin Wang, Frances Ann Hubis, Nezihe Merve Gurel, Nick Hynes, Bo Li, Ce Zhang,
Costas J. Spanos, and Dawn Song, as well as a collaborative effort between UC
Berkeley, ETH Zurich, and UIUC. More information about the work in our group
can be found <a href="https://sunblaze-ucb.github.io/privacy/">here</a>.</p>
<!--more-->
<h1 id="what-are-the-existing-approaches-to-data-valuation">What are the existing approaches to data valuation?</h1>
<p>Various ad-hoc data valuation schemes have been studied in the literature, and
some of them have been deployed in existing data marketplaces. From a
practitioner’s point of view, they can be grouped into three categories:</p>
<ul>
<li>
<p><strong>Query-based pricing</strong> attaches values to user-initiated queries. One simple
example is to set the price based on the number of queries allowed during a
time window. <a href="https://homes.cs.washington.edu/~suciu/file07_paper.pdf">Other more sophisticated
examples</a> attempt to
adjust the price to some specific criteria, such as arbitrage avoidance.</p>
</li>
<li>
<p><a href="https://www.ideals.illinois.edu/bitstream/handle/2142/73449/207_ready.pdf?sequence=2"><strong>Data attribute-based
pricing</strong></a>
constructs a price model that takes into account various parameters, such as
data age, credibility, potential benefits, etc. The model is trained to match
market prices released in public registries.</p>
</li>
<li>
<p><a href="https://ieeexplore.ieee.org/abstract/document/5466993"><strong>Auction-based
pricing</strong></a> designs
auctions that dynamically set the price based on bids offered by buyers and
sellers.</p>
</li>
</ul>
<p>However, existing data valuation schemes do not take into account the following
important desiderata:</p>
<ul>
<li>
<p><strong>Task-specificness</strong>: The value of data depends on the task it helps to
fulfill. For instance, if Alice’s medical record indicates that she has
disease A, then her data will be more useful to predict disease A as opposed
to other diseases.</p>
</li>
<li>
<p><strong>Fairness</strong>: The quality of data from different sources varies dramatically.
In the worst-case scenario, adversarial data sources may even degrade model
performance via data poisoning attacks. Hence, the data value should reflect
the efficacy of data by assigning high values to data which can notably
improve the model’s performance.</p>
</li>
<li>
<p><strong>Efficiency</strong>: Practical machine learning tasks may involve thousands or
even billions of data contributors; thus, data valuation techniques should be
capable of scaling up.</p>
</li>
</ul>
<p>With the desiderata above, we now discuss a principled notion of data value and
computationally efficient algorithms for data valuation.</p>
<h1 id="what-would-be-a-good-notion-for-data-value">What would be a good notion for data value?</h1>
<p>Due to the task-specific nature of data value, it should depend on the utility
of the machine learning model trained on the data. Suppose the machine learning
model generates a specific amount of profit. Then, we can reduce the data
valuation problem to a profit allocation problem, which splits the total
utility of the machine learning model between different data sources. Indeed,
it is a well-studied problem in cooperative game theory to fairly allocate
profits created by collective efforts. The most prominent profit allocation
scheme is the Shapley value. The Shapley value attaches a real-valued number to
each player in the game to indicate the relative importance of their
contributions. Specifically, for $N$ players, the Shapley value of player
$i$ ($i\in I=\{1,\ldots,N\}$) is defined as</p>
<script type="math/tex; mode=display">s_i = \sum_{S\subseteq I\setminus\{i\}} \frac{1}{N{N-1\choose |S|}}[U(S\cup \{i\})-U(S)]</script>
<p>where $U(S)$ is the utility function that evaluates the worth of the player
subset $S$. In the definition above, the difference in the bracket measures how
much the payoff increases when player $i$ is added to a particular subset $S$;
thus, the Shapley value measures the average contribution of player $i$ to
every possible group of other players in the game.</p>
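To make the definition concrete, here is a minimal Python sketch that computes exact Shapley values by enumerating every subset of the other players. The three-player "glove game" utility is a standard textbook example of our own choosing, not one from the papers discussed here.

```python
from itertools import combinations
from math import comb

def shapley_values(n, utility):
    """Exact Shapley values for an n-player game by enumerating all subsets.

    `utility` maps a frozenset of player indices to a real number.
    Exponential in n, so only feasible for toy examples.
    """
    values = []
    for i in range(n):
        others = [p for p in range(n) if p != i]
        s_i = 0.0
        for size in range(len(others) + 1):
            for subset in combinations(others, size):
                S = frozenset(subset)
                # Weight 1 / (N * C(N-1, |S|)) from the definition above.
                s_i += (utility(S | {i}) - utility(S)) / (n * comb(n - 1, size))
        values.append(s_i)
    return values

# Toy "glove game": player 0 owns a left glove, players 1 and 2 each own a
# right glove; a coalition earns 1 if it holds a matched pair.
u = lambda S: 1.0 if 0 in S and (1 in S or 2 in S) else 0.0
vals = shapley_values(3, u)  # player 0 gets 2/3; players 1 and 2 get 1/6 each
```

Group rationality is easy to check here: the three values sum to the utility of the full coalition, which is 1.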
<p>Relating these game theoretic concepts to the problem of data valuation, one
can think of the players as training data sources, and accordingly, the utility
function $U(S)$ as a performance measure of the model trained on the subset $S$
of training data. Thus, the Shapley value can be used to determine the value of
each data source. The Shapley value is appealing because it is the <em>only</em>
profit allocation scheme that satisfies the following properties:</p>
<ul>
<li>
<p><strong>Group rationality</strong>: the total utility of the machine learning model is
completely split between different data sources, i.e., $\sum_{i=1}^N s_i =
U(I)$. This is a natural requirement because data contributors would expect
the total benefit to be fully distributed.</p>
</li>
<li>
<p><strong>Fairness</strong>: Two data sources that have identical contributions to the model
utility should have the same value; moreover, data sources with zero
contributions to all subsets of the dataset should not receive any payoff.</p>
</li>
<li>
<p><strong>Additivity</strong>: The values under multiple utilities add up to the value under
a utility that is the sum of all these utilities. This property generalizes
the data valuation for a single task to multiple tasks. Specifically, if each
task is associated with a utility function as the performance measure, with
the additivity property, we can calculate the multi-task data value by simply
computing the Shapley value with respect to the aggregated utility function.</p>
</li>
</ul>
<p>Because the Shapley value uniquely satisfies the aforementioned properties and
naturally leads to a payoff scheme dependent on the underlying task, we employ
the Shapley value as a data value notion. While the outlined concept appears
plausible, it has some fundamental challenges: computing the Shapley value, in
general, requires evaluating the utility function an exponential number of
times; even worse, each utility evaluation means re-training the model
in the machine learning context. This is clearly intractable even for a small
dataset. Interestingly, by focusing on the machine learning context, some
opportunities arise to address the scalability challenge. Next, we show that
for the K-nearest neighbors (KNN) classification, one can obviate the need to
re-train models and compute the Shapley value in quasi-linear time—an
exponential improvement in computational efficiency!</p>
<h1 id="efficient-algorithms-for-knn">Efficient algorithms for KNN</h1>
<p>To understand why KNN is amenable to efficient data valuation, we consider
$K=1$ and investigate the following simple utility function defined for 1NN:
$U(S)=1$ if the label of a test point is correctly predicted by its nearest
neighbor in $S$ and $0$ otherwise. For a given test point, the utility of a set
is completely determined by the nearest neighbor in this set to the test point.
Thus, the contribution of the point $i$ to a subset $S$ is zero if the nearest
neighbor in $S$ is closer to the test point than $i$. When we re-examine the
Shapley value, we observe that for many $S$, $U(S\cup\{i\})-U(S)=0$.
Figure 1 illustrates an example of such an $S$. This simple example shows the
computational requirement of the Shapley value can be significantly reduced for
KNN.</p>
<p style="text-align:center;">
<img src="https://bair.berkeley.edu/static/blog/data-worth/1.png" width="500" />
<br />
<i>
Figure 1: Illustration of why KNN is amenable to efficient Shapley value
computation.
</i>
</p>
<p>For a given test point $(x_\text{test},y_\text{test})$, we let $\alpha_k(S)$
denote the $k$th nearest neighbor in $S$ to the test point. Consider the
following utility function that measures the likelihood of predicting the right
label of a particular test point for KNN:</p>
<script type="math/tex; mode=display">U(S) = \frac{1}{K}\sum_{k=1}^{\min\{K,|S|\}}\mathbb{I}[y_{\alpha_k(S)}=y_\text{test}]</script>
<p>Now assume that the training data is sorted according to their similarity to
the test point. We develop a simple recursive algorithm to compute the Shapley
value of all training points from the furthest neighbor of the test point to
the nearest one. Let $\mathbb{I}[\cdot]$ represent the indicator function.
Then, the algorithm proceeds as follows:</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{align*}
&s_N = \frac{\mathbb{I}[y_N=y_\text{test}]}{N}\\
&s_i = s_{i+1} + \frac{\mathbb{I}[y_i=y_\text{test}]-\mathbb{I}[y_{i+1}=y_\text{test}]}{K} \frac{\min\{K,i\}}{i}
\end{align*} %]]></script>
<p>This algorithm can be extended to the case where the utility is defined as the
likelihood of predicting the right labels for multiple test points. With the
additivity property, the Shapley value for multiple test points is the sum of
the Shapley value for every test point. The computational complexity is
$\mathcal{O}(N_\text{test}\,N\log N)$ for $N$ training points and $N_\text{test}$
test points—this is simply the complexity of a sorting algorithm, run once per test point!</p>
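The recursion above is straightforward to implement. The following Python sketch is our own illustration of the algorithm, not the authors' released code: it sorts the training points by distance to the test point and applies the base case and update rule.

```python
import numpy as np

def knn_shapley(X_train, y_train, x_test, y_test, K):
    """Exact Shapley value of each training point under the KNN utility for a
    single test point, via the quasi-linear recursion described above."""
    N = len(y_train)
    # Sort training points by distance to the test point, nearest first.
    order = np.argsort(np.linalg.norm(X_train - x_test, axis=1))
    y = y_train[order]
    s = np.zeros(N)
    # Base case: the farthest training point.
    s[N - 1] = float(y[N - 1] == y_test) / N
    # Recurse from the second-farthest point toward the nearest neighbor.
    for i in range(N - 2, -1, -1):  # i is a 0-based rank; the math uses i+1
        s[i] = s[i + 1] + (float(y[i] == y_test) - float(y[i + 1] == y_test)) / K \
            * min(K, i + 1) / (i + 1)
    # Undo the sort so values align with the original training order.
    out = np.empty(N)
    out[order] = s
    return out
```

By the group rationality property, the returned values sum to the KNN utility of the full training set, which makes a handy sanity check.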
<p>We can also develop a similar recursive algorithm to compute the Shapley value
for KNN regression. Moreover, in some applications, such as document retrieval,
test points could arrive sequentially and the value of each training point
needs to be updated and accumulated on the fly, which makes it impossible to
complete the sorting offline. However, sorting a large, high-dimensional dataset
in an online manner is expensive. To address the scalability
challenge in the online setting, we develop an approximation algorithm to
compute the Shapley value for KNN with improved efficiency. The efficiency
boost is achieved by utilizing locality-sensitive hashing to circumvent the
need of sorting. More details of these extensions can be found in <a href="https://arxiv.org/pdf/1908.08619.pdf">our
paper</a>.</p>
<h1 id="improving-the-efficiency-for-other-ml-models">Improving the efficiency for other ML models</h1>
<p>The Shapley value for KNN is efficient to compute due to the special locality
structure of KNN. For general machine learning models, exact computation of the
Shapley value is prohibitively expensive. To address this challenge, prior work often resorts
to Monte Carlo-based approximation algorithms. The central idea behind these
approximation algorithms is to treat the Shapley value of a training point as
its expected contribution to a random subset and use the sample average to
approximate the expectation. By the definition of the Shapley value, the random
set has size $0$ to $N-1$ with equal probability (corresponding to the $1/N$
factor) and is also equally likely to be any subset of a given size
(corresponding to the $1/{N-1\choose |S|}$ factor). In practice, one can
implement an equivalent sampler by drawing a random permutation of the training
set. Then, the approximation algorithm proceeds by computing the marginal
utility of a point to the points preceding it and averaging the marginal
utilities across different permutations. This was the state-of-the-art method
to estimate the Shapley value for general utility functions (referred to as the
baseline approximation later). To assess the performance of an approximation
algorithm, we can look at the number of utility evaluations needed to achieve
some guarantees of the approximation error. Using Hoeffding’s bound, it can be
proved that the baseline approximation algorithm above needs
$\mathcal{O}(N^2\log N)$ utility evaluations so that the squared error between the estimated and
the ground truth Shapley value is bounded with high probability. Can we reduce
the number of utility evaluations while maintaining the same approximation
error guarantee?</p>
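For reference, the baseline permutation-sampling estimator described above can be sketched in a few lines. Here `utility` is a placeholder mapping a tuple of training indices to a real number; in the machine learning setting, each call would retrain and evaluate a model.

```python
import numpy as np

def shapley_permutation_mc(n, utility, num_perms=1000, seed=None):
    """Baseline Monte Carlo Shapley estimator via random permutations.

    For each sampled permutation, a point's marginal contribution is the
    utility gain from adding it to the points preceding it; averaging these
    marginals over permutations estimates the Shapley value.
    """
    rng = np.random.default_rng(seed)
    s = np.zeros(n)
    for _ in range(num_perms):
        perm = rng.permutation(n)
        prev_u = utility(())
        for k, i in enumerate(perm):
            cur_u = utility(tuple(perm[: k + 1]))
            s[i] += cur_u - prev_u
            prev_u = cur_u
    return s / num_perms
```

Because the marginal contributions along any single permutation telescope, the estimates always sum exactly to the utility of the full set, even with few samples.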
<p>We developed an approximation algorithm that requires only $\mathcal{O}(N(\log
N)^2)$ utility evaluations by utilizing the information sharing between
different random samples.
The key idea is that if a
data point has a high value, it tends to boost the utility of all subsets
containing it. This inspires us to draw some random subsets and record the
presence of each training point in these randomly selected subsets. Denote the
appearance of the $i$th and $j$th training points by $\beta_i$ and $\beta_j$,
respectively. We can design the distribution of the random subsets so that the
expectation of $(\beta_i-\beta_j)U(\beta_1,\ldots,\beta_N)$ equals
$s_i-s_j$. Picking an anchor point, say, $s_1$, we can use the sample average
of $(\beta_i-\beta_1)U(\beta_1,\ldots,\beta_N)$ for each $i=2,\ldots,N$ to
estimate the difference in Shapley value between every other training point and the anchor.
Then, we can simply perform a few more utility evaluations to estimate $s_1$,
which allows us to recover the Shapley value of all other points. More details
of this algorithm can be found in <a href="https://arxiv.org/pdf/1902.10275.pdf">our
paper</a>. Since this algorithm computes the
Shapley value by simply examining the utility of groups of data, we will refer
to this algorithm as the group testing-based approximation hereinafter. Our
paper also discusses even more efficient ways to estimate the Shapley value
when new assumptions can be made, such as the sparsity of the Shapley values
and the stability of the underlying learning algorithm.</p>
<h1 id="experiments">Experiments</h1>
<p>First, we demonstrate the efficiency of the proposed method to compute the
exact Shapley value for KNN. We benchmark the runtime using a 2.6 GHz Intel
Core i7 CPU and compare the exact algorithm with the baseline Monte Carlo
approximation. Figure 2(a) shows that the Monte Carlo estimate of the Shapley
value for each training point converges to the result of the exact algorithm
with enough simulations, confirming the correctness of our exact algorithm.
More importantly, the exact algorithm is several orders of magnitude faster
than the baseline approximation, as shown in Figure 2(b).</p>
<p style="text-align:center;">
<!--
<img src="https://bair.berkeley.edu/static/blog/data-worth/2_a.png" width="300">
<img src="https://bair.berkeley.edu/static/blog/data-worth/2_b.png" width="300">
-->
<img src="https://bair.berkeley.edu/static/blog/data-worth/figure_2.png" width="" />
<br />
<i>
Figure 2: (a) The Shapley value produced by our proposed exact approach and the
baseline Monte-Carlo approximation algorithm for the KNN classifier constructed
with 1000 randomly selected training points from MNIST. (b) Runtime comparison
of the two approaches as the training size increases.
</i>
</p>
<p>With the proposed algorithm, we can, for the first time, compute data values
for a database of practical scale. Figure 3 illustrates the result of a
large-scale experiment using the KNN Shapley value. We take 1.5 million images
with pre-calculated features and labels from the Yahoo Flickr Creative Commons
100 Million (YFCC100M) dataset. We observe that the KNN Shapley value is
intuitive—the top-valued images are semantically correlated with the
corresponding test image. This experiment takes only a few seconds per test
image on a single CPU and can be parallelized for a large test set.</p>
<p style="text-align:center;">
<img src="https://bair.berkeley.edu/static/blog/data-worth/3.png" width="" />
<br />
<i>
Figure 3: Data valuation using KNN classifiers (K = 10) on 1.5 million images
(all images with pre-calculated deep feature representations in the Yahoo100M
dataset).
</i>
</p>
<p>Similarly, Figure 4(a) demonstrates the accuracy of our proposed group
testing-based approximation and Figure 4(b) shows that the group testing-based
approximation outperforms the baseline approximation by several orders of
magnitude for a large number of data points.</p>
<p style="text-align:center;">
<!--
<img src="https://bair.berkeley.edu/static/blog/data-worth/4_a.png" width="350">
<img src="https://bair.berkeley.edu/static/blog/data-worth/4_b.png" width="350">
-->
<img src="https://bair.berkeley.edu/static/blog/data-worth/figure_4.png" width="" />
<br />
<i>
Figure 4: (a) The Shapley value produced by our proposed group testing-based
approximation and the baseline approximation algorithm for a logistic
regression classifier trained on the Iris dataset. (b) Runtime comparison of
the two approaches.
</i>
</p>
<p>We also perform experiments to demonstrate the utility of the Shapley value
beyond data marketplace applications. Since the Shapley value tells us how
useful a data point is for a machine learning task, we can use it to identify
the low-quality or even adversarial data points in the training set. As a
simple example, we artificially create a training set with half of the data
directly from MNIST and the other half perturbed with random noise. In
Figure 5, we compare the Shapley value between normal and noisy data as the
noise ratio becomes higher. The figure shows that the Shapley value can be used
to effectively detect noisy training data.</p>
<p style="text-align:center;">
<img src="https://bair.berkeley.edu/static/blog/data-worth/5.png" width="400" />
<br />
<i>
Figure 5: The Shapley value of normal and noisy training data as the noise
magnitude becomes higher.
</i>
</p>
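As a toy illustration of this use case: once per-point Shapley values are in hand (computed by any of the algorithms above), flagging suspect data reduces to inspecting the lowest-valued fraction of the training set. The helper below is a simple heuristic of our own, not a procedure from the papers.

```python
import numpy as np

def flag_low_value_points(shapley_vals, frac=0.1):
    """Return indices of the lowest-valued fraction of training points --
    candidates for noisy or adversarial data."""
    k = max(1, int(len(shapley_vals) * frac))
    # argsort is ascending, so the first k indices have the lowest values.
    return np.argsort(shapley_vals)[:k]
```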
<p>The Shapley value can also be used to understand adversarial training, which is
an effective method for improving the adversarial robustness of a model by
introducing adversarial examples into the training dataset. In practice, we
measure the robustness in terms of the test accuracy on a dataset containing
adversarial examples. We expect that the adversarial examples in the training
dataset become more valuable as more adversarial examples are added into the
test dataset. Based on MNIST, we construct a training dataset that contains
both benign and adversarial examples and synthesize test datasets with
different adversarial-benign mixing ratios. Two popular attack algorithms,
namely, the <a href="https://arxiv.org/abs/1412.6572">fast gradient sign method</a> (FGSM)
and the <a href="https://arxiv.org/abs/1705.07263">iterative attack</a> (CW) are used to
generate adversarial examples. Figure 6(a) and (b) compare the average Shapley
value for adversarial examples and for benign examples in the training dataset.
The negative test loss for logistic regression is used as the utility function.
We see that the Shapley value of adversarial examples increases as the test
data becomes more adversarial; in contrast, the Shapley value of benign
examples decreases. In addition, the adversarial examples in the training set
are more valuable if they are generated by the same attack algorithm used at
test time.</p>
<p style="text-align:center;">
<img src="https://bair.berkeley.edu/static/blog/data-worth/6.png" width="" />
<br />
<i>
Figure 6: Comparison of the Shapley value of benign and adversarial examples.
FGSM and CW are different attack algorithms used for generating adversarial
examples in the test dataset: (a) (resp. (b)) is trained on Benign+FGSM (resp.
CW) adversarial examples.
</i>
</p>
<h1 id="conclusion">Conclusion</h1>
<p>We hope that our approaches for data valuation provide the theoretical and
computational tools to facilitate data collection and dissemination in future
data marketplaces. Beyond data markets, the Shapley value is a versatile tool
for machine learning practitioners; for instance, it can be used for selecting
features or interpreting black-box model predictions. Our algorithms can also
be applied to mitigate the computational challenges in these important
applications.</p>
Mon, 16 Dec 2019 01:00:00 -0800
https://bairblog.github.io/2019/12/16/data-worth/
Learning to Imitate Human Demonstrations via CycleGAN
<p style="text-align:center;">
<img src="https://bair.berkeley.edu/static/blog/humans-cyclegan/intro.gif" width="" />
<br />
<i>
This work presents AVID, a method that allows a robot to learn a task, such as
making coffee, directly by watching a human perform the task.
</i>
</p>
<p>One of the most important markers of intelligence is the ability to learn by
watching others. Humans are particularly good at this, often being able to
learn tasks by observing other humans. This is possible because we are not
simply copying the actions that other humans take. Rather, we first <em>imagine</em>
ourselves performing the task, and this provides a starting point for further
<em>practicing</em> the task in the real world.</p>
<p>Robots are not yet adept at learning by watching humans or other robots. Prior
methods for <em>imitation learning</em>, where robots learn from demonstrations of the
task, typically assume that the demonstrations can be given directly through
the robot, using techniques such as <a href="https://ieeexplore.ieee.org/document/6249584">kinesthetic
teaching</a> or
<a href="https://sites.google.com/view/vrlfd/">teleoperation</a>. This assumption limits
the applicability of robots in the real world, where robots may be frequently
asked to learn new tasks quickly and without programmers, trained roboticists,
or specialized hardware setups. Can we instead have robots learn directly from
a video of a human demonstration?</p>
<!--more-->
<p>This work presents <a href="https://sites.google.com/view/icra20avid">AVID</a>, a method
that enables robotic imitation learning from human videos through a human-like
strategy of imagination and practice. Given human demonstration
videos, AVID first translates these demonstrations into videos of the robot
performing the task, by means of image-to-image translation. In order to
translate human videos to robot videos directly at the pixel level, we use
<a href="https://junyanz.github.io/CycleGAN/">CycleGAN</a>, a recently proposed model that
can learn image-to-image translation between two domains using unpaired images
from each domain.</p>
<p>To handle complex, multi-stage tasks, we extract <em>instruction images</em> from
these translated robot demonstrations, which depict key stages of the task.
These instructions then define a reward function for a model-based
reinforcement learning (RL) procedure that allows the robot to practice the
task in order to learn its execution.</p>
<p>The main goal of AVID is to minimize the human burden associated with defining
the task and supervising the robot. Providing rewards via human videos handles
the task definition; however, there is still a human cost during the actual
learning process. AVID addresses this by having the robot learn to reset each
stage of the task on its own, in order to be able to practice multiple times
without requiring manual intervention. Thus, the only human involvement
required at robot learning time is in the form of key presses and a few manual
resets. We demonstrate that this approach is capable of solving complex,
long-horizon tasks with minimal human involvement, removing most of the human
burden associated with instrumenting the task setup, manually resetting the
environment, and supervising the learning process.</p>
<h2 id="automated-visual-instruction-following-with-demonstrations">Automated Visual Instruction-Following with Demonstrations</h2>
<p style="text-align:center;">
<img src="https://bair.berkeley.edu/static/blog/humans-cyclegan/diagram.gif" width="" />
<br />
<i>
Our method, AVID, translates human instruction images into the corresponding
robot instruction images via CycleGAN and uses model-based RL to learn how to
complete each instruction.
</i>
</p>
<p>We name our approach <strong>a</strong>utomated <strong>v</strong>isual <strong>i</strong>nstruction-following with
<strong>d</strong>emonstrations, or AVID. AVID relies on several key ideas in image-to-image
translation and model-based RL, and here we will discuss each of these
components.</p>
<h3 id="translating-human-videos-to-robot-videos">Translating Human Videos to Robot Videos</h3>
<p style="text-align:center;">
<img src="https://bair.berkeley.edu/static/blog/humans-cyclegan/horsezebra.gif" height="140" />
<img src="https://bair.berkeley.edu/static/blog/humans-cyclegan/humanrobot.gif" height="140" />
<br />
<i>
Left: CycleGAN has been successful for tasks such as translating from videos of
horses to videos of zebras. Right: We apply CycleGAN to the task of translating
from human demonstration videos to robot demonstration videos.
</i>
</p>
<p>CycleGAN has previously been shown to be effective on a number of domains, such
as frame-by-frame translation of videos of <a href="https://www.youtube.com/watch?v=9reHvktowLY">horses into
zebras</a>. Thus, we train a CycleGAN
where the domains are human and robot images: for training data, we collect
demonstrations from the human and random movements from both the human and
robot. Through this, we obtain a CycleGAN that is capable of generating fake
robot demonstrations from human demonstrations, as depicted above.</p>
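The property CycleGAN enforces can be stated compactly. Below is a minimal sketch of the cycle-consistency term with placeholder generators `G_hr` (human-to-robot) and `G_rh` (robot-to-human); the full CycleGAN objective also includes adversarial losses, omitted here.

```python
import numpy as np

def cycle_consistency_loss(G_hr, G_rh, human_img, robot_img):
    """Cycle-consistency term: translating an image through both generators
    should recover the original, penalized here with an L1 distance.

    `G_hr` and `G_rh` are placeholders for learned image-to-image generators.
    """
    l1 = lambda a, b: float(np.mean(np.abs(a - b)))
    return (l1(G_rh(G_hr(human_img)), human_img)
            + l1(G_hr(G_rh(robot_img)), robot_img))
```

With identity generators the loss is exactly zero; any generator pair that fails to close the cycle pays a positive penalty, which is what keeps the translation semantically faithful.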
<p>Though the robot demonstration is visually realistic for the most part, the
translated video will inevitably exhibit artifacts, such as the coffee cup
warping and the robot gripper being displaced from the arm. This makes learning
from the full video ineffective, and so we devise an alternate strategy that
does not rely on the full video. Specifically, we extract <em>instruction images</em>
from the translated video that depict key stages of the task – for example,
for the coffee making task shown above, the instructions consist of grasping
the cup, placing the cup in the coffee machine, and pushing the button on top
of the machine. By only using specific images rather than the entire video, the
learning process is less affected by imperfect translated demonstrations.</p>
<h3 id="accomplishing-instructions-through-planning">Accomplishing Instructions through Planning</h3>
<p>The instruction images that we extract from the demonstration split up the
overall task into stages, and AVID uses a <a href="https://sites.google.com/view/drl-in-a-handful-of-trials/home">model-based planning
algorithm</a> to
try to complete each stage of the task. Specifically, using the robot data we
collect for CycleGAN training along with the translated instructions, we learn
a <em>dynamics model</em> along with a set of <em>instruction classifiers</em> that predict
when each instruction has been successfully accomplished. When attempting stage
$s$, the algorithm samples actions, predicts the resulting states using the
dynamics model, and then selects the action that is predicted by the classifier
for stage $s$ to have the highest chance of success. This algorithm repeatedly
selects actions for a specified number of time steps or until the classifier
signals success, i.e., the robot believes that it has completed the current
stage.</p>
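The planning step described above can be sketched as a simple random-shooting loop. The `dynamics` and `classifier` callables below are placeholders for the learned models, and the 4-dimensional action space with uniform sampling bounds is an assumption made purely for illustration.

```python
import numpy as np

def plan_action(state, dynamics, classifier, num_candidates=128, seed=None):
    """One planning step via random shooting (a sketch, not AVID's exact code).

    `dynamics(state, action)` predicts the next state and `classifier(state)`
    scores how likely the current stage is to be complete in that state.
    """
    rng = np.random.default_rng(seed)
    # Sample candidate actions uniformly (assumed 4-D, bounded action space).
    actions = rng.uniform(-1.0, 1.0, size=(num_candidates, 4))
    # Score each candidate by the predicted success of the resulting state.
    scores = [classifier(dynamics(state, a)) for a in actions]
    return actions[int(np.argmax(scores))]
```

More sophisticated samplers (e.g. iteratively refined distributions) slot into the same structure; only the candidate-generation line changes.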
<p style="text-align:center;">
<img src="https://bair.berkeley.edu/static/blog/humans-cyclegan/pgm.gif" width="" />
<br />
<i>
We use a structured latent variable model, similar to the SLAC model, to learn
a state representation based on image observations and robot actions.
</i>
</p>
<p><a href="https://bair.berkeley.edu/blog/2019/05/20/solar/">Prior</a>
<a href="https://danijar.com/project/planet/">work</a> has shown that training a
<em>structured latent variable model</em> is an effective strategy for learning tasks
in image-based domains. At a high level, we want our robot to extract a <em>state
representation</em> from its visual input that is low-dimensional and simpler to
learn from than directly learning from image pixels. This is accomplished using
a model similar to the <a href="https://alexlee-gk.github.io/slac/">SLAC model</a>, which
introduces a latent state, decomposed into two parts, that evolves according to
the learned dynamics model and gives rise to the robot images through a
learned neural network <em>decoder</em>. When presented with an image observation, the
robot can then <em>encode</em> the image, with another neural network, into a latent
state and operate at the level of states rather than pixels.</p>
<h3 id="instruction-following-via-model-based-reinforcement-learning">Instruction-Following via Model-Based Reinforcement Learning</h3>
<p style="text-align:center;">
<img src="https://bair.berkeley.edu/static/blog/humans-cyclegan/alg.gif" width="600" />
<br />
<i>
AVID uses model-based planning to accomplish instructions, querying the human
when the classifier signals success and automatically resetting when the
instruction is not achieved.
</i>
</p>
<p>When attempting stage $s$, the planning algorithm will continue selecting
actions for a maximum number of time steps or until the classifier for stage
$s$ signals success. In the latter case, the robot stops and queries the human,
who indicates via a key press whether or not the robot has actually succeeded.
If the human indicates success, the robot moves on to stage $s+1$. However, if
the human indicates failure, then the robot will switch to planning with the
classifier from the <em>previous</em> stage, i.e., stage $s-1$. In this way, the robot
automatically attempts to reset to the beginning of stage $s$ in order to
position itself to try the stage again. This entire procedure ends when the
human indicates success for the final stage, at which point the robot has
completed the entire task.</p>
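The overall supervision loop can be sketched as follows. Here `attempt_stage` and `ask_human` are placeholders for the planner and the human key press described above, and the attempt budget is our own addition so the sketch always terminates.

```python
def run_avid_episode(attempt_stage, ask_human, num_stages, max_attempts=50):
    """Stage-wise practice loop (a simplified sketch of the procedure above).

    `attempt_stage(s)` runs the planner for stage s until its classifier
    signals success or a step limit is hit; `ask_human(s)` is the key press
    confirming whether stage s really succeeded.
    """
    stage = 0
    for _ in range(max_attempts):
        attempt_stage(stage)
        if ask_human(stage):
            stage += 1                    # confirmed: move on to the next stage
            if stage == num_stages:
                return True               # final stage confirmed: task complete
        else:
            stage = max(stage - 1, 0)     # failed: redo the previous stage to reset
    return False
```

Note that a failure sends the robot back exactly one stage, which is what lets it re-position itself and retry without a manual environment reset.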
<p>By having the robot automatically attempt to reset itself, we reduce the human
burden in having to manually reset the environment, as this is only required
when there are problems such as the cup falling over. For the most part, the
human is only required to provide key presses during the training process,
which is much simpler and less intensive than manual intervention. Furthermore,
the stage-wise resetting and retrying allows the robot to practice difficult
stages of the task, which focuses the learning process and robustifies the
robot’s behavior. As shown in the next section, AVID is capable of solving
complicated multi-stage tasks on a real Sawyer robot arm directly from human
demonstration videos and minimal human supervision.</p>
<h2 id="experiments">Experiments</h2>
<p style="text-align:center;">
<img src="https://bair.berkeley.edu/static/blog/humans-cyclegan/coffee.gif" width="350" />
<img src="https://bair.berkeley.edu/static/blog/humans-cyclegan/drawer.gif" width="350" />
<br />
<i>
We demonstrate that AVID is able to learn multi-stage tasks, including
operating a coffee machine and retrieving a cup from a drawer, on a real Sawyer
robotic arm.
</i>
</p>
<p>We ran our experiments on a Sawyer robotic arm, a seven-degree-of-freedom
manipulator that we tasked with operating a coffee machine and retrieving a cup
from a closed drawer, as depicted above. On both tasks, we compared to
<a href="https://sermanet.github.io/tcn/">time-contrastive networks</a> (TCN), a prior
method that can also learn robot skills from human demonstrations. We also
ablated our method to learn from full demonstrations, which we refer to as the
“imitation ablation”, and to operate directly at the pixel level, which we term
the “pixel-space ablation”. Finally, in the setting where we have access to
demonstrations given directly through the robot, which is an assumption made in
most prior work in imitation learning, we compared to <a href="https://arxiv.org/abs/1805.01954">behavioral cloning from
observations</a> (BCO) and a standard behavioral
cloning approach. For additional details about the experiments, such as
hyperparameters and data collection, please refer to the
<a href="https://arxiv.org/abs/1912.04443">paper</a>.</p>
<h3 id="task-setup">Task Setup</h3>
<p style="text-align:center;">
<img src="https://bair.berkeley.edu/static/blog/humans-cyclegan/setup.png" width="" />
<br />
<i>
Instruction images given by the human (top) and translated into the robot’s
domain (bottom) for the coffee making (left) and cup retrieval (right) tasks.
</i>
</p>
<p>We specified three stages for the coffee making task as depicted above.
Starting from the initial state on the left, the instructions were to pick up
the cup, place the cup in the machine, and press the button on top of the
machine. We used a total of 30 human demonstrations for this task, amounting to
about 20 minutes of human time. Cup retrieval is a more complicated task, and
we specified five stages here. From the initial state, the instructions were to
grasp the drawer handle, open the drawer, move the arm up and out of the way,
grasp the cup, and place the cup on top of the drawer. The middle stage of
moving the arm was important so that the robot did not hit the handle and
accidentally close the drawer, and this highlights an additional benefit of
AVID, as specifying this additional instruction was as simple as segmenting out
another time step within the human videos. For cup retrieval, we used 20 human
demonstrations, again amounting to about 20 minutes of human time.</p>
<h3 id="results">Results</h3>
<p style="text-align:center;">
<img src="https://bair.berkeley.edu/static/blog/humans-cyclegan/table.png" width="600" />
<br />
<i>
AVID significantly outperforms ablations and prior methods that use human
demonstrations on the tasks we consider. AVID is also competitive with, and
sometimes even outperforms, baseline methods that use real demonstrations given
on the robot itself.
</i>
</p>
<div class="videoWrapper">
<iframe src="https://www.youtube.com/embed/dhJfFqQmZ1c?rel=0" frameborder="0" allowfullscreen=""></iframe>
</div>
<p>The table and video above summarize the results of running AVID and the
comparisons on the coffee making and cup retrieval tasks. AVID exhibits strong
performance and successfully completes all stages of both tasks most of the
time, with essentially perfect performance in the beginning stages. As the
video shows, AVID constantly makes use of automated resetting and retrying
during both training and the final evaluation, and failures typically
correspond to small, but significant, errors such as knocking the cup over.
AVID also performs significantly better than either the imitation or
pixel-space ablations, demonstrating the advantages obtained through stage-wise
training and learning a latent variable model. Finally, TCN can learn the
earlier stages of cup retrieval but is generally unsuccessful otherwise.</p>
<p>We also evaluate two methods that assume access to real robot demonstrations,
which AVID does not require. First, BCO uses only the image observations from
the demonstrations, and the performance of this method falls off sharply for
the later stages of each task. This highlights the difficulty of learning
temporally extended tasks directly from the full demonstrations. Finally, we
compare to behavioral cloning, which uses both the robot observations and
actions, and we note that this method is the strongest baseline as it uses the
most privileged information out of all the comparisons. However, we find that
AVID still outperforms behavioral cloning for cup retrieval, and this is most
likely due to the explicit stage-wise training that AVID employs.</p>
<h2 id="related-work">Related Work</h2>
<p>As mentioned above, most <a href="https://www.sciencedirect.com/science/article/pii/S0921889008001772">prior work on imitation
learning</a>
has assumed that demonstrations can be given directly on the robot, rather than
learning directly from human videos. However, learning from human videos has
also been studied, through various methods such as
<a href="https://www.sciencedirect.com/science/article/pii/S0921889013001449">pose</a> and
<a href="https://www.ias.informatik.tu-darmstadt.de/uploads/ALR2014/Yang_ALR2014.pdf">object</a>
<a href="https://www.sciencedirect.com/science/article/pii/S0004370215001320">detection</a>,
<a href="https://arxiv.org/abs/1612.07796">predictive</a>
<a href="https://arxiv.org/abs/1703.02658">modeling</a>,
<a href="https://arxiv.org/abs/1707.03374">context</a>
<a href="https://arxiv.org/abs/1911.09676">translation</a>, learning
<a href="https://arxiv.org/abs/1612.06699">reward</a>
<a href="https://arxiv.org/abs/1704.06888">representations</a>, and
<a href="https://bair.berkeley.edu/blog/2018/06/28/daml/">meta-learning</a>. The key
difference between these methods and AVID is that AVID directly translates
human demonstration videos at the pixel level in order to explicitly handle the
change in embodiment.</p>
<p>Furthermore, we evaluate on complex multi-stage tasks, and AVID’s ability to
solve these tasks is enabled in part by the incorporation of explicit
stage-wise training, where resets are learned for each stage. Prior work in RL
has also investigated
<a href="https://people.eecs.berkeley.edu/~pabbeel/papers/2015-IROS-learning-compound-controllers.pdf">learning</a>
<a href="https://arxiv.org/abs/1711.06782">resets</a>, similarly demonstrating that doing
so allows for learning multi-stage tasks and reduces human burden and the need
for manual resets. AVID combines ideas in reset learning, image-to-image
translation, and model-based RL in order to learn temporally extended tasks
directly from image observations in the real world, using only a modest number
of human demonstrations.</p>
<h2 id="future-work">Future Work</h2>
<p>The most exciting direction for future work is to extend the capabilities of
the general CycleGAN in order to enable efficient learning of a wide array of
tasks given only a few human videos of the task. Imagine a CycleGAN that is
trained on a large dataset of kitchen interactions, consisting of a coffee
machine, multiple drawers, and numerous other objects. If the CycleGAN is able
to reliably translate human demonstrations involving any of these objects, then
this opens up the possibility of a general-purpose kitchen robot that can
quickly pick up any task simply through observation and a small amount of
practice. Pursuing this line of research is a promising avenue for enabling
capable and useful robots that can truly learn by watching humans.</p>
<p>This post is based on the following paper:</p>
<ul>
<li>Laura Smith, Nikita Dhawan, Marvin Zhang, Pieter Abbeel, Sergey Levine.<br />
<a href="https://arxiv.org/abs/1912.04443"><strong>AVID : Learning Multi-Stage Tasks via Pixel-Level Translation of Human Videos</strong></a> <br />
<a href="https://sites.google.com/view/icra20avid">Project webpage</a></li>
</ul>
<p>We would like to thank Sergey Levine for providing feedback on this post.</p>
Fri, 13 Dec 2019 01:00:00 -0800
https://bairblog.github.io/2019/12/13/humans-cyclegan/
Model-Based Reinforcement Learning:<br>Theory and Practice<meta name="twitter:title" content="Model-Based Reinforcement Learning: Theory and Practice" />
<meta name="twitter:card" content="summary_large_image" />
<meta name="twitter:image" content="https://people.eecs.berkeley.edu/~janner/mbpo/blog/figures/teaser-01.png" />
<article class="post-content">
<!-- begin section I: introduction -->
<p>
Reinforcement learning systems can make decisions in one of two ways. In the <i>model-based</i> approach, a system uses a predictive model of the world to ask questions of the form “what will happen if I do <i>x</i>?” to choose the best <i>x</i><sup id="fnref:naming-conventions"><a href="#fn:naming-conventions" class="footnote"><font size="-2">1</font></a></sup>. In the alternative <i>model-free</i> approach, the modeling step is bypassed altogether in favor of learning a control policy directly. Although in practice the line between these two techniques can become blurred, as a coarse guide it is useful for dividing up the space of algorithmic possibilities.
</p>
<p style="text-align:center;">
<img src="https://people.eecs.berkeley.edu/~janner/mbpo/blog/figures/teaser-01.png" width="75%" />
<br />
<i>Predictive models can be used to ask “what if?” questions to guide future decisions.</i>
</p>
<p>
The natural question to ask after making this distinction is whether to use such a predictive model. The field has grappled with <a href="https://www.cs.cmu.edu/~tom/10701_sp11/slides/Kaelbling.pdf#page=15">this question</a> for quite a while, and is unlikely to reach a consensus any time soon. However, we have learned enough about designing model-based algorithms that it is possible to draw some general conclusions about best practices and common pitfalls. In this post, we will survey various realizations of model-based reinforcement learning methods. We will then describe some of the tradeoffs that come into play when using a learned predictive model for training a policy and how these considerations motivate a simple but effective strategy for model-based reinforcement learning. The latter half of this post is based on our recent paper on <a href="https://arxiv.org/abs/1906.08253">model-based policy optimization</a>, for which code is available <a href="https://github.com/JannerM/mbpo">here</a>.
</p>
<!--more-->
<!-- begin section II: model-based techniques -->
<h2 id="model-based-techniques">Model-based techniques</h2>
<p>
Below, model-based algorithms are grouped into four categories to highlight the range of uses of predictive models. For the comparative performance of some of these approaches in a continuous control setting, this <a href="https://arxiv.org/abs/1907.02057">benchmarking paper</a> is highly recommended.
</p>
<p>
<strong>Analytic gradient computation</strong>
</p>
<p>
Assumptions about the form of the dynamics and cost function are convenient because they can yield closed-form solutions for locally optimal control, as in the <a href="https://people.eecs.berkeley.edu/~pabbeel/cs287-fa19/slides/Lec5-LQR.pdf#page=11">LQR framework</a>. Even when these assumptions are not valid, <a href="https://homes.cs.washington.edu/~todorov/papers/LiICINCO04.pdf">receding</a>-<a href="https://homes.cs.washington.edu/~todorov/papers/TassaIROS12.pdf">horizon</a> <a href="http://www.youtube.com/watch?v=anIsw2-Lbco&t=3m5s">control</a> can account for small errors introduced by approximated dynamics. Similarly, dynamics models parametrized as <a href="http://mlg.eng.cam.ac.uk/pub/pdf/DeiRas11.pdf">Gaussian processes</a> have analytic gradients that can be used for policy improvement. Controllers derived via these simple parametrizations can also be used to provide <a href="https://graphics.stanford.edu/projects/gpspaper/gps_full.pdf">guiding samples</a> for training more complex nonlinear policies.
</p>
<p>
<strong>Sampling-based planning</strong>
</p>
<p>
In the fully general case of nonlinear dynamics models, we lose guarantees of local optimality and must resort to sampling action sequences. The simplest version of this approach, <a href="https://arxiv.org/abs/1708.02596">random shooting</a>, entails sampling candidate actions from a fixed distribution, evaluating them under a model, and choosing the action that is deemed the most promising. More sophisticated variants iteratively adjust the sampling distribution, as in the <a href="https://www.sciencedirect.com/science/article/pii/B9780444538598000035">cross-entropy method</a> (CEM; used in <a href="https://arxiv.org/abs/1811.04551">PlaNet</a>, <a href="https://arxiv.org/abs/1805.12114">PETS</a>, and <a href="https://arxiv.org/abs/1610.00696">visual</a> <a href="https://arxiv.org/abs/1812.00568">foresight</a>) or <a href="https://arxiv.org/abs/1509.01149">path integral optimal control</a> (used in recent model-based <a href="https://arxiv.org/abs/1909.11652">dexterous manipulation</a> work).
</p>
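The shooting loop common to these methods fits in a few lines. The following is a sketch, not the PETS or PlaNet implementation: `dynamics` and `reward` stand in for a learned model, and the hyperparameters and 1-D toy problem are arbitrary choices for illustration.

```python
import numpy as np

def cem_plan(dynamics, reward, s0, horizon=10, pop=64, elites=8, iters=5):
    """Cross-entropy method over action sequences: sample candidates from a
    Gaussian, evaluate them under the model, refit the Gaussian to the elites."""
    mu, sigma = np.zeros(horizon), np.ones(horizon)
    rng = np.random.default_rng(0)
    for _ in range(iters):
        actions = rng.normal(mu, sigma, size=(pop, horizon))
        returns = np.empty(pop)
        for i, seq in enumerate(actions):
            s, total = s0, 0.0
            for a in seq:                 # roll the candidate out under the model
                s = dynamics(s, a)
                total += reward(s, a)
            returns[i] = total
        elite = actions[np.argsort(returns)[-elites:]]  # keep the best sequences
        mu, sigma = elite.mean(axis=0), elite.std(axis=0) + 1e-6
    return mu[0]  # execute the first action, then replan (MPC style)

# Toy 1-D problem: drive the state toward 1.0.
dyn = lambda s, a: s + 0.1 * a
rew = lambda s, a: -(s - 1.0) ** 2
a0 = cem_plan(dyn, rew, s0=0.0)
```

With `iters=0` this degenerates to random shooting from a fixed distribution; the iterative refit of `mu, sigma` is what distinguishes CEM.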
<p>
In discrete-action settings, however, it is more common to search over tree structures than to iteratively refine a single trajectory of waypoints. Common tree-based search algorithms include <a href="https://hal.inria.fr/inria-00116992/document">MCTS</a>, which has underpinned recent impressive results in <a href="https://arxiv.org/abs/1712.01815">game</a> <a href="https://arxiv.org/abs/1705.08439">playing</a>, and <a href="https://www.ijcai.org/Proceedings/15/Papers/230.pdf">iterated width search</a>. Sampling-based planning, in both continuous and discrete domains, can also be combined with <a href="https://arxiv.org/abs/1904.03177">structured</a>, <a href="https://arxiv.org/abs/1907.09620">physics-based</a>, <a href="https://arxiv.org/abs/1910.12827">object-centric</a> priors.
</p>
<p>
<strong>Model-based data generation</strong>
</p>
<p>
An important detail in many <a href="https://papers.nips.cc/paper/4824-imagenet-classification-with-deep-convolutional-neural-networks.pdf#page=5">machine learning success stories</a> is a means of artificially increasing the size of a training set. It is difficult to define a manual data augmentation procedure for policy optimization, but we can view a predictive model analogously as a learned method of generating synthetic data. The original proposal of such a combination comes from the <a href="http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.51.7362&rep=rep1&type=pdf">Dyna algorithm</a> by Sutton, which alternates between model learning, data generation under a model, and policy learning using the model data. This strategy has been combined with <a href="https://arxiv.org/abs/1603.00748">iLQG</a>, <a href="https://arxiv.org/abs/1802.10592">model ensembles</a>, and <a href="https://arxiv.org/abs/1809.05214">meta-learning</a>; has been scaled <a href="https://arxiv.org/abs/1506.07365">to</a> <a href="https://arxiv.org/abs/1803.10122">image</a> <a href="https://arxiv.org/abs/1903.00374">observations</a>; and is amenable to <a href="https://arxiv.org/abs/1807.03858">theoretical analysis</a>. A close cousin to model-based data generation is the use of a model to improve <a href="https://arxiv.org/abs/1803.00101">target</a> <a href="https://arxiv.org/abs/1807.01675">value</a> estimates for temporal difference learning.
</p>
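Sutton's Dyna loop, in its simplest tabular Dyna-Q form, looks roughly like the sketch below. The `chain` environment and all hyperparameters are toy illustrations, not taken from any of the cited papers; the essential structure is the alternation between direct RL on real transitions, model learning, and planning updates on model-generated transitions.

```python
import random

def dyna_q(env_step, n_states, n_actions, episodes=200, planning_steps=10,
           alpha=0.5, gamma=0.95, eps=0.1, seed=0):
    """Tabular Dyna-Q: interleave real experience, model learning, and
    planning on replayed model transitions (after Sutton, 1990).
    env_step(s, a) -> (reward, next_state, done) is assumed given."""
    rng = random.Random(seed)
    Q = [[0.0] * n_actions for _ in range(n_states)]
    model = {}  # (s, a) -> (reward, next_state)
    for _ in range(episodes):
        s, done = 0, False
        while not done:
            a = rng.randrange(n_actions) if rng.random() < eps else max(
                range(n_actions), key=lambda x: Q[s][x])
            r, s2, done = env_step(s, a)
            Q[s][a] += alpha * (r + gamma * max(Q[s2]) - Q[s][a])  # direct RL
            model[(s, a)] = (r, s2)                                # model learning
            for _ in range(planning_steps):                        # planning
                (ps, pa), (pr, ps2) = rng.choice(list(model.items()))
                Q[ps][pa] += alpha * (pr + gamma * max(Q[ps2]) - Q[ps][pa])
            s = s2
    return Q

# Toy 4-state chain: action 1 moves right, reward 1 for reaching the end.
def chain(s, a):
    s2 = min(s + 1, 3) if a == 1 else max(s - 1, 0)
    return (1.0 if s2 == 3 else 0.0), s2, s2 == 3

Q = dyna_q(chain, n_states=4, n_actions=2)
```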
<p>
<strong>Value-equivalence prediction</strong>
</p>
<p>
A final technique, which does not fit neatly into the model-based versus model-free categorization, is to incorporate computation that resembles <a href="https://arxiv.org/abs/1810.13400">model-based planning</a> without supervising the model’s predictions to resemble actual states. Instead, plans under the model are constrained to match trajectories in the real environment only in their predicted cumulative reward. These <a href="https://arxiv.org/abs/1602.02867">value-equivalent models</a> have been shown to be effective in high-dimensional <a href="https://arxiv.org/abs/1707.03497">observation</a> <a href="https://arxiv.org/abs/1911.08265">spaces</a> where conventional model-based planning has proven difficult.
</p>
<!-- begin section III: trade-offs -->
<h2 id="trade-offs-of-model-data">Trade-offs of model data</h2>
<p>
In what follows, we will focus on the data generation strategy for model-based reinforcement learning. It is not obvious whether incorporating model-generated data into an otherwise model-free algorithm is a good idea. Modeling errors could cause <a href="https://arxiv.org/abs/1906.05243">diverging temporal-difference updates</a>, and in the case of linear approximation, <a href="https://users.cs.duke.edu/~parr/icml08.pdf">model and value fitting are equivalent</a>. However, it is easier to motivate model usage by considering the empirical generalization capacity of predictive models, and such a model-based augmentation procedure turns out to be surprisingly effective in practice.
</p>
<p>
<strong>The Good News</strong>
</p>
<p>
A natural way of thinking about the effects of model-generated data begins with the standard objective of reinforcement learning:
</p>
<p style="text-align:center;">
<img src="https://people.eecs.berkeley.edu/~janner/mbpo/blog/figures/objective.png" width="50%" />
</p>
<p>
which says that we want to maximize the expected cumulative discounted rewards \(r(s_t, a_t)\) from acting according to a policy \(\pi\) in an environment governed by dynamics \(p\). It is important to pay particular attention to the distributions over which this expectation is taken.<sup id="fnref:initial-distribution"><a href="#fn:initial-distribution" class="footnote"><font size="-2">2</font></a></sup> For example, while the expectation is supposed to be taken over trajectories from the current policy \(\pi\), in practice many algorithms re-use trajectories from an old policy \(\pi_\text{old}\) for improved sample-efficiency. There has been much <a href="https://scholarworks.umass.edu/cgi/viewcontent.cgi?article=1079&context=cs_faculty_pubs">algorithm</a> <a href="https://arxiv.org/abs/1606.02647">development</a> dedicated to correcting for the issues associated with the resulting <i>off-policy error</i>.
</p>
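<p>
Written out, the objective in the figure above is, up to minor notational details,
\[ \max_{\pi} \; \mathbb{E}_{a_t \sim \pi(\cdot \mid s_t), \; s_{t+1} \sim p(\cdot \mid s_t, a_t)} \left[ \sum_{t} \gamma^{t} \, r(s_t, a_t) \right], \]
where \(\gamma \in (0,1)\) is the discount factor and the expectation is taken over trajectories generated by acting according to \(\pi\) under the dynamics \(p\).
</p>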
<p>
Using model-generated data can also be viewed as a simple modification of the sampling distribution. Incorporating model data into policy optimization amounts to swapping out the true dynamics \(p\) with an approximation \(\hat{p}\). The <i>model bias</i> introduced by making this substitution acts analogously to the off-policy error, but it allows us to do something rather useful: we can query the model dynamics \(\hat{p}\) at any state to generate samples from the current policy, effectively circumventing the off-policy error.
</p>
<p>
If model usage can be viewed as trading between off-policy error and model bias, then a straightforward way to proceed would be to compare these two terms. However, estimating a model’s error on the <i>current</i> policy’s distribution requires us to make a statement about how that model will generalize. While worst-case bounds are rather pessimistic here, we found that predictive models tend to generalize to the state distributions of future policies well enough to motivate their usage in policy optimization.
</p>
<p style="text-align:center;">
<img src="https://people.eecs.berkeley.edu/~janner/mbpo/blog/figures/generalization.png" width="75%" />
<br />
<i>Generalization of learned models, trained on samples from a data-collecting policy \(\pi_D\) , to the state distributions of future policies \(\pi\) seen during policy optimization. Increasing the training set size not only improves performance on the training distribution, but also on nearby distributions.</i>
</p>
<p>
<strong>The Bad News</strong>
</p>
<p>
The above result suggests that the single-step predictive accuracy of a learned model can be reliable under policy shift. The catch is that most model-based algorithms rely on models for much more than single-step accuracy, often performing model-based rollouts equal in length to the task horizon in order to properly estimate the state distribution under the model. When predictions are strung together in this manner, small <a href="https://arxiv.org/abs/1905.13320">errors</a> <a href="https://arxiv.org/abs/1612.06018">compound</a> over the prediction horizon.
</p>
<p style="text-align:center;">
<img src="https://people.eecs.berkeley.edu/~janner/mbpo/blog/figures/mbpo_hopper_loop.gif" width="100%" />
<br />
<i>A 450-step action sequence rolled out under a learned probabilistic model, with the figure’s position depicting the mean prediction and the shaded regions corresponding to one standard deviation away from the mean. The growing uncertainty and deterioration of a recognizable sinusoidal motion underscore accumulation of model errors.</i>
</p>
<p>
<strong>Analyzing the trade-off</strong>
</p>
<p>
This qualitative trade-off can be made more precise by writing a lower bound on a policy’s true return in terms of its model-estimated return:
</p>
<p style="text-align:center;">
<img src="https://people.eecs.berkeley.edu/~janner/mbpo/blog/figures/bound.png" width="75%" />
<br />
<i>A lower bound on a policy’s
<span style="background-color: rgba(104,204,255,.5)">true return</span> in terms of its expected <span style="background-color: rgba(255,199,70,.5)">model return</span>, the <span style="background-color: rgba(255,109,110,.5)">model rollout length</span>, the <span style="background-color: rgba(106,225,106,.5)">policy divergence</span>, and the <span style="background-color: rgba(242,126,48,.5)">model error</span> on the current policy’s state distribution.</i>
</p>
<p>
As expected, there is a tension involving the model rollout length. The model serves to reduce off-policy error via the terms exponentially decreasing in the rollout length \(k\). However, increasing the rollout length also brings about increased discrepancy proportional to the model error.
</p>
<!-- begin section IV: MBPO -->
<h2 id="model-based-policy-optimization">Model-based policy optimization</h2>
<p>
We have two main conclusions from the above results:
</p>
<ol>
<li>predictive models can generalize well enough for the incurred model bias to be worth the reduction in off-policy error, but</li>
<li>compounding errors make long-horizon model rollouts unreliable.</li>
</ol>
<p>
A simple recipe for combining these two insights is to use the model only to perform short rollouts from all previously encountered real states instead of full-length rollouts from the initial state distribution. Variants of this procedure have been studied in prior works dating back to the classic <a href="http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.51.7362&rep=rep1&type=pdf">Dyna algorithm</a>, and we will refer to it generically as model-based policy optimization (MBPO), which we summarize in the pseudo-code below.
</p>
<p style="text-align:center;">
<img src="https://people.eecs.berkeley.edu/~janner/mbpo/blog/figures/pseudo_code.png" width="75%" />
</p>
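The pseudo-code above can be caricatured in a few lines. Every callable below is a stand-in (the actual MBPO uses a probabilistic model ensemble and a soft actor-critic policy optimizer), so treat this purely as a sketch of the data flow: real steps, short branched model rollouts from previously visited real states, and policy training on the mixed data.

```python
import random

def mbpo_sketch(env_step, model_step, update_policy, policy,
                epochs=10, env_steps=100, rollouts=40, k=1):
    """Schematic MBPO loop; callables are illustrative stand-ins.
    env_step / model_step: state, action -> (next_state, reward)."""
    D_env, D_model = [], []
    s = 0
    for _ in range(epochs):
        for _ in range(env_steps):            # collect real environment data
            a = policy(s)
            s2, r = env_step(s, a)
            D_env.append((s, a, r, s2))
            s = s2
        # (in MBPO proper, the model ensemble would be refit on D_env here)
        for _ in range(rollouts):             # short branched model rollouts
            sm = random.choice(D_env)[0]      # branch from a real state
            for _ in range(k):                # only k steps under the model
                a = policy(sm)
                sm2, r = model_step(sm, a)
                D_model.append((sm, a, r, sm2))
                sm = sm2
        update_policy(D_env + D_model)        # train on mixed real + model data
    return len(D_env), len(D_model)

# Toy usage with trivial stand-ins; counts only, no learning happens.
env = lambda s, a: (s + a, 0.0)
n_env, n_model = mbpo_sketch(env, env, lambda data: None, lambda s: 1)
# n_env, n_model == (1000, 400)
```

The key structural choice, per the bound above, is the small rollout length `k`: the model contributes on-policy data without accumulating compounding error.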
<p>
We found that this simple procedure, combined with a few important design decisions like using <a href="https://arxiv.org/abs/1805.12114">probabilistic model ensembles</a> and a <a href="https://arxiv.org/abs/1801.01290">stable off-policy model-free optimizer</a>, yields the best combination of sample efficiency and asymptotic performance. We also found that MBPO avoids the pitfalls that have prevented recent model-based methods from scaling to higher-dimensional states and long-horizon tasks.
</p>
<p style="text-align:center;">
<img src="https://people.eecs.berkeley.edu/~janner/mbpo/blog/figures/consolidated.png" width="75%" />
<br />
<i>Learning curves of MBPO and five prior works on continuous control benchmarks. MBPO reaches the same asymptotic performance as the best model-free algorithms, often with only one-tenth of the data, and scales to state dimensions and horizon lengths that cause previous model-based algorithms to fail.</i>
</p>
<hr />
<p>
This post is based on the following paper:
</p>
<ul>
<li>
<a href="https://arxiv.org/abs/1906.08253"><strong>When to Trust Your Model: Model-Based Policy Optimization</strong></a>
<br />
<a href="https://people.eecs.berkeley.edu/~janner/">Michael Janner</a>, <a href="https://people.eecs.berkeley.edu/~justinjfu/">Justin Fu</a>, <a href="http://marvinzhang.com/">Marvin Zhang</a>, and <a href="https://people.eecs.berkeley.edu/~svlevine/">Sergey Levine</a>
<br />
<em>Neural Information Processing Systems (NeurIPS), 2019.</em>
<br />
<a href="https://github.com/JannerM/mbpo">Open-source code</a>
</li>
</ul>
<p>
<em>I would like to thank Michael Chang and Sergey Levine for their valuable feedback.</em>
</p>
<hr />
<div class="footnotes">
<ol>
<li id="fn:naming-conventions">
<p>
In reinforcement learning, this variable is typically denoted by <i>a</i> for “action.” In control theory, it is denoted by <i>u</i> for “upravleniye” (or more faithfully, “управление”), which I am told is “control” in Russian.<a href="#fnref:naming-conventions" class="reversefootnote">↩</a>
</p>
</li>
<li id="fn:initial-distribution">
<p>
We have omitted the initial state distribution \(s_0 \sim \rho(\cdot)\) to focus on those distributions affected by incorporating a learned model.<a href="#fnref:initial-distribution" class="reversefootnote">↩</a>
</p>
</li>
</ol>
</div>
<hr />
<p>
<font size="-1">
<strong>References</strong>
<ol>
<li>KR Allen, KA Smith, and JB Tenenbaum. <a href="https://arxiv.org/abs/1907.09620">The tools challenge: rapid trial-and-error learning in physical problem solving.</a> CogSci 2019.</li>
<li>B Amos, IDJ Rodriguez, J Sacks, B Boots, JZ Kolter. <a href="https://arxiv.org/abs/1810.13400">Differentiable MPC for end-to-end planning and control.</a> NeurIPS 2018.</li>
<li>T Anthony, Z Tian, and D Barber. <a href="https://arxiv.org/abs/1705.08439">Thinking fast and slow with deep learning and tree search.</a> NIPS 2017.</li>
<li>K Asadi, D Misra, S Kim, and ML Littman. <a href="https://arxiv.org/abs/1905.13320">Combating the compounding-error problem with a multi-step model.</a> arXiv 2019.</li>
<li>V Bapst, A Sanchez-Gonzalez, C Doersch, KL Stachenfeld, P Kohli, PW Battaglia, and JB Hamrick. <a href="https://arxiv.org/abs/1904.03177">Structured agents for physical construction.</a> ICML 2019.</li>
<li>ZI Botev, DP Kroese, RY Rubinstein, and P L’Ecuyer. <a href="https://www.sciencedirect.com/science/article/pii/B9780444538598000035">The cross-entropy method for optimization.</a> Handbook of Statistics, volume 31, chapter 3. 2013.</li>
<li>J Buckman, D Hafner, G Tucker, E Brevdo, and H Lee. <a href="https://arxiv.org/abs/1807.01675">Sample-efficient reinforcement learning with stochastic ensemble value expansion.</a> NeurIPS 2018.</li>
<li>K Chua, R Calandra, R McAllister, and S Levine. <a href="https://arxiv.org/abs/1805.12114">Deep reinforcement learning in a handful of trials using probabilistic dynamics models.</a> NeurIPS 2018.</li>
<li>I Clavera, J Rothfuss, J Schulman, Y Fujita, T Asfour, and P Abbeel. <a href="https://arxiv.org/abs/1809.05214">Model-based reinforcement learning via meta-policy optimization.</a> CoRL 2018.</li>
<li>R Coulom. <a href="https://hal.inria.fr/inria-00116992/document">Efficient selectivity and backup operators in Monte-Carlo tree search.</a> CG 2006.</li>
<li>M Deisenroth and CE Rasmussen. <a href="http://mlg.eng.cam.ac.uk/pub/pdf/DeiRas11.pdf">PILCO: A model-based and data-efficient approach to policy search.</a> ICML 2011.</li>
<li>F Ebert, C Finn, S Dasari, A Xie, A Lee, and S Levine. <a href="https://arxiv.org/abs/1812.00568">Visual foresight: model-based deep reinforcement learning for vision-based robotic control.</a> arXiv 2018.</li>
<li>V Feinberg, A Wan, I Stoica, MI Jordan, JE Gonzalez, and S Levine. <a href="https://arxiv.org/abs/1803.00101">Model-based value estimation for efficient model-free reinforcement learning.</a> ICML 2018.</li>
<li>C Finn and S Levine. <a href="https://arxiv.org/abs/1610.00696">Deep visual foresight for planning robot motion.</a> ICRA 2017.</li>
<li>S Gu, T Lillicrap, I Sutskever, and S Levine. <a href="https://arxiv.org/abs/1603.00748">Continuous deep Q-learning with model-based acceleration.</a> ICML 2016.</li>
<li>D Ha and J Schmidhuber. <a href="https://arxiv.org/abs/1803.10122">World models.</a> NeurIPS 2018.</li>
<li>T Haarnoja, A Zhou, P Abbeel, and S Levine. <a href="https://arxiv.org/abs/1801.01290">Soft actor-critic: off-policy maximum entropy deep reinforcement learning with a stochastic actor.</a> ICML 2018.</li>
<li>D Hafner, T Lillicrap, I Fischer, R Villegas, D Ha, H Lee, and J Davidson. <a href="https://arxiv.org/abs/1811.04551">Learning latent dynamics for planning from pixels.</a> ICML 2019.</li>
<li>LP Kaelbling, ML Littman, and AP Moore. <a href="https://www.cs.cmu.edu/~tom/10701_sp11/slides/Kaelbling.pdf#page=15">Reinforcement learning: a survey.</a> JAIR 1996.</li>
<li>L Kaiser, M Babaeizadeh, P Milos, B Osinski, RH Campbell, K Czechowski, D Erhan, C Finn, P Kozakowski, S Levine, R Sepassi, G Tucker, and H Michalewski. <a href="https://arxiv.org/abs/1903.00374">Model-based reinforcement learning for Atari.</a> arXiv 2019.</li>
<li>A Krizhevsky, I Sutskever, and GE Hinton. <a href="https://papers.nips.cc/paper/4824-imagenet-classification-with-deep-convolutional-neural-networks.pdf#page=5">ImageNet classification with deep convolutional neural networks.</a> NIPS 2012.</li>
<li>T Kurutach, I Clavera, Y Duan, A Tamar, and P Abbeel. <a href="https://arxiv.org/abs/1802.10592">Model-ensemble trust-region policy optimization.</a> ICLR 2018.</li>
<li>S Levine and V Koltun. <a href="https://graphics.stanford.edu/projects/gpspaper/gps_full.pdf">Guided policy search.</a> ICML 2013.</li>
<li>W Li and E Todorov. <a href="https://homes.cs.washington.edu/~todorov/papers/LiICINCO04.pdf">Iterative linear quadratic regulator design for nonlinear biological movement systems.</a> ICINCO 2004.</li>
<li>N Lipovetzky, M Ramirez, and H Geffner. <a href="https://www.ijcai.org/Proceedings/15/Papers/230.pdf">Classical planning with simulators: results on the Atari video games.</a> IJCAI 2015.</li>
<li>Y Luo, H Xu, Y Li, Y Tian, T Darrell, and T Ma. <a href="https://arxiv.org/abs/1807.03858">Algorithmic framework for model-based deep reinforcement learning with theoretical guarantees.</a> ICLR 2019.</li>
<li>R Munos, T Stepleton, A Harutyunyan, MG Bellemare. <a href="https://arxiv.org/abs/1606.02647">Safe and efficient off-policy reinforcement learning.</a> NIPS 2016.</li>
<li>A Nagabandi, K Konoglie, S Levine, and V Kumar. <a href="https://arxiv.org/abs/1909.11652">Deep dynamics models for learning dexterous manipulation.</a> arXiv 2019.</li>
<li>A Nagabandi, GS Kahn, R Fearing, and S Levine. <a href="https://arxiv.org/abs/1708.02596">Neural network dynamics for model-based deep reinforcement learning with model-free fine-tuning.</a> ICRA 2018.</li>
<li>J Oh, S Singh, and H Lee. <a href="https://arxiv.org/abs/1707.03497">Value prediction network.</a> NIPS 2017.</li>
<li>R Parr, L Li, G Taylor, C Painter-Wakefield, ML Littman. <a href="https://users.cs.duke.edu/~parr/icml08.pdf">An analysis of linear models, linear value-function approximation, and feature selection for reinforcement learning.</a> ICML 2008.</li>
<li>D Precup, R Sutton, and S Singh. <a href="http://scholarworks.umass.edu/cgi/viewcontent.cgi?article=1079&context=cs_faculty_pubs">Eligibility traces for off-policy policy evaluation.</a> ICML 2000.</li>
<li>J Schrittwieser, I Antonoglou, T Hubert, K Simonyan, L Sifre, S Schmitt, A Guez, E Lockhart, D Hassabis, T Graepel, T Lillicrap, and D Silver. <a href="https://arxiv.org/abs/1911.08265">Mastering Atari, Go, chess and shogi by planning with a learned model.</a> arXiv 2019.</li>
<li>D Silver, T Hubert, J Schrittwieser, I Antonoglou, M Lai, A Guez, M Lanctot, L Sifre, D Kumaran, T Graepel, TP Lillicrap, K Simonyan, and D Hassabis. <a href="https://arxiv.org/abs/1712.01815">Mastering chess and shogi by self-play with a general reinforcement learning algorithm.</a> arXiv 2017.</li>
<li>RS Sutton. <a href="http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.51.7362&rep=rep1&type=pdf">Integrated architectures for learning, planning, and reacting based on approximating dynamic programming.</a> ICML 1990.</li>
<li>E Talvitie. <a href="https://arxiv.org/abs/1612.06018">Self-correcting models for model-based reinforcement learning.</a> AAAI 2016.</li>
<li>A Tamar, Y Wu, G Thomas, S Levine, and P Abbeel. <a href="https://arxiv.org/abs/1602.02867">Value iteration networks.</a> NIPS 2016.</li>
<li>Y Tassa, T Erez, and E Todorov. <a href="https://homes.cs.washington.edu/~todorov/papers/TassaIROS12.pdf">Synthesis and stabilization of complex behaviors through online trajectory optimization.</a> IROS 2012.</li>
<li>H van Hasselt, M Hessel, and J Aslanides. <a href="https://arxiv.org/abs/1906.05243">When to use parametric models in reinforcement learning?</a> NeurIPS 2019.</li>
<li>R Veerapaneni, JD Co-Reyes, M Chang, M Janner, C Finn, J Wu, JB Tenenbaum, and S Levine. <a href="https://arxiv.org/abs/1910.12827">Entity abstraction in visual model-based reinforcement learning.</a> CoRL 2019.</li>
<li>T Wang, X Bao, I Clavera, J Hoang, Y Wen, E Langlois, S Zhang, G Zhang, P Abbeel, and J Ba. <a href="https://arxiv.org/abs/1907.02057">Benchmarking model-based reinforcement learning.</a> arXiv 2019.</li>
<li>M Watter, JT Springenberg, J Boedecker, M Riedmiller. <a href="https://arxiv.org/abs/1506.07365">Embed to control: a locally linear latent dynamics model for control from raw images.</a> NIPS 2015.</li>
<li>G Williams, A Aldrich, and E Theodorou. <a href="https://arxiv.org/abs/1509.01149">Model predictive path integral control using covariance variable importance sampling.</a> arXiv 2015.</li>
</ol>
</font>
</p>
</article>
Thu, 12 Dec 2019 01:00:00 -0800
https://bairblog.github.io/2019/12/12/mbpo/
https://bairblog.github.io/2019/12/12/mbpo/
Data-Driven Deep Reinforcement Learning
<p>One of the primary factors behind the success of machine learning approaches in open-world settings, such as <a href="https://arxiv.org/abs/1512.03385">image recognition</a> and <a href="https://arxiv.org/abs/1810.04805">natural language processing</a>, has been the ability of high-capacity deep neural network function approximators to learn generalizable models from large amounts of data. Deep reinforcement learning methods, however, require active online data collection, where the model actively interacts with its environment. This makes such methods hard to scale to complex real-world problems, where active data collection means that large datasets of experience must be collected for every experiment – this can be expensive and, for systems such as autonomous vehicles or robots, potentially <a href="https://arxiv.org/abs/1801.08757">unsafe</a>. In a number of domains of practical interest, such as autonomous driving, robotics, and games, there exist plentiful amounts of previously collected interaction data that consist of informative behaviors and are a rich source of prior information. Deep RL algorithms that can utilize such prior datasets will not only scale to real-world problems, but will also lead to solutions that generalize substantially better. A <strong>data-driven</strong> paradigm for reinforcement learning will enable us to pre-train and deploy agents capable of sample-efficient learning in the real world.</p>
<p style="text-align:center;">
<img src="https://paper-attachments.dropbox.com/s_03D8A88577B961181603AE5EDBD4A511CD8E828E7651B8AA640A61950DAB9783_1575540575850_off_policy_teaser.png" width="" />
</p>
<p>In this work, we ask the following question: <em>Can deep RL algorithms effectively leverage previously collected offline data and learn without interaction with the environment?</em> We refer to this problem statement as <em>fully off-policy RL</em>, previously also called <em>batch RL</em> in the literature. A class of deep RL algorithms, known as off-policy RL algorithms, can in principle learn from previously collected data. Recent off-policy RL algorithms such as <a href="https://arxiv.org/abs/1812.05905">Soft Actor-Critic</a> (SAC), <a href="https://arxiv.org/abs/1806.10293">QT-Opt</a>, and <a href="https://arxiv.org/abs/1710.02298">Rainbow</a> have demonstrated sample-efficient performance in a number of challenging domains such as robotic manipulation and Atari games. However, all of these methods still require online data collection, and their ability to learn from fully off-policy data is limited in practice. In this work, we show why existing deep RL algorithms can fail in the fully off-policy setting. We then propose effective solutions to mitigate these issues.</p>
<!--more-->
<h2 id="why-cant-off-policy-deep-rl-algorithms-learn-from-static-datasets">Why can’t off-policy deep RL algorithms learn from static datasets?</h2>
<p>Let’s first study how state-of-the-art deep RL algorithms perform in the fully
off-policy setting. We choose the Soft Actor-Critic (SAC) algorithm and
investigate its performance when trained only on static data. Figure 1 shows the
training curve for SAC trained solely on varying amounts of previously
collected <em>expert demonstrations</em> for the HalfCheetah-v2 gym benchmark task.
Although the data demonstrates successful task completion, none of these runs
succeed, with corresponding Q-values diverging in some cases. At first glance,
this resembles <em><strong>overfitting</strong>,</em> as the evaluation performance deteriorates
with more training (green curve), but increasing the size of the static dataset
does not rectify the problem (orange (1e4 samples) vs green (1e5 samples)),
suggesting the issue is more complex.</p>
<p style="text-align:center;">
<img src="https://paper-attachments.dropbox.com/s_03D8A88577B961181603AE5EDBD4A511CD8E828E7651B8AA640A61950DAB9783_1575540563406_returns_cheetah.png" width="600" />
<br />
<i>
Figure 1: Average Return and logarithm of the Q-value for HalfCheetah-v2 with varying amounts of expert data.
</i>
</p>
<p>Most off-policy RL methods in use today, including SAC used in Figure 1, are
based on approximate dynamic programming (though many also utilize importance
sampled policy gradients, for example <a href="http://auai.org/uai2019/proceedings/papers/440.pdf">Liu et al.
(2019)</a>). The core
component of approximate dynamic programming in deep RL is the value function
or Q-function. The <em>optimal</em> Q-function <script type="math/tex">Q^*</script> obeys the optimal Bellman
equation, given below:</p>
<script type="math/tex; mode=display">Q^* = \mathcal{T}^* Q^* \;\;\; \mbox{;} \;\;\; (\mathcal{T}^* \hat{Q})(s, a) := R(s, a) + \gamma \mathbb{E}_{T(s'|s,a)}[\max_{a'}\hat{Q}(s', a')]</script>
<p>Reinforcement learning then corresponds to minimizing the squared difference
between the left-hand side and right-hand side of this equation, also referred
to as the mean squared Bellman error (MSBE):</p>
<script type="math/tex; mode=display">Q := \arg \min_{\hat{Q}} \mathbb{E}_{s \sim \mathcal{D}, a \sim \beta(a|s)} \left[ (\hat{Q}(s, a) - (\mathcal{T}^* \hat{Q})(s, a))^2 \right]</script>
<p>MSBE is minimized on transition samples in a dataset <script type="math/tex">\mathcal{D}</script> generated
by a behavior policy <script type="math/tex">\beta(a|s)</script>. Although minimizing MSBE corresponds to a
supervised regression problem, the targets for this regression are themselves
derived from the current Q-function estimate.</p>
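<p>To make this concrete, below is a minimal tabular sketch of regression onto bootstrapped targets; the two-state MDP, its transitions, and the discount are made up for illustration. Each sweep fits <script type="math/tex">\hat{Q}(s, a)</script> to the target <script type="math/tex">(\mathcal{T}^* \hat{Q})(s, a)</script> on the transitions in a static dataset:</p>

```python
import numpy as np

gamma = 0.9
# A toy static dataset of (s, a, r, s') transitions from a behavior policy.
batch = [(0, 1, 0.0, 1), (1, 0, 1.0, 0)]

Q = np.zeros((2, 2))  # tabular stand-in for the Q-network
for _ in range(200):
    for s, a, r, s2 in batch:
        # Regression target (T* Q)(s, a); a deep RL method would instead take
        # a gradient step on the squared error to this (held-fixed) target.
        Q[s, a] = r + gamma * Q[s2].max()
```

<p>Note that the max over <script type="math/tex">a'</script> queries entries (here <code>Q[0, 0]</code> and <code>Q[1, 1]</code>) for actions that never appear in the dataset; in the tabular case they stay at zero, but with function approximation these values can be arbitrary.</p>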
<p>We can understand the source of the instability shown in Figure 1 by examining
the form of the Bellman backup described above. The targets are calculated by
maximizing the learned Q-values with respect to the action at the next state
(<script type="math/tex">s'</script>) for Q-learning methods, or by computing an expected value under the
policy at the next state, <script type="math/tex">\mathbb{E}_{s' \sim T(s'|s, a), a' \sim \pi(a'|s')}
[\hat{Q}(s', a')]</script> for actor-critic methods that maintain an explicit policy
alongside a Q-function. However, the Q-function estimator is only reliable on
action-inputs from the behavior policy <script type="math/tex">\beta</script>, which is the training
distribution. As a result, naively maximizing the value may evaluate the
<script type="math/tex">\hat{Q}</script> estimator on actions that lie far outside of the training
distribution, resulting in pathological values that incur large absolute error
from the optimal desired Q-values. We refer to such actions as
out-of-distribution (OOD) actions, and we call the error introduced into the
Q-values by backing up target values at OOD actions <em>bootstrapping
error</em>. A schematic of this problem is shown in Figure 2 below.</p>
<p style="text-align:center;">
<img src="https://paper-attachments.dropbox.com/s_03D8A88577B961181603AE5EDBD4A511CD8E828E7651B8AA640A61950DAB9783_1575540995462_problem.png" width="600" />
<br />
<i>
Figure 2: Incorrectly high Q-values for OOD actions may be used for backups, leading to accumulation of error.
</i>
</p>
<p>Using an incorrect Q-value for computing the backup leads to accumulation of
error – minimizing MSBE perfectly leads to propagation of all imperfections in
the target values into the current Q-function estimator. We refer to this
process as accumulation or <strong>propagation</strong> of bootstrapping error over
iterations of training. The agent cannot correct these errors because it
cannot gather ground-truth return information by actively exploring the
environment. Accumulation of bootstrapping error can make Q-values diverge to
infinity (for example, blue and green curves in Figure 1), or not converge to
the correct values (for example, red curve in Figure 1), leading to final
performance that is substantially worse than even the average performance in
the dataset.</p>
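<p>The propagation mechanism can be seen in a two-state toy example (all numbers hypothetical): if the Q-estimate for an OOD action at the next state is erroneously high, a single backup copies a discounted version of that error into the current state, and further backups spread it onward:</p>

```python
import numpy as np

gamma = 0.9
Q = np.zeros((2, 2))
Q[1, 1] = 100.0  # erroneously high estimate for an OOD action at s' = 1

# Backup for an in-distribution transition (s=0, a=0, r=0, s'=1):
# the max over next actions picks the erroneous OOD value.
Q[0, 0] = 0.0 + gamma * Q[1].max()  # ~90.0, despite a true value of 0
```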
<h2 id="towards-deep-rl-algorithms-that-learn-in-the-presence-of-ood-actions">Towards deep RL algorithms that learn in the presence of OOD actions</h2>
<p><em>How can we develop RL algorithms that learn from static data without being affected by OOD actions?</em> We will first review some of the existing approaches in the literature and then describe our recent work, BEAR, which tackles this problem. There are broadly two classes of methods for solving this problem.</p>
<ul>
<li><strong>Behavioral cloning (BC) based methods:</strong> When the static dataset is generated by an expert, one can utilize behavioral cloning to mimic the expert policy, as is done in imitation learning. In a generalized setting, where the behavior policy can be suboptimal but reward information is accessible, one can choose to mimic only a subset of good action decisions from the entire dataset. This is the idea behind prior works such as <a href="http://is.tuebingen.mpg.de/fileadmin/user_upload/files/publications/ICML2007-Peters_4493[0].pdf">reward-weighted regression</a> (RWR), <a href="https://www.aaai.org/ocs/index.php/AAAI/AAAI10/paper/viewFile/1851/2264">relative entropy policy search</a> (REPS), <a href="https://papers.nips.cc/paper/7866-exponentially-weighted-imitation-learning-for-batched-historical-data">MARWIL</a>, some very recent works such as <a href="https://openreview.net/forum?id=rke7geHtwH">ABM</a>, and a recent work from our lab, <a href="https://arxiv.org/abs/1910.00177">advantage weighted regression</a> (AWR), where we show that an advantage-weighted form of behavioral cloning, which assigns higher likelihoods to demonstration actions that receive higher advantages, can also be used in such situations. Such methods train only on actions observed in the dataset and hence avoid OOD actions entirely.</li>
<li><strong>Dynamic programming (DP) methods:</strong> Dynamic programming methods are appealing in fully off-policy RL scenarios because of their ability to pool information across trajectories, unlike BC-based methods, which are implicitly constrained to lie in the vicinity of the best-performing trajectory in the static dataset. For example, Q-iteration on a dataset consisting of all transitions in an MDP should return the optimal policy at convergence; however, the previously described BC-based methods may fail to recover optimality if the individual trajectories are highly suboptimal. Within this class, some recent work includes <a href="https://arxiv.org/abs/1812.02900">batch constrained Q-learning</a> (BCQ), which constrains the trained policy distribution to lie close to the behavior policy that generated the dataset. This is an optimal strategy when the static dataset is generated by an expert policy, but it can be suboptimal if the data comes from an arbitrarily suboptimal policy. Other recent works, <a href="https://arxiv.org/abs/1512.08562">G-Learning</a>, <a href="https://arxiv.org/abs/1907.00456">KL-Control</a>, and <a href="https://arxiv.org/abs/1911.11361">BRAC</a>, enforce closeness to the behavior policy by solving a KL-constrained RL problem. <a href="https://arxiv.org/abs/1712.06924">SPIBB</a> selectively constrains the learned policy to match the behavior policy in probability density on less frequent actions.</li>
</ul>
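<p>As a sketch of the BC-based family above, the snippet below computes the kind of per-sample weights used by advantage-weighted methods such as AWR; the returns, baseline, and temperature are hypothetical, and a real implementation would plug these weights into a maximum-likelihood (behavioral cloning) loss over dataset actions:</p>

```python
import numpy as np

def advantage_weights(returns, baseline, beta=1.0):
    # Exponentiated, normalized advantages: dataset actions with a higher
    # advantage receive a larger weight in the weighted cloning loss.
    adv = np.asarray(returns, dtype=float) - baseline
    w = np.exp(adv / beta)
    return w / w.sum()

# Three dataset actions whose trajectories achieved returns 0, 1, and 2,
# against a state-value baseline of 1 (all numbers hypothetical):
w = advantage_weights([0.0, 1.0, 2.0], baseline=1.0)
# w is increasing: the best-performing action dominates the cloning loss
```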
<p>The key question we pose in our work is: <strong>Which policies can be reliably used
for backups without backing up OOD actions?</strong> Once this question is answered,
the job of an RL algorithm reduces to picking the best policy in this set. In
our work, we provide a theoretical characterization of this set of policies and
use insights from theory to propose a practical dynamic programming based deep
RL algorithm called <a href="https://arxiv.org/abs/1906.00949">BEAR</a> that learns from
purely static data.</p>
<h2 id="bootstrapping-error-accumulation-reduction-bear">Bootstrapping Error Accumulation Reduction (BEAR)</h2>
<p style="text-align:center;">
<img src="https://paper-attachments.dropbox.com/s_03D8A88577B961181603AE5EDBD4A511CD8E828E7651B8AA640A61950DAB9783_1575540392160_all_three_cases.png" width="" />
<br />
<i>
Figure 3: Illustration of support constraint (BEAR) (right) and distribution-matching constraint (middle).
</i>
</p>
<p>The key idea behind BEAR is to constrain the learned policy to lie <em>within the
support</em> (Figure 3, right) of the behavior policy distribution. This is in
contrast to distribution matching (Figure 3, middle) – BEAR does not constrain
the learned policy to be close in distribution to the behavior policy, but only
requires that the learned policy places non-zero probability mass on actions
with non-negligible behavior policy density. We refer to this as <strong>support
constraint.</strong> As an example, in a setting with a uniform-at-random behavior
policy, a support constraint allows dynamic programming to learn an optimal,
deterministic policy. However, a distribution-matching constraint requires
that the learned policy be highly stochastic (and thus not optimal): in
Figure 3 (middle), the learned policy is constrained to be one of the
stochastic purple policies, whereas in Figure 3 (right) it can be a
(near-)deterministic yellow policy. For the readers interested in
theory, the theoretical insight behind this choice is that a support constraint
enables us to control error propagation by upper bounding
<a href="https://www.aaai.org/Papers/AAAI/2005/AAAI05-159.pdf">concentrability</a> under
the learned policy, while providing the capacity to reduce divergence from the
optimal policy.</p>
<p>How do we enforce that the learned policy satisfies the support constraint? In
practice, we use the sampled <a href="http://jmlr.csail.mit.edu/papers/v13/gretton12a.html">Maximum Mean
Discrepancy</a> (MMD)
distance between actions as a measure of support divergence. Letting <script type="math/tex">X =
\{x_1, \cdots, x_n\}</script>, <script type="math/tex">Y = \{y_1, \cdots, y_m\}</script>, and letting <script type="math/tex">k</script> be any RBF kernel, we have:</p>
<script type="math/tex; mode=display">\text{MMD}^2(X, Y) = \frac{1}{n^2} \sum_{i, i'} k(x_i, x_{i'}) - \frac{2}{nm} \sum_{i, j} k(x_i, y_j) + \frac{1}{m^2} \sum_{j, j'} k(y_j, y_{j'}).</script>
<p>A simple code snippet for computing MMD is shown below:</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>def gaussian_kernel(x, y, sigma=0.1):
    # x, y: torch tensors of shape (n, d) and (m, d); returns the n x m
    # matrix of pairwise RBF kernel values.
    sq_dists = (x.unsqueeze(1) - y.unsqueeze(0)).pow(2).sum(-1)
    return (-sq_dists / (2 * sigma ** 2)).exp()

def compute_mmd(x, y):
    k_x_x = gaussian_kernel(x, x)
    k_x_y = gaussian_kernel(x, y)
    k_y_y = gaussian_kernel(y, y)
    # Clamp guards against tiny negative values from sampling noise.
    return (k_x_x.mean() + k_y_y.mean() - 2 * k_x_y.mean()).clamp(min=0).sqrt()
</code></pre></div></div>
<p><script type="math/tex">\text{MMD}</script> is amenable to stochastic gradient-based training and we show
that computing <script type="math/tex">\text{MMD}(P, Q)</script> using only a few samples from both
distributions <script type="math/tex">P</script> and <script type="math/tex">Q</script> provides sufficient signal to quantify
differences in support but not in probability density, hence making it a
preferred measure for implementing the support constraint. To sum up, the new
(constrained-)policy improvement step in an actor-critic setup is given by:</p>
<script type="math/tex; mode=display">\pi_\phi := \max_{\pi \in \Delta_{|S|}} \mathbb{E}_{s \sim \mathcal{D}} \mathbb{E}_{a \sim \pi(\cdot|s)} \left[ Q_\theta(s, a)\right]
\quad \mbox{s.t.} \quad
\mathbb{E}_{s \sim \mathcal{D}} [\text{MMD}(\beta(\cdot|s), \pi(\cdot|s))] \leq \varepsilon</script>
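<p>This constrained improvement step can be sketched with discrete candidate actions at a single state (a simplification: BEAR works with continuous actions and enforces the constraint during gradient-based policy training, and the Q-values below are hypothetical). The learned policy may only pick the highest-Q action whose sampled MMD to the behavior actions is small:</p>

```python
import numpy as np

def mmd_sq(x, y, sigma=0.2):
    # Sampled squared MMD with an RBF kernel, for 1-D action samples.
    k = lambda a, b: np.exp(-(a[:, None] - b[None, :]) ** 2 / (2 * sigma ** 2))
    return k(x, x).mean() + k(y, y).mean() - 2 * k(x, y).mean()

rng = np.random.default_rng(0)
# Behavior-policy actions at one state: a bimodal mixture around -1 and +1.
behavior = np.concatenate([rng.normal(-1, 0.05, 50), rng.normal(1, 0.05, 50)])

# Candidate deterministic actions with hypothetical Q-values; a = 0.0 has the
# highest Q-value but lies outside the support of the behavior distribution.
candidates = {-1.0: 0.5, 0.0: 2.0, 1.0: 1.5}
epsilon = 1.0

feasible = {a: q for a, q in candidates.items()
            if mmd_sq(np.full(10, a), behavior) <= epsilon}
chosen = max(feasible, key=feasible.get)  # the in-support action a = 1.0
```

<p>The unconstrained argmax would pick the OOD action <code>a = 0.0</code>; the support constraint rejects it while still permitting a deterministic, better-than-behavior choice.</p>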
<p><strong>Support constraint vs Distribution-matching constraint</strong>
Some works, for example, <a href="https://arxiv.org/abs/1907.00456">KL-Control</a>, <a href="https://arxiv.org/abs/1911.11361">BRAC</a>, <a href="https://arxiv.org/abs/1512.08562">G-Learning</a>, argue that using a distribution-matching constraint might suffice in such fully off-policy RL problems. In this section, we take a slight detour towards analyzing this choice. In particular, we provide an instance of an MDP where distribution-matching constraint might lead to arbitrarily suboptimal behavior while support matching does not suffer from this issue.</p>
<p>Consider the 1D-lineworld MDP shown in Figure 4 below. Two actions (left and right) are available to the agent at each state. The agent is tasked with reaching the goal state <script type="math/tex">G</script> starting from state <script type="math/tex">S</script>; the corresponding per-step reward values are shown in Figure 4(a). The agent is only allowed to learn from behavior data generated by a policy that performs actions with the probabilities described in Figure 4(b). In particular, this behavior policy executes the suboptimal action at states between S and G with a high likelihood of 0.9; however, both actions <script type="math/tex">\leftarrow</script> and <script type="math/tex">\rightarrow</script> are in-distribution at all of these states.</p>
<p style="text-align:center;">
<img src="https://paper-attachments.dropbox.com/s_03D8A88577B961181603AE5EDBD4A511CD8E828E7651B8AA640A61950DAB9783_1575585059655_gridworld_separate_init.png" width="" />
<br />
<i>
Figure 4: Example 1D lineworld and the corresponding behavior policy.
</i>
</p>
<p>In Figure 5(a), we show that the policy learned with a distribution-matching constraint can be arbitrarily suboptimal; in fact, the probability of reaching the goal G by rolling out this policy is very small and tends to 0 as the environment is made larger. However, in Figure 5(b), we show that a support constraint can recover an optimal policy with probability 1.</p>
<p style="text-align:center;">
<img src="https://paper-attachments.dropbox.com/s_03D8A88577B961181603AE5EDBD4A511CD8E828E7651B8AA640A61950DAB9783_1575585070806_policies_learned.png" width="" />
<br />
<i>
Figure 5: Policies learned via distribution-matching and support-matching in the 1D lineworld shown in Figure 4.
</i>
</p>
<p>Why does distribution-matching fail here? Let us analyze the case when we use a penalty for distribution-matching. If the penalty is enforced tightly, the agent is forced to mainly execute the wrong action (<script type="math/tex">\leftarrow</script>) in states between <script type="math/tex">S</script> and <script type="math/tex">G</script>, leading to suboptimal behavior. If, instead, the penalty is enforced loosely, with the intention of achieving a better policy than the behavior policy, the agent performs backups using the OOD action <script type="math/tex">\rightarrow</script> at states to the left of <script type="math/tex">S</script>, and these backups eventually affect the Q-value at state <script type="math/tex">S</script>. This leads to an incorrect Q-function, and hence a wrong policy – possibly one that goes left from <script type="math/tex">S</script> instead of moving towards <script type="math/tex">G</script>, since OOD-action backups combined with the overestimation bias in Q-learning may make action <script type="math/tex">\leftarrow</script> at state <script type="math/tex">S</script> look preferable. Figure 6 shows that some states need a strong penalty/constraint (to prevent OOD backups) while others require a weak one (to achieve optimality); this state-dependent strength cannot be achieved with conventional distribution-matching approaches.</p>
<p style="text-align:center;">
<img src="https://paper-attachments.dropbox.com/s_03D8A88577B961181603AE5EDBD4A511CD8E828E7651B8AA640A61950DAB9783_1575585085723_gridworld_analysis.png" width="600" />
<br />
<i>
Figure 6: Analysis of the strength of distribution-matching constraint needed at different states.
</i>
</p>
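<p>The lineworld argument can be checked with a few lines of tabular Q-iteration. The rewards, dynamics, and discount below are assumptions chosen for illustration rather than the exact values of Figure 4: because both actions have non-negligible behavior density at every state, a support constraint leaves both actions available to the backup, and Q-iteration recovers the optimal right-going policy even though the behavior policy mostly goes left:</p>

```python
import numpy as np

n, gamma = 6, 0.9                  # states 0..5; goal G at state 5 (terminal)
left = lambda s: max(s - 1, 0)
right = lambda s: min(s + 1, n - 1)
reward = lambda s2: 1.0 if s2 == n - 1 else 0.0

# Both actions are in the support of the behavior policy at every state,
# so support-constrained backups may use either of them.
Q = np.zeros((n, 2))
for _ in range(100):
    for s in range(n - 1):         # the goal state is absorbing
        for a, step in enumerate((left, right)):
            s2 = step(s)
            Q[s, a] = reward(s2) + gamma * (0.0 if s2 == n - 1 else Q[s2].max())

policy = Q[:n - 1].argmax(axis=1)  # action 1 ("right") at every non-goal state
```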
<h2 id="so-how-does-bear-perform-in-practice">So, how does BEAR perform in practice?</h2>
<p>In our experiments, we evaluated BEAR on three kinds of datasets, generated
by: (1) a partially-trained, medium-return policy, (2) a random, low-return
policy, and (3) an expert, high-return policy. Setting (1) resembles practical
settings such as robotics or autonomous driving, where offline data is
collected via scripted grasping policies or consists of (possibly imperfect)
human driving data. Such data is useful as it demonstrates non-random, but
still suboptimal, behavior, and we expect training on offline data to be most
useful in this setting. Good performance on <em>both</em> (2) and (3)
demonstrates the versatility of an algorithm across arbitrary dataset
compositions.</p>
<p>For each dataset composition, we compare BEAR to a number of baselines
including BC, BCQ, and deep Q-learning from demonstrations
(<a href="https://arxiv.org/abs/1704.03732">DQfD</a>). In general, we find that BEAR
outperforms the best performing baseline in setting (1), and BEAR is the only
algorithm capable of successfully learning a better-than-dataset policy in both
(2) and (3). We show some learning curves below. BC- and BCQ-style methods
usually do not perform well on random data, partly because of their use of a
distribution-matching constraint, as described earlier.</p>
<p style="text-align:center;">
<img src="https://paper-attachments.dropbox.com/s_03D8A88577B961181603AE5EDBD4A511CD8E828E7651B8AA640A61950DAB9783_1575194663346_Screen+Shot+2019-12-01+at+2.03.50+AM.png" width="" />
<br />
<i>
Figure 7: Performance on (1) medium-quality dataset: BEAR outperforms the best performing baseline.
</i>
</p>
<p style="text-align:center;">
<img src="https://paper-attachments.dropbox.com/s_03D8A88577B961181603AE5EDBD4A511CD8E828E7651B8AA640A61950DAB9783_1575194824925_Screen+Shot+2019-12-01+at+2.06.48+AM.png" width="600" />
<br />
<i>
Figure 8: (3) Expert data: BEAR recovers the performance in the expert dataset, and performs similarly to other methods such as BC.
</i>
</p>
<p style="text-align:center;">
<img src="https://paper-attachments.dropbox.com/s_03D8A88577B961181603AE5EDBD4A511CD8E828E7651B8AA640A61950DAB9783_1575194836848_Screen+Shot+2019-12-01+at+2.05.38+AM.png" width="600" />
<br />
<i>
Figure 9: (2) Random data: BEAR achieves better-than-dataset performance, and is close to the best performing algorithm (Naive RL).
</i>
</p>
<h2 id="future-directions-and-open-problems">Future Directions and Open Problems</h2>
<p>Most of the prior datasets for real-world problems, such as <a href="https://www.robonet.wiki/">RoboNet</a> and <a href="https://github.com/TorchCraft/StarData">StarCraft replays</a>, consist of multimodal behavior generated by different users and robots. Hence, one of the next steps is learning from a diverse mixture of policies. How can we effectively learn policies from a static dataset that consists of a diverse range of behavior – possibly interaction data from a diverse range of tasks, in the spirit of what we encounter in the real world? This question is mostly unanswered at the moment. Some very recent work, such as <a href="https://arxiv.org/abs/1907.04543">REM</a>, shows that simple modifications to existing distributional off-policy RL algorithms in the Atari domain can enable fully off-policy learning on the entire interaction data generated during the training run of a separate DQN agent. However, the best solution for learning from a dataset generated by an arbitrary mixture of policies – which is more likely the case in practical problems – is unclear.</p>
<p>A rigorous theoretical characterization of the best achievable policy as a function of a given dataset is also an open problem. In our paper, we analyze this question by looking at typically used assumptions of bounded concentrability in the error and convergence analysis of Fitted Q-iteration. Which other assumptions can be applied to analyze this problem? And which of these assumptions is least restrictive and practically feasible? What is the theoretical optimum of what can be achieved solely by offline training?</p>
<p>We hope that our work, BEAR, takes us a step closer to getting the most out of prior datasets in RL algorithms. A <strong>data-driven</strong> paradigm of RL, where one could (pre-)train RL algorithms with large amounts of prior data, will enable us to go beyond the active exploration bottleneck, giving us agents that can be deployed and keep learning continuously in the real world.</p>
<hr />
<p>This blog post is based on our recent paper:</p>
<ul>
<li><strong>Stabilizing Off-Policy Q-Learning via Bootstrapping Error Reduction</strong><br />
Aviral Kumar*, Justin Fu*, George Tucker, Sergey Levine<br />
<em>In Advances in Neural Information Processing Systems, 2019</em></li>
</ul>
<p>The <a href="https://arxiv.org/abs/1906.00949">paper</a> and code are available
<a href="https://github.com/aviralkumar2907/BEAR">online</a> and a slide-deck explaining
the algorithm is available
<a href="https://sites.google.com/view/bear-off-policyrl">here</a>. <em>I would like to thank
Sergey Levine for his valuable feedback on earlier versions of this blog post.</em></p>
Thu, 05 Dec 2019 01:00:00 -0800
https://bairblog.github.io/2019/12/05/bear/
https://bairblog.github.io/2019/12/05/bear/