The Berkeley Artificial Intelligence Research Blog
https://bairblog.github.io/
The Successor Representation, $\gamma$-Models, and Infinite-Horizon Prediction
<!-- twitter -->
<meta name="twitter:title" content="The successor representation, γ-models, & infinite-horizon prediction" />
<meta name="twitter:card" content="summary_large_image" />
<meta name="twitter:image" content="https://bair.berkeley.edu/static/blog/successor/twitter-card-0.98-01.png" />
<meta name="keywords" content="successor, representation, SR, gamma-models, gamma, models, reinforcement, learning, generative, temporal, difference" />
<meta name="description" content="The BAIR Blog" />
<meta name="author" content="Michael Janner" />
<title>The Successor Representation, Gamma-Models, and Infinite-Horizon Prediction</title>
<script>
function increment_img(id) {
  // Cycle the trailing two-digit index in the image filename
  // (rollout_00.png -> rollout_01.png -> ... -> rollout_09.png -> rollout_00.png).
  const element = document.getElementById(id);
  const src = element.src;
  const ind_string = src.substring(src.length - 6, src.length - 4);
  const next_ind = (parseInt(ind_string, 10) + 1) % 10;
  element.src = src.replace(ind_string, next_ind.toString().padStart(2, '0'));
}
setInterval(function() {
  increment_img('rollout');
}, 1000);
</script>
<!-- begin section I: introduction -->
<p><br /></p>
<p style="text-align:center;">
<img src="https://bair.berkeley.edu/static/blog/successor/gamma-teaser.png" width="100%" />
<br />
<i>Standard single-step models have a horizon of one. This post describes a method for training predictive dynamics models in continuous state spaces with an infinite, probabilistic horizon.</i>
</p>
<p><br /></p>
<p>
Reinforcement learning algorithms are frequently categorized by whether they predict future states at any point in their decision-making process. Those that do are called <i>model-based</i>, and those that do not are dubbed <i>model-free</i>. This classification is so common that we mostly take it for granted these days; I am <a href="https://bair.berkeley.edu/blog/2019/12/12/mbpo/">guilty of using it myself</a>. However, this distinction is not as clear-cut as it may initially seem.
</p>
<p>
In this post, I will talk about an alternative view that emphasizes the mechanism of prediction instead of the content of prediction. This shift in focus brings into relief a space between model-based and model-free methods that contains exciting directions for reinforcement learning. The first half of this post describes some of the classic tools in this space, including
<a href="https://www.cs.swarthmore.edu/~meeden/DevelopmentalRobotics/horde1.pdf">generalized value functions</a> and the <a href="http://www.gatsby.ucl.ac.uk/~dayan/papers/d93b.pdf">successor representation</a>. The latter half is based on our recent paper about <a href="https://arxiv.org/abs/2010.14496">infinite-horizon predictive models</a>, for which code is available <a href="https://github.com/JannerM/gamma-models">here</a>.
</p>
<!--more-->
<!-- begin section II: what-versus-how -->
<h2 id="what-how">The <i>what</i> versus <i>how</i> of prediction</h2>
<p>
The dichotomy between model-based and model-free algorithms focuses on what is predicted directly: states or values. Instead, I want to focus on how these predictions are made, and specifically how these approaches deal with the complexities arising from long horizons.
</p>
<p>
Dynamics models, for instance, approximate a single-step transition distribution, meaning that they are trained on a prediction problem with a horizon of one. In order to make a short-horizon model useful for long-horizon queries, its single-step predictions are composed in the form of sequential model-based rollouts. We could say that the “testing” horizon of a model-based method is that of the rollout.
</p>
<p>
In contrast, value functions themselves are long-horizon predictors; they need not be used in the context of rollouts because they already contain information about the extended future. In order to amortize this long-horizon prediction, value functions are trained with either Monte Carlo estimates of expected cumulative reward or with dynamic programming. The important distinction is now that the long-horizon nature of the prediction task is dealt with during training instead of during testing.
</p>
<p><br /></p>
<center>
<div style="width: 90%;">
<div style="width: 45%; float: left;">
<img src="https://bair.berkeley.edu/static/blog/successor/mb-mf.png" width="100%" />
</div>
<div style="width: 47%; float: right;">
<p>
<br />
<i>We can organize reinforcement learning algorithms in terms of when they deal with long-horizon complexity. Dynamics models train for a short-horizon prediction task but are deployed using long-horizon rollouts. In contrast, value functions amortize the work of long-horizon prediction at training, so a single-step prediction (and informally, a shorter "horizon") is sufficient during testing.</i>
<br /><br />
</p>
</div>
<br />
</div>
</center>
<p><br clear="left" />
<br /></p>
<p>
Taking this view, the fact that models predict states and value functions predict cumulative rewards is almost a detail. What really matters is that models predict <i>immediate</i> next states and value functions predict <i>long-term sums</i> of rewards. This idea is nicely summarized in a line of work on <a href="https://www.cs.swarthmore.edu/~meeden/DevelopmentalRobotics/horde1.pdf">generalized</a> <a href="https://sites.ualberta.ca/~amw8/phd.pdf">value</a> <a href="http://incompleteideas.net/papers/maei-sutton-10.pdf">functions</a>, describing how temporal difference learning may be used to make long-horizon predictions about any kind of cumulant, of which a reward function is simply one example.
</p>
<p>
This framing also suggests that some phenomena we currently think of as distinct, like <a href="https://arxiv.org/abs/1906.08253">compounding model prediction errors</a> and <a href="https://arxiv.org/abs/1906.00949">bootstrap error accumulation</a>, might actually be different lenses on the same problem. The former describes the growth in error over the course of a model-based rollout, and the latter describes the propagation of error via the Bellman backup in model-free reinforcement learning. If models and value functions differ primarily in when they deal with horizon-based difficulties, then it should come as no surprise that the testing-time error compounding of models has a direct training-time analogue in value functions.
</p>
<p>
A final reason to be interested in this alternative categorization is that it allows us to think about hybrids that do not make sense under the standard dichotomy. For example, if a model were to make long-horizon state predictions by virtue of training-time amortization, it would avoid the need for sequential model-based rollouts and circumvent testing-time compounding errors. The remaining sections describe how we can build such a model, beginning with the foundation of the successor representation and then introducing new work for making this form of prediction compatible with continuous spaces and neural samplers.
</p>
<!-- begin section III: the successor representation -->
<h2 id="model-based-techniques">The successor representation</h2>
<p>
The <a href="http://www.gatsby.ucl.ac.uk/~dayan/papers/d93b.pdf">successor representation</a> (SR), an idea influential in both <a href="https://www.nature.com/articles/s41562-017-0180-8">cognitive</a> <a href="https://www.jneurosci.org/content/38/33/7193">science</a> and <a href="https://arxiv.org/abs/1606.02396">machine</a> <a href="https://arxiv.org/abs/1606.05312">learning</a>, is a long-horizon, policy-dependent dynamics model. It leverages the insight that the same type of recurrence relation used to train \(Q\)-functions:
\[
Q(\mathbf{s}_t, \mathbf{a}_t) \leftarrow
\mathbb{E}_{\mathbf{s}_{t+1}}
[r(\mathbf{s}_{t}, \mathbf{a}_t, \mathbf{s}_{t+1}) + \gamma V(\mathbf{s}_{t+1})]
\]
may also be used to train a model that predicts states instead of values:
\[
M(\mathbf{s}_{t}, \mathbf{a}_t) \leftarrow
\mathbb{E}_{\mathbf{s}_{t+1}}
[\mathbf{1}(\mathbf{s}_{t+1}) + \gamma M(\mathbf{s}_{t+1})] \tag*{(1)}
\]
</p>
<p>
The key difference between the two is that the scalar rewards \(r(\mathbf{s}_t, \mathbf{a}_t, \mathbf{s}_{t+1})\) from the \(Q\)-function recurrence are now replaced with one-hot indicator vectors \(\mathbf{1}(\mathbf{s}_{t+1})\) denoting states. As such, SR training may be thought of as vector-valued \(Q\)-learning. The size of the “reward” vector, as well as the successor predictions \(M(\mathbf{s}_t, \mathbf{a}_t)\) and \(M(\mathbf{s}_t)\), is equal to the number of states in the MDP.
</p>
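<p>As a concrete illustration, the fixed point of Equation 1 can be computed exactly in a small tabular MDP. The sketch below, assuming a hypothetical hand-picked 3-state chain, iterates the state-only form of the backup and checks it against the closed-form solution:</p>

```python
import numpy as np

# A minimal tabular sketch of the successor representation under a fixed
# policy, on a hypothetical 3-state chain. P[s, s'] is the policy-induced
# transition matrix; the state-only form of Equation 1 reads
# M(s) <- E_{s'}[1(s') + gamma * M(s')], with fixed point
# M = (I - gamma * P)^{-1} P.
gamma = 0.9
P = np.array([[0.0, 1.0, 0.0],
              [0.0, 0.0, 1.0],
              [0.0, 0.0, 1.0]])  # the last state is absorbing

# Iterate the vector-valued "Q-learning" backup...
M = np.zeros((3, 3))
for _ in range(1000):
    M = P + gamma * P @ M

# ...and compare against the closed-form fixed point.
M_closed = np.linalg.solve(np.eye(3) - gamma * P, P)
assert np.allclose(M, M_closed)

# Rows of (1 - gamma) * M are discounted occupancies, so each sums to one.
assert np.allclose((1 - gamma) * M.sum(axis=1), 1.0)
```

<p>With a reward vector \(r\), values are then recovered as the inner product \(V = M r\), which is what makes the SR useful for rapid re-evaluation under changing rewards.</p>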
<p>
In contrast to standard dynamics models, which approximate a single-step transition distribution, SR approximates what is known as the discounted occupancy:
\[
\mu(\mathbf{s}_e \mid \mathbf{s}_t, \mathbf{a}_t) = (1 - \gamma)
\sum_{\Delta t=1}^{\infty} \gamma^{\Delta t - 1}
p(
\mathbf{s}_{t+\Delta t} = \mathbf{s}_e \mid
\mathbf{s}_t, \mathbf{a}_t, \pi
)
\]
</p>
<p>
This occupancy is a weighted mixture over an infinite series of multi-step models, with the mixture weights being controlled by a discount factor \(\gamma\).<sup id="fnref:exit-state"><a href="#fn:exit-state" class="footnote"><font size="-2">1</font></a></sup> <sup id="fnref:options"><a href="#fn:options" class="footnote"><font size="-2">2</font></a></sup> Setting \(\gamma=0\) recovers a standard single-step model, and any \(\gamma \in (0,1)\) induces a model with an infinite, probabilistic horizon. The predictive lookahead of the model qualitatively increases with larger \(\gamma\).
</p>
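<p>Sampling from the discounted occupancy has a simple procedural interpretation, sketched below with a hypothetical single-step simulator <code>step</code>: flip a \(\gamma\)-weighted coin to decide a horizon, then roll the dynamics forward that many steps.</p>

```python
import random

# A sketch of sampling from the discounted occupancy: draw a horizon dt
# with probability (1 - gamma) * gamma**(dt - 1), then roll a hypothetical
# single-step simulator `step` forward dt times.
def sample_occupancy(s, step, gamma=0.95):
    dt = 1
    while random.random() < gamma:  # continue with probability gamma
        dt += 1
    for _ in range(dt):
        s = step(s)
    return s
```

<p>Averaging indicator functions of these samples recovers \(\mu\); setting <code>gamma=0</code> makes every sample a single-step transition, matching the claim that \(\gamma=0\) recovers a standard model.</p>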
<p style="text-align:center;">
<img src="https://bair.berkeley.edu/static/blog/successor/tolman.gif" width="60%" />
<br />
<i>The successor representation of a(n optimal) rat in a maze<sup id="fnref:maze"><a href="#fn:maze" class="footnote"><font size="-2">3</font></a></sup>, showing the rat’s path with a probabilistic horizon determined by discount factor \(\gamma\).
</i>
</p>
<!-- begin section III: gamma-models -->
<h2 id="gamma-models">Generative models in continuous spaces: from SR to \(\boldsymbol{\gamma}\)-models</h2>
<p>
<a href="https://arxiv.org/abs/1606.02396">Continuous</a> <a href="https://arxiv.org/abs/1606.05312">adaptations</a> of SR replace the one-hot state indicator \(\mathbf{1}(\mathbf{s}_t)\) in Equation 1 with a learned state featurization \(\phi(\mathbf{s}_t, \mathbf{a}_t)\), giving a recurrence of the form:
\[
\psi(\mathbf{s}_t, \mathbf{a}_t) \leftarrow \phi(\mathbf{s}_t, \mathbf{a}_t) + \gamma
\mathbb{E}_{\mathbf{s}_{t+1}} [\psi(\mathbf{s}_{t+1})]
\]
</p>
<p>
This is not a generative model in the usual sense, but is instead known as an expectation model: \(\psi\) denotes the expected feature vector \(\phi\). The advantage of this approach is that an expectation model is easier to train than a generative model. Moreover, if rewards are linear in the features, an expectation model is sufficient for value estimation.
</p>
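<p>The sufficiency of an expectation model under linear rewards can be verified directly in a small tabular example. The sketch below assumes a hypothetical 3-state MDP with random features standing in for a learned \(\phi\):</p>

```python
import numpy as np

# A tabular sketch of why an expectation model suffices for value
# estimation when rewards are linear in the features. Phi stands in for a
# learned featurization; the 3-state MDP is hypothetical.
rng = np.random.default_rng(0)
gamma = 0.9
P = np.array([[0.0, 1.0, 0.0],
              [0.0, 0.0, 1.0],
              [1.0, 0.0, 0.0]])
Phi = rng.normal(size=(3, 2))   # phi(s) for each state
w = rng.normal(size=2)          # rewards r(s) = phi(s) @ w

# Fixed point of the recurrence psi(s) = phi(s) + gamma * E[psi(s')]:
# Psi = (I - gamma * P)^{-1} Phi
Psi = np.linalg.solve(np.eye(3) - gamma * P, Phi)

# Values from the expectation model match the true value function
# V = (I - gamma * P)^{-1} (Phi @ w), because both are linear in Phi.
V_true = np.linalg.solve(np.eye(3) - gamma * P, Phi @ w)
assert np.allclose(Psi @ w, V_true)
```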
<p>
However, the limitation of an expectation model is that it cannot be employed in some of the most common use-cases of predictive dynamics models. Because \(\psi(\mathbf{s}_t, \mathbf{a}_t)\) only predicts a first moment, we cannot use it to sample future states or perform model-based rollouts.
</p>
<p>
To overcome this limitation, we can turn the discriminative update used in SR and its continuous variants into one suitable for training a generative model \({\color{#D62728}\mu}\):
\[
\max_{\color{#D62728}\mu} \mathbb{E}_{\mathbf{s}_t, \mathbf{a}_t, \mathbf{s}_{t+1} \sim \mathcal{D}} [ \mathbb{E}_{
\mathbf{s}_e \sim (1-\gamma) p(\cdot \mid \mathbf{s}_t, \mathbf{a}_t) + \gamma
{\color{#D62728}\mu}(\cdot \mid \mathbf{s}_{t+1})
}
[\log {\color{#D62728}\mu}(\mathbf{s}_e \mid \mathbf{s}_t, \mathbf{a}_t)] ]
\]
</p>
<p>
At first glance, this looks like a standard maximum likelihood objective. The important difference is that the distribution over which the inner expectation is evaluated depends on the model \({\color{#D62728}\mu}\) itself. Instead of a bootstrapped target value like those commonly used in model-free algorithms, we now have a bootstrapped target distribution.
\[
\underset{
\vphantom{\Huge\Sigma}
\Large \text{target value}
}{
r + \gamma V
}
~~~~~~~~ \Longleftrightarrow ~~~~~~~~
\underset{
\vphantom{\Huge\Sigma}
\Large \text{target }\color{#D62728}{\text{distribution}}
}{
(1-\gamma) p + \gamma {\color{#D62728}\mu}
}
\]
Varying the discount factor \(\gamma\) in the target distribution yields models that predict increasingly far into the future.
</p>
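<p>In practice, drawing a training target \(\mathbf{s}_e\) from the bootstrapped target distribution is a single coin flip, sketched below with a hypothetical sampler <code>sample_model</code> for the current model \(\mu(\cdot \mid \mathbf{s})\):</p>

```python
import random

# A sketch of drawing a training target from the bootstrapped target
# distribution (1 - gamma) * p + gamma * mu: with probability 1 - gamma the
# target is the observed next state, otherwise it is a state sampled from
# the current model conditioned on that next state. `sample_model` is a
# hypothetical sampler for mu(. | s).
def sample_target(s_next, sample_model, gamma=0.95):
    if random.random() < 1 - gamma:
        return s_next             # single-step component p
    return sample_model(s_next)   # bootstrapped component mu
```

<p>At <code>gamma=0</code> every target is the observed next state and training reduces to standard single-step maximum likelihood.</p>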
<p><br /></p>
<p style="text-align:center;">
<img src="https://bair.berkeley.edu/static/blog/successor/gamma-flow-acrobot.png" width="100%" />
<br />
<br />
<i>Predictions of a \(\gamma\)-model for varying discounts \(\gamma\). The rightmost column shows Monte Carlo estimates of the discounted occupancy corresponding to \(\gamma=0.95\) for reference. The conditioning state is denoted by \(\circ\).
</i>
</p>
<p><br /></p>
<p>
In the spirit of infinite-horizon model-free control, we refer to this formulation as infinite-horizon prediction and the corresponding model as a \(\gamma\)-model. Because the bootstrapped maximum likelihood objective circumvents the need for reward vectors the size of the state space, \(\gamma\)-model training is suitable for continuous spaces while retaining an interpretation as a generative model. In our paper we show how to instantiate \(\gamma\)-models as both <a href="https://arxiv.org/abs/1505.05770">normalizing flows</a> and <a href="https://arxiv.org/abs/1406.2661">generative adversarial networks</a>.
</p>
<!-- begin section IV: model-based control -->
<h2 id="model-based-control">Generalizing model-based control with \(\boldsymbol{\gamma}\)-models</h2>
<p>
Replacing single-step dynamics models with \(\gamma\)-models leads to generalizations of some of the staples of model-based control:
</p>
<p>
<strong>Rollouts:</strong> \(\gamma\)-models divorce timestep from model step. As opposed to incrementing one timestep into the future with every prediction, \(\gamma\)-model rollout steps have a negative binomial distribution over time. It is possible to reweight these \(\gamma\)-model steps to simulate the predictions of a model trained with higher discount.
</p>
<p style="text-align:center;">
<img id="rollout" src="https://bair.berkeley.edu/static/blog/successor/rollout_00.png" width="100%" />
<br />
<i>Whereas conventional dynamics models predict a single step into the future, \(\gamma\)-model rollout steps have a negative binomial distribution over time. The first step of a \(\gamma\)-model has a geometric distribution from the special case of \(~\text{NegBinom}(1, p) = \text{Geom}(1-p)\).</i>
</p>
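<p>The time elapsed by a \(\gamma\)-model rollout can be simulated directly; the sketch below composes geometric jumps, so the total after \(n\) model steps is negative binomially distributed as described above:</p>

```python
import random

# A sketch of how gamma-model rollout steps relate to environment
# timesteps: each model step jumps Geom(1 - gamma) timesteps into the
# future, so the total time elapsed after n steps is distributed as
# NegBinom(n, 1 - gamma).
def rollout_time(n_steps, gamma=0.95):
    total = 0
    for _ in range(n_steps):
        dt = 1                        # at least one timestep per model step
        while random.random() < gamma:
            dt += 1
        total += dt
    return total
```

<p>With <code>gamma=0</code> each model step advances exactly one timestep, recovering a conventional rollout.</p>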
<p><br /></p>
<p>
<strong>Value estimation:</strong> Single-step models estimate values using long model-based rollouts, often between tens and hundreds of steps long. In contrast, values are expectations over a single feedforward pass of a \(\gamma\)-model. This is similar to a decomposition of value as an inner product, as seen in <a href="https://arxiv.org/abs/1606.05312">successor features</a> and <a href="https://arxiv.org/abs/1606.02396">deep SR</a>. In tabular spaces with indicator rewards, the inner product and expectation are the same!
</p>
<div>
<center>
<figure class="video_container">
<video width="80%" height="auto" autoplay="" loop="" playsinline="" muted="">
<source src="https://bair.berkeley.edu/static/blog/successor/value_estimation.mp4" type="video/mp4" />
</video>
</figure>
</center>
<p style="text-align:center;">
<i>Because values are expectations of reward over a single step of a \(\gamma\)-model, we can perform value estimation without sequential model-based rollouts.</i>
</p>
</div>
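<p>A sketch of this value estimator, assuming hypothetical stand-ins <code>sample_gamma_model</code> and <code>reward</code>: in the spirit of the paper, a value is a rescaled expectation of reward over samples from a single model step.</p>

```python
import statistics

# A sketch of value estimation with a gamma-model:
# V(s, a) ~= E_{s_e ~ mu(.|s, a)}[r(s_e)] / (1 - gamma),
# estimated with a single batch of model samples rather than a
# sequential rollout. `sample_gamma_model` and `reward` are
# hypothetical stand-ins.
def estimate_value(s, a, sample_gamma_model, reward, gamma=0.95, n=128):
    rewards = [reward(sample_gamma_model(s, a)) for _ in range(n)]
    return statistics.fmean(rewards) / (1 - gamma)
```

<p>The \(1/(1-\gamma)\) rescaling undoes the normalization of the discounted occupancy, so a constant reward of 1 yields the familiar value \(1/(1-\gamma)\).</p>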
<p><br /></p>
<p>
<strong>Terminal value functions:</strong> To account for truncation error in single-step model-based rollouts, it is common to augment the rollout with a terminal value function. This strategy, sometimes referred to as <a href="https://arxiv.org/abs/1803.00101">model-based value expansion</a> (MVE), has an abrupt handoff between the model-based rollout and the model-free value function. We can derive an analogous strategy with a \(\gamma\)-model, called \(\gamma\)-MVE, that features a gradual transition between model-based and model-free value estimation. This value estimation strategy can be incorporated into a model-based reinforcement learning algorithm for improved sample-efficiency.
</p>
<p><br /></p>
<p style="text-align:center;">
<img src="https://bair.berkeley.edu/static/blog/successor/consolidated.png" width="100%" />
<br />
<i>
\(\gamma\)-MVE features a gradual transition between model-based and model-free value estimation.
</i>
</p>
<hr />
<p>
This post is based on the following paper:
</p>
<ul>
<li>
<a href="https://arxiv.org/abs/2010.14496"><strong>\(\gamma\)-Models: Generative Temporal Difference Learning for Infinite-Horizon Prediction</strong></a>
<br />
<a href="https://people.eecs.berkeley.edu/~janner/">Michael Janner</a>, <a href="https://scholar.google.com/citations?user=Vzr1RukAAAAJ&hl=en">Igor Mordatch</a>, and <a href="https://people.eecs.berkeley.edu/~svlevine/">Sergey Levine</a>
<br />
<em>Neural Information Processing Systems (NeurIPS), 2020.</em>
<br />
<a href="https://github.com/JannerM/gamma-models">Open-source code</a> (runs in your browser!)
</li>
</ul>
<hr />
<div class="footnotes">
<ol>
<li id="fn:exit-state">
<p>
The \(e\) subscript in \(\mathbf{s}_e\) is short for "exit", which comes from an interpretation of the discounted occupancy as the exit state in a modified MDP in which there is a constant \(1-\gamma\) probability of termination at each timestep.<a href="#fnref:exit-state" class="reversefootnote">↩</a>
</p>
</li>
<li id="fn:options">
<p>
Because the discounted occupancy plays such a central role in reinforcement learning, its approximation by Bellman equations has been a focus in multiple lines of research. For example, <a href="https://people.cs.umass.edu/~barto/courses/cs687/Sutton-Precup-Singh-AIJ99.pdf">option models</a> and <a href="http://www.incompleteideas.net/papers/sutton-95.pdf">\(\beta\)-models</a> describe generalizations of this idea that allow for state-dependent termination conditions and arbitrary timestep mixtures.<a href="#fnref:options" class="reversefootnote">↩</a>
</p>
</li>
<li id="fn:maze">
<p>
If this particular maze looks familiar, you might have seen it in Tolman’s <a href="https://personal.utdallas.edu/~tres/spatial/tolman.pdf">Cognitive Maps in Rats and Men</a>. (Our web version has been stretched slightly horizontally.)<a href="#fnref:maze" class="reversefootnote">↩</a>
</p>
</li>
</ol>
</div>
<hr />
<p>
<font size="-1">
<strong>References</strong>
<ol style="margin-top:-15px">
<li>A Barreto, W Dabney, R Munos, JJ Hunt, T Schaul, HP van Hasselt, and D Silver. <a href="https://arxiv.org/abs/1606.05312">Successor features for transfer in reinforcement learning.</a> <i>NeurIPS</i> 2017.</li>
<li>P Dayan. <a href="http://www.gatsby.ucl.ac.uk/~dayan/papers/d93b.pdf">Improving generalization for temporal difference learning: The successor representation.</a> <i>Neural Computation</i> 1993.</li>
<li>Vladimir Feinberg, Alvin Wan, Ion Stoica, Michael I. Jordan, Joseph E. Gonzalez, Sergey Levine. <a href="https://arxiv.org/abs/1803.00101">Model-Based Value Estimation for Efficient Model-Free Reinforcement Learning.</a> <i>ICML</i> 2018.</li>
<li>SJ Gershman. <a href="https://www.jneurosci.org/content/38/33/7193">The successor representation: Its computational logic and neural substrates.</a> <i>Journal of Neuroscience</i> 2018.</li>
<li>IJ Goodfellow, J Pouget-Abadie, M Mirza, B Xu, D Warde-Farley, S Ozair, A Courville, Y Bengio. <a href="https://arxiv.org/abs/1406.2661">Generative Adversarial Networks.</a> <i>NeurIPS</i> 2014.</li>
<li>M Janner, J Fu, M Zhang, S Levine. <a href="https://arxiv.org/abs/1906.08253">When to Trust Your Model: Model-Based Policy Optimization.</a> <i>NeurIPS</i> 2019.</li>
<li>TD Kulkarni, A Saeedi, S Gautam, and SJ Gershman. <a href="https://arxiv.org/abs/1606.02396">Deep successor reinforcement learning.</a> 2016.</li>
<li>A Kumar, J Fu, G Tucker, S Levine. <a href="https://arxiv.org/abs/1906.00949">Stabilizing Off-Policy \(Q\)-Learning via Bootstrapping Error Reduction.</a> <i>NeurIPS</i> 2019.</li>
<li>HR Maei and RS Sutton. <a href="http://incompleteideas.net/papers/maei-sutton-10.pdf">GQ(\(\lambda\)): A general gradient algorithm for temporal-difference prediction learning with eligibility traces.</a> <i>AGI</i> 2010.</li>
<li>I Momennejad, EM Russek, JH Cheong, MM Botvinick, ND Daw, and SJ Gershman. <a href="https://www.nature.com/articles/s41562-017-0180-8">The successor representation in human reinforcement learning.</a> <i>Nature Human Behaviour</i> 2017.</li>
<li>DJ Rezende and S Mohamed. <a href="https://arxiv.org/abs/1505.05770">Variational Inference with Normalizing Flows.</a> <i>ICML</i> 2015.</li>
<li>RS Sutton. <a href="http://www.incompleteideas.net/papers/sutton-95.pdf">TD Models: Modeling the World at a Mixture of Time Scales.</a> <i>ICML</i> 1995.</li>
<li>RS Sutton, J Modayil, M Delp, T Degris, PM Pilarski, A White, and D Precup. <a href="https://www.cs.swarthmore.edu/~meeden/DevelopmentalRobotics/horde1.pdf">Horde: A scalable real-time architecture for learning knowledge from unsupervised sensorimotor interaction.</a> <i>AAMAS</i> 2011.</li>
<li>RS Sutton, D Precup, and S Singh. <a href="https://people.cs.umass.edu/~barto/courses/cs687/Sutton-Precup-Singh-AIJ99.pdf">Between MDPs and semi-MDPs: A framework for temporal abstraction in reinforcement learning.</a> <i>Artificial Intelligence</i> 1999.</li>
<li>E Tolman. <a href="https://personal.utdallas.edu/~tres/spatial/tolman.pdf">Cognitive Maps in Rats and Men.</a> <i>Psychological Review</i> 1948.</li>
<li>A White. <a href="https://sites.ualberta.ca/~amw8/phd.pdf">Developing a predictive approach to knowledge.</a> PhD thesis, 2015.</li>
</ol>
</font>
</p>
Tue, 05 Jan 2021 01:00:00 -0800
https://bairblog.github.io/2021/01/05/successor/
https://bairblog.github.io/2021/01/05/successor/
Does GPT-2 Know Your Phone Number?
<meta name="twitter:title" content="Does GPT-2 Know Your Phone Number?" />
<meta name="twitter:card" content="summary_image" />
<meta name="twitter:image" content="https://bair.berkeley.edu/static/blog/lmmem/fig1.png" />
<p>Most likely not.</p>
<p>Yet, OpenAI’s <a href="https://openai.com/blog/better-language-models/">GPT-2 language model</a> <em>does</em> know how to reach a certain Peter W<mark style="background-color: black; color: black">---</mark> (name redacted for privacy). When prompted with a short snippet of Internet text, the model accurately generates Peter’s contact information, including his work address, email, phone, and fax:</p>
<p style="text-align:center;">
<img src="https://bair.berkeley.edu/static/blog/lmmem/fig1.png" width="40%" />
<br />
</p>
<p>In our <a href="https://arxiv.org/abs/2012.07805">recent paper</a>, we evaluate how large language models <em>memorize</em> and <em>regurgitate</em> such rare snippets of their training data. <strong>We focus on GPT-2 and find that at least 0.1% of its text generations (a very conservative estimate) contain long verbatim strings that are “copy-pasted” from a document in its training set.</strong></p>
<p>Such memorization would be an obvious issue for language models that are trained on private data, e.g., on users’ <a href="https://www.blog.google/products/gmail/subject-write-emails-faster-smart-compose-gmail/">emails</a>, as the model might inadvertently output a user’s sensitive conversations. Yet, even for models that are trained on <em>public</em> data from the Web (e.g., GPT-2, <a href="https://arxiv.org/abs/2005.14165">GPT-3</a>, <a href="https://arxiv.org/abs/1910.10683">T5</a>, <a href="https://arxiv.org/abs/1907.11692">RoBERTa</a>, <a href="https://www.microsoft.com/en-us/research/blog/turing-nlg-a-17-billion-parameter-language-model-by-microsoft/">TuringNLG</a>), memorization of training data raises multiple challenging regulatory questions, ranging from misuse of personally identifiable information to copyright infringement.</p>
<!--more-->
<h2 id="extracting-memorized-training-data">Extracting Memorized Training Data</h2>
<p>Regular readers of the BAIR blog may be familiar with the issue of data memorization in language models. <a href="https://bair.berkeley.edu/blog/2019/08/13/memorization/">Last year</a>, our co-author Nicholas Carlini described a paper that tackled a simpler problem: measuring memorization of a specific sentence (e.g., a credit card number) that was explicitly injected into the model’s training set.</p>
<p>In contrast, our aim is to extract <em>naturally occurring data</em> that a language model has memorized. This problem is more challenging, as we do not know a priori what kind of text to look for. Maybe the model memorized credit card numbers, or maybe it memorized entire book passages, or even code snippets.</p>
<p>Note that since large language models exhibit minimal overfitting (their train and test losses are nearly identical), we know that memorization, if it occurs, must be a rare phenomenon. <a href="https://arxiv.org/abs/2012.07805">Our paper</a> describes how to find such examples using the following two-step “extraction attack”:</p>
<ul>
<li>
<p>First, we generate a large number of samples by interacting with GPT-2 as a black-box (i.e., we feed it short prompts and collect generated samples).</p>
</li>
<li>
<p>Second, we keep generated samples that have an abnormally high likelihood. For example, we retain any sample on which GPT-2 assigns a much higher likelihood than a different language model (e.g., a smaller variant of GPT-2).</p>
</li>
</ul>
<p style="text-align:center;">
<img src="https://bair.berkeley.edu/static/blog/lmmem/fig2.png" width="100%" />
<br />
</p>
<p>We generated a total of 600,000 samples by querying GPT-2 with three different sampling strategies. Each sample contains 256 tokens, or roughly 200 words on average. Among these samples, we selected 1,800 samples with abnormally high likelihood for manual inspection. <strong>Out of the 1,800 samples, we found 604 that contain text which is reproduced verbatim from the training set.</strong></p>
<p>Our paper shows that some instantiations of the above extraction attack can reach up to 70% precision in identifying rare memorized data. In the rest of this post, we focus on <strong>what</strong> we found lurking in the memorized outputs.</p>
<h2 id="problematic-data-memorization">Problematic Data Memorization</h2>
<p>We were surprised by the diversity of the memorized data. The model re-generated lists of news headlines, Donald Trump speeches, pieces of software logs, entire software licenses, snippets of source code, passages from the Bible and Quran, the first 800 digits of pi, and much more!</p>
<p>The figure below summarizes some of the most prominent categories of memorized data.</p>
<p style="text-align:center;">
<img src="https://bair.berkeley.edu/static/blog/lmmem/fig3.png" width="100%" />
<br />
</p>
<p>While some forms of memorization are fairly benign (e.g., memorizing the digits of pi), others are much more problematic. Below, we showcase the model’s ability to memorize personally identifiable data and copyrighted text, and discuss the yet-to-be-determined legal ramifications of such behavior in machine learning models.</p>
<h2 id="memorization-of-personally-identifiable-information">Memorization of Personally Identifiable Information</h2>
<p>Recall GPT-2’s intimate knowledge of Peter W. An Internet search shows that Peter’s information is available on the Web, but only on six professional pages.</p>
<p style="text-align:center;">
<img src="https://bair.berkeley.edu/static/blog/lmmem/fig4.png" width="100%" />
<br />
</p>
<p>Peter’s case is not unique: about 13% of the memorized examples contain names or contact information (emails, twitter handles, phone numbers, etc.) of both individuals and companies. And while none of this personal information is “secret” (anyone can find it online), its inclusion in a language model still poses numerous privacy concerns. In particular, it might violate user-privacy legislations such as the GDPR, as described below.</p>
<h4>Violations of Contextual Integrity and Data Security</h4>
<p>When Peter put his contact information online, it had an intended <em>context of use</em>. Unfortunately, applications built on top of GPT-2 are unaware of this context, and might thus unintentionally share Peter’s data in ways he did not intend. For example, Peter’s contact information might be inadvertently output by a customer service chatbot.</p>
<p>To make matters worse, we found numerous cases of GPT-2 generating memorized personal information in contexts that can be deemed offensive or otherwise inappropriate. In one instance, GPT-2 generates <em>fictitious</em> IRC conversations between two real users on the topic of transgender rights. A redacted snippet is shown below:</p>
<blockquote>
<p>[2015-03-11 14:04:11] <mark style="background-color: black; color: black">------</mark> or if you’re a trans woman <br />
[2015-03-11 14:04:13] <mark style="background-color: black; color: black">------</mark> you can still have that <br />
[2015-03-11 14:04:20] <mark style="background-color: black; color: black">------</mark> if you want your dick to be the same <br />
[2015-03-11 14:04:25] <mark style="background-color: black; color: black">------</mark> as a trans person <br /></p>
</blockquote>
<p>The specific usernames in this conversation only appear <em>twice</em> on the entire Web, both times in private IRC logs that were leaked online as part of the <a href="https://en.wikipedia.org/wiki/Gamergate_controversy">GamerGate harassment campaign.</a></p>
<p>In another case, the model generates a news story about the murder of M. R. (a real event). However, GPT-2 incorrectly attributes the murder to A. D., who was in fact a murder <em>victim</em> in an unrelated crime.</p>
<blockquote>
<p>A<mark style="background-color: black; color: black">---</mark> D<mark style="background-color: black; color: black">---</mark>, 35, was indicted by a grand jury in April, and was arrested after a police officer found the bodies of his wife, M<mark style="background-color: black; color: black">---</mark> R<mark style="background-color: black; color: black">---</mark>, 36, and daughter</p>
</blockquote>
<p>These examples illustrate how personal information being present in a language model can be much more problematic than it being present in systems with more limited scopes. For example, search engines also scrape personal data from the Web but only output it in a well-defined context (the search results). Misuse of personal data can present serious legal issues. For example, the <a href="https://gdpr-info.eu/art-5-gdpr/">GDPR</a> in the European Union states:</p>
<blockquote>
<p><em>“personal data shall be […] collected for specified, explicit and legitimate purposes and not further processed in a manner that is incompatible with those purposes […] [and] processed in a manner that ensures appropriate security of the personal data”</em></p>
</blockquote>
<p>Memorizing personal data likely does not constitute “appropriate security”, and there is an argument that the data’s implicit inclusion in the outputs of downstream systems is not compatible with the original purpose of data collection, i.e., generic language modeling.</p>
<p>Aside from data misuse violations, misrepresenting individuals’ personal information in inappropriate contexts also touches on existing privacy regulations guarding against defamation or <a href="https://en.wikipedia.org/wiki/False_light">false light</a> torts. Similarly, misrepresenting companies or product names could violate trademark laws.</p>
<h4>Invoking the “Right To Be Forgotten” </h4>
<p>The above data misuses could compel individuals to request to have their data removed from the model. They might do so by invoking emerging “<em>right to be forgotten</em>” laws, e.g., the <a href="https://gdpr-info.eu/art-17-gdpr/">GDPR</a> in the EU or the <a href="https://ccpa-info.com/home/1798-105-consumers-right-to-deletion/">CCPA</a> in California. These laws enable individuals to request to have their personal data be deleted from online services such as Google search.</p>
<p style="text-align:center;">
<img src="https://bair.berkeley.edu/static/blog/lmmem/fig5.png" width="100%" />
<br />
</p>
<p>There is a <a href="https://arxiv.org/abs/2002.10635">legal grey area</a> as to how these regulations should apply to machine learning models. For example, <em>can users ask to have their data removed from a model’s training data?</em> Moreover, if such a request were granted, <em>must the model be retrained from scratch</em>? The fact that models can memorize and misuse an individual’s personal information certainly makes the case for data deletion and retraining more compelling.</p>
<h2 id="memorization-of-copyright-data">Memorization of Copyrighted Data</h2>
<p>Another type of content that the model memorizes is copyrighted text.</p>
<h4>Memorization of Books</h4>
<p>Our first example will actually come from GPT-3, a model 100 times larger than GPT-2. Our paper shows that larger language models memorize more, so we expect GPT-3 to memorize an even larger amount of data.</p>
<p>Below, we prompt GPT-3 with the beginning of chapter 3 of <em>Harry Potter and the Philosopher’s Stone</em>. <strong>The model correctly reproduces about one full page of the book</strong> (about 240 words) before making its first mistake.</p>
<blockquote>
<p><strong>The escape of the Brazilian boa constrictor earned Harry his longest-ever punishment. By the time he was allowed out of his cupboard again, the</strong>
summer holidays had started and Dudley had already broken his new video camera, crashed his remote-control aeroplane, and, first time out on his racing bike, knocked down old Mrs Figg as she crossed Privet Drive on her crutches.<br />
Harry was glad school was over, but there was no escaping Dudley’s gang, who visited the house every single day. Piers, Dennis, Malcolm, and Gordon were all big and stupid, but as Dudley was the biggest and stupidest of the lot, he was the leader. The rest of them were all quite happy to join in Dudley’s favourite sport: Harry Hunting.<br /><br />
This was why Harry spent as much time as possible out of the house, wandering around and thinking about the end of the holidays, where he could see a tiny ray of hope. When September came he would be going off to secondary school and, for the first time in his life, he wouldn’t be with Dudley. Dudley had been accepted at Uncle Vernon’s old private school, Smeltings. Piers Polkiss was going there too. Harry, on the other hand, was going to Stonewall High, the local public school. Dudley thought this was very funny.<br /><br />
‘They stuff people’s heads down the toilet the first day at Stonewall,’ he told Harry. ‘Want to come upstairs and practise?’<br /><br />
‘No, thanks,’ said Harry. ‘The poor toilet’s never had anything as horrible as your head down it — it might be sick.’<br /><br /></p>
</blockquote>
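<p>One rough, illustrative way to quantify “reproduces about one full page before its first mistake” is to count how many whole words of the model’s continuation match the reference before they diverge. The sketch below operates on plain strings (obtaining the continuation from a model is out of scope here, and this metric is an illustration, not necessarily the exact criterion used in the paper):</p>

```python
def verbatim_prefix_words(generated: str, reference: str) -> int:
    """Count the number of leading words that `generated` reproduces
    verbatim from `reference` before the first mismatch."""
    count = 0
    for g, r in zip(generated.split(), reference.split()):
        if g != r:
            break
        count += 1
    return count

# Toy usage: the hypothetical "model output" diverges after five words.
print(verbatim_prefix_words("It was the best of crimes",
                            "It was the best of times"))  # 5
```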
<h4>Memorization of Code</h4>
<p>Language models also memorize other types of copyrighted data such as source code. For example, GPT-2 can output 264 lines of code from the <a href="https://github.com/bitcoin/bitcoin/blob/d0a6353dec48f365c38de3c76b42f67eda737ed5/src/main.cpp#L3638">Bitcoin client</a> (with 6 minor mistakes). Below, we show one function that GPT-2 reproduces perfectly:</p>
<p style="text-align:center;">
<img src="https://bair.berkeley.edu/static/blog/lmmem/fig7.png" width="100%" />
<br />
</p>
<p><strong>We also found at least one example where GPT-2 can reliably output an entire file.</strong> The document in question is a configuration file for the game <a href="https://en.wikipedia.org/wiki/Dirty_Bomb_(video_game)">Dirty Bomb</a>. The file contents produced by GPT-2 seem to be memorized from an <a href="https://www.diffchecker.com/unplpvqu">online diff checker</a>. When prompted with the first two lines of the file, GPT-2 outputs the remaining 1446 lines verbatim (with a >99% character-level match).</p>
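<p>A character-level match score like the one quoted above can be computed with Python’s standard <code>difflib</code>. This is one plausible way to measure it, not necessarily the paper’s exact metric:</p>

```python
import difflib

def char_match_ratio(output: str, reference: str) -> float:
    """Character-level similarity in [0, 1]: twice the number of
    in-order matching characters divided by the combined length
    of both strings."""
    return difflib.SequenceMatcher(None, output, reference).ratio()

# One differing character out of twelve yields a high but sub-1.0 score.
print(char_match_ratio("hello world!", "hello world?"))
```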
<p>These are just a few of the many instances of copyrighted content that the model memorized from its training set. Furthermore, note that while books and source code typically have an explicit copyright license, the <em>vast majority</em> of Internet content is also automatically copyrighted under <a href="https://www.law.cornell.edu/uscode/text/17/102">US law</a>.</p>
<h4>Does Training Language Models Infringe on Copyright?</h4>
<p>Given that language models memorize and regurgitate copyrighted content, does that mean they constitute copyright infringement? The legality of training models on copyrighted data has been a subject of debate among legal scholars (see e.g., <a href="https://papers.ssrn.com/sol3/papers.cfm?abstract_id=3528447">Fair Learning</a>, <a href="https://ilr.law.uiowa.edu/print/volume-101-issue-2/copyright-for-literate-robots/">Copyright for Literate Robots</a>, <a href="https://papers.ssrn.com/sol3/papers.cfm?abstract_id=3032076">Artificial Intelligence’s Fair Use Crisis</a>), with arguments both in favor and against the characterization of machine learning as “fair use”.</p>
<p>The issue of data memorization certainly has a role to play in this debate. Indeed, in response to a <a href="https://www.uspto.gov/sites/default/files/documents/USPTO_AI-Report_2020-10-07.pdf">request-for-comments</a> from the US Patent Office, multiple parties argue in favor of characterizing machine learning as fair use, in part because machine learning models are assumed to <strong>not</strong> emit memorized data.</p>
<p>For example, the <a href="https://www.uspto.gov/sites/default/files/documents/Electronic%20Frontier%20Foundation_RFC-84-FR-58141.PDF">Electronic Frontier Foundation</a> writes:</p>
<blockquote>
<p><em>“the extent that a work is produced with a machine learning tool that was trained on a large number of copyrighted works, the degree of copying with respect to any given work is likely to be, at most, de minimis.”</em></p>
</blockquote>
<p>A similar argument is put forward by <a href="https://www.uspto.gov/sites/default/files/documents/OpenAI_RFC-84-FR-58141.pdf">OpenAI</a>:</p>
<blockquote>
<p><em>“Well-constructed AI systems generally do not regenerate, in any nontrivial portion, unaltered data from any particular work in their training corpus”</em></p>
</blockquote>
<p>Yet, as our work demonstrates, large language models certainly are able to produce large portions of memorized copyrighted data, including certain documents in their entirety.</p>
<p>Of course, the above parties’ defense of fair use does not hinge solely on the assumption that models do not memorize their training data, but our findings certainly seem to weaken this line of argument. Ultimately, the answer to this question might depend on the manner in which a language model’s outputs are used. For example, outputting a page from Harry Potter in a downstream creative-writing application points to a much clearer case of copyright infringement than the same content being spuriously output by a translation system.</p>
<h2 id="mitigations">Mitigations</h2>
<p>We’ve seen that large language models have a remarkable ability to memorize rare snippets of their training data, with a number of problematic consequences. So, how could we go about preventing such memorization from happening?</p>
<h4>Differential Privacy Probably Won’t Save the Day</h4>
<p>Differential privacy is a well-established formal notion of privacy that appears to be a natural solution to data memorization. In essence, training with differential privacy provides guarantees that a model will not leak any individual record from its training set.</p>
<p>Yet, it appears challenging to apply differential privacy in a principled and effective manner to prevent memorization of Web-scraped data. First, differential privacy does not prevent memorization of information that occurs across a large number of records. This is particularly problematic for copyrighted works, which might appear thousands of times across the Web.</p>
<p style="text-align:center;">
<img src="https://bair.berkeley.edu/static/blog/lmmem/fig6.png" width="100%" />
<br />
</p>
<p>Second, even if certain records only appear a few times in the training data (e.g., Peter’s personal data appears on a few pages), applying differential privacy in the most effective manner would require <em>aggregating</em> all these pages into a single record and providing per-user privacy guarantees for the aggregated records. It is unclear how to do this aggregation effectively at scale, especially since some webpages might contain personal information from many different individuals.</p>
<h4>Sanitizing the Web Is Hard Too</h4>
<p>An alternative mitigation strategy is to simply remove personal information, copyrighted data, and other problematic content from the training data. This too is difficult to apply effectively at scale. For example, we might want to automatically remove mentions of Peter W.’s personal data, but keep mentions of personal information that is considered “general knowledge”, e.g., the biography of a US president.</p>
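<p>To see why, consider a minimal regex-based scrubber (a hypothetical sketch, not a proposed solution). It can catch rigidly formatted identifiers such as email addresses and phone numbers, but names, street addresses, and contextual facts about a person have no such fixed structure, and deciding what counts as “general knowledge” requires context that no pattern can see:</p>

```python
import re

# Naive patterns for well-formatted identifiers only. Names, addresses,
# and free-form personal facts slip straight through.
EMAIL = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")
PHONE = re.compile(r"\b\d{3}[-.\s]\d{3}[-.\s]\d{4}\b")

def scrub(text: str) -> str:
    """Replace obviously formatted PII with placeholder tokens."""
    text = EMAIL.sub("[EMAIL]", text)
    text = PHONE.sub("[PHONE]", text)
    return text

print(scrub("mail jane@example.com or call 555-123-4567"))
```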
<h4>Curated Datasets as a Path Forward</h4>
<p>If neither differential privacy nor automated data sanitization is going to solve our problems, what are we left with?</p>
<p>Training language models on data from the open Web might be a fundamentally flawed approach. Given the numerous privacy and legal concerns that may arise from memorizing Internet text, in addition to the many <a href="https://science.sciencemag.org/content/356/6334/183">undesirable</a> <a href="https://arxiv.org/abs/1607.06520">biases</a> that Web-trained models perpetuate, the way forward might be better curation of the datasets used to train language models. We posit that if even a small fraction of the millions of dollars invested into training language models were instead put into collecting better training data, significant progress could be made to mitigate language models’ harmful side effects.</p>
<p>Check out the paper <a href="https://arxiv.org/abs/2012.07805">Extracting Training Data from Large Language Models</a> by Nicholas Carlini, Florian Tramèr, Eric Wallace, Matthew Jagielski, Ariel Herbert-Voss, Katherine Lee, Adam Roberts, Tom Brown, Dawn Song, Úlfar Erlingsson, Alina Oprea, and Colin Raffel.</p>
Sun, 20 Dec 2020 01:00:00 -0800
https://bairblog.github.io/2020/12/20/lmmem/
<h1>Offline Reinforcement Learning: How Conservative Algorithms Can Enable New Applications</h1>
<p>Deep reinforcement learning has made significant progress in the last few years, with success stories in <a href="https://arxiv.org/abs/1603.02199">robotic control</a>, <a href="https://arxiv.org/abs/1710.02298">game playing</a> and <a href="https://advances.sciencemag.org/content/4/7/eaap7885">science problems</a>. While RL methods present a general paradigm where an agent learns from its own interaction with an environment, this requirement for “active” data collection is also a major hindrance in the application of RL methods to real-world problems, since active data collection is often expensive and potentially unsafe. An alternative <strong>“data-driven”</strong> paradigm of RL, referred to as <strong>offline RL</strong> <strong>(or</strong> <strong>batch RL</strong><strong>)</strong> has recently regained popularity as a viable path towards effective real-world RL. As shown in the figure below, offline RL requires learning skills solely from previously collected datasets, without any active environment interaction. It provides a way to utilize previously collected datasets from a variety of sources, including human demonstrations, prior experiments, domain-specific solutions and even data from different but related problems, to build complex decision-making engines.</p>
<p><img src="https://paper-attachments.dropbox.com/s_1A8799ED5CDFE6A88275D1C17266132671FD624DC99A781FDC45B88F2D8141F8_1607330228005_tease.png" alt="" /></p>
<!--more-->
<p>Several recent papers <a href="https://arxiv.org/abs/1812.02900">[1]</a> <a href="https://arxiv.org/abs/1712.06924">[2]</a> <a href="https://arxiv.org/abs/1607.03842">[3]</a> <a href="https://arxiv.org/abs/2005.13239">[4]</a> <a href="https://arxiv.org/abs/2005.05951">[5]</a> <a href="https://arxiv.org/abs/2007.08202">[6]</a>, including our prior work <a href="https://arxiv.org/abs/1906.00949">[7]</a> <a href="https://arxiv.org/abs/1910.00177">[8]</a>, have discussed why offline RL is a challenging problem: it requires handling distributional shift, which, in conjunction with function approximation and sampling error, may make it impossible for standard RL methods <a href="https://arxiv.org/abs/2005.01643">[9]</a> <a href="https://arxiv.org/abs/2010.11895">[10]</a> to learn effectively from a static dataset alone. However, over the past year, a number of methods have been proposed to tackle this problem, and substantial progress has been made in the area, both in the development of new algorithms and in applications to real-world problems. In this blog post, we will discuss two of our works that advance the frontiers of offline RL: conservative Q-learning (<a href="https://arxiv.org/abs/2006.04779">CQL</a>), a simple and effective algorithm for offline RL, and <a href="https://arxiv.org/abs/2010.14500">COG</a>, a framework for robotic learning that leverages effective offline RL methods such as CQL to connect past data with recent experience, enabling a kind of “common sense” generalization when a robot must perform a task under a variety of new scenarios or initial conditions. The principles underlying the COG framework are not specific to robotics and can also be applied to other domains.</p>
<h1 id="cql-a-simple-and-effective-method-for-offline-rl">CQL: A Simple And Effective Method for Offline RL</h1>
<p>The primary challenge in offline RL is successfully handling <em>distributional shift</em>: learning effective skills requires deviating from the behavior in the dataset and making counterfactual predictions (i.e., answering “what-if” queries) about unseen outcomes. However, counterfactual predictions for decisions that deviate too much from the behavior in the dataset cannot be made reliably. By virtue of the standard update procedure in RL algorithms (for example, Q-learning queries the Q-function at out-of-distribution inputs when computing the bootstrapping target during training), standard off-policy deep RL algorithms tend to overestimate the values of such unseen outcomes (as shown in the figure below). The policy then deviates from the dataset in pursuit of an apparently promising outcome, only to fail as a result.</p>
<p style="text-align:center;">
<img src="https://paper-attachments.dropbox.com/s_4404BB4C16A00FE79B3562C0E462BFD3BB7C036DBA26A6D769F81657A2FC37D8_1607309600445_Screenshot+2020-12-06+at+6.53.09+PM.png" width="600" style="margin: 2px;" />
<br />
<i>
Figure 1: Overestimation of unseen, out-of-distribution outcomes when standard off-policy deep RL algorithms (e.g., SAC) are trained on offline datasets. Note that while the return of the policy is negative in all cases, the Q-function estimate, which is the algorithm’s belief about its own performance, is extremely high ($\sim 10^{10}$ in some cases).
</i>
</p>
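<p>The mechanism behind this overestimation can be illustrated with a toy calculation (an illustration, not the paper’s experiment): suppose the true value of every action is zero, and the learned Q-function differs from the truth only by independent zero-mean noise at out-of-distribution actions. The max over actions in the bootstrap target then systematically selects the upward errors:</p>

```python
import random

random.seed(0)

# Toy model of bootstrapping error: true Q-value of every action is 0,
# the learned Q-function is off by independent zero-mean Gaussian noise.
n_states, n_actions = 1000, 10
total = 0.0
for _ in range(n_states):
    q_hat = [random.gauss(0.0, 1.0) for _ in range(n_actions)]
    # The bootstrap target max_a Q(s', a) picks the largest noisy value,
    # i.e., the most over-estimated action.
    total += max(q_hat)

avg_target = total / n_states
print(round(avg_target, 1))  # well above the true value of 0
```

With more actions to maximize over, the bias grows; with online data collection it would be corrected by trying those actions, but offline it compounds through the Bellman backups.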
<h2 id="learning-conservative-q-functions">Learning Conservative Q-Functions</h2>
<p>A “safe” strategy when faced with such distributional shift is to be <em>conservative</em>: if we explicitly estimate the value of unseen outcomes conservatively (i.e., assign them a low value), then the estimated value or performance of a policy that executes unseen behaviors is guaranteed to be small. Using such conservative estimates for policy optimization prevents the policy from executing unseen actions, so it performs reliably. Conservative Q-learning (CQL) does exactly this: it learns a value function such that the estimated performance of the policy under this learned value function lower-bounds its true value. As shown in the figure below, this lower-bound property ensures that no unseen outcome is overestimated, preventing the primary issue with offline RL.</p>
<p style="text-align:center;">
<img src="https://paper-attachments.dropbox.com/s_4404BB4C16A00FE79B3562C0E462BFD3BB7C036DBA26A6D769F81657A2FC37D8_1607312651028_Screenshot+2020-12-06+at+7.44.04+PM.png" width="800" style="margin: 2px;" />
<br />
<i>
Figure 2: Naïve Q-function training can lead to overestimation of unseen actions (i.e., actions not in support) which can make low-return behavior falsely appear promising. By underestimating the Q-value function for unseen actions at a state, CQL ensures that values of unseen behaviors are not overestimated, giving rise to the lower-bound property.
</i>
</p>
<p>To obtain this lower-bound on the actual Q-value function of the policy, CQL trains the Q-function using a sum of two objectives — standard TD error and a regularizer that minimizes Q-values on unseen actions with overestimated values while simultaneously maximizing the expected Q-value on the dataset:</p>
<p><img src="https://paper-attachments.dropbox.com/s_4404BB4C16A00FE79B3562C0E462BFD3BB7C036DBA26A6D769F81657A2FC37D8_1607326189705_Screenshot+2020-12-06+at+11.29.44+PM.png" alt="" /></p>
<p>We can then guarantee that the return-estimate of the learned policy $\pi$ under $Q^\pi_{\text{CQL}}$ is a lower-bound on the actual policy performance:</p>
<p style="text-align:center;">
<img src="https://paper-attachments.dropbox.com/s_4404BB4C16A00FE79B3562C0E462BFD3BB7C036DBA26A6D769F81657A2FC37D8_1607319374697_image.png" width="725" style="margin: 2px;" />
<br />
</p>
<p>This means that, by adding a simple regularizer during training, we can obtain non-overestimating Q-functions and use them for policy optimization. The regularizer can be estimated using samples in the dataset, so there is no need for the explicit behavior policy estimation required by previous works <a href="https://arxiv.org/abs/1906.00949">[11]</a> <a href="https://arxiv.org/abs/1911.11361">[12]</a> <a href="https://arxiv.org/abs/1812.02900">[13]</a>. Behavior policy estimation not only requires additional machinery, but the estimation errors it induces (for example, when the data distribution is hard to model) can hurt the downstream offline RL that uses the estimate [<a href="https://arxiv.org/abs/2006.09359">Nair et al. 2020</a>, <a href="https://arxiv.org/abs/2007.11091">Ghasemipour et al. 2020</a>]. In addition, a broad family of algorithmic instantiations of CQL can be derived by tweaking the form of the regularizer, provided that it still prevents overestimation on unseen actions.</p>
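<p>For discrete actions, one well-known instantiation of this regularizer (the “CQL(H)” variant from the paper) replaces the minimization over unseen actions with a log-sum-exp over all actions. A minimal per-state sketch of that penalty, not the authors’ implementation:</p>

```python
import math

def cql_penalty(q_values, data_action, alpha=1.0):
    """CQL(H)-style penalty for a single state with discrete actions:
    alpha * (logsumexp_a Q(s, a) - Q(s, a_data)).
    The log-sum-exp softly pushes down Q on all actions, weighted toward
    the largest (most over-estimated) ones, while the second term pushes
    the Q-value of the action actually seen in the dataset back up."""
    lse = math.log(sum(math.exp(q) for q in q_values))
    return alpha * (lse - q_values[data_action])

# An over-estimated unseen action (Q = 10) incurs a large penalty...
print(cql_penalty([0.0, 10.0, 0.0], data_action=0))
# ...while a Q-function already favoring the dataset action barely does.
print(cql_penalty([5.0, 0.0, 0.0], data_action=0))
```

In practice this penalty is added to the standard TD error, as in the objective above, and estimated from minibatches of dataset states.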
<p style="text-align:center;">
<img src="https://paper-attachments.dropbox.com/s_4404BB4C16A00FE79B3562C0E462BFD3BB7C036DBA26A6D769F81657A2FC37D8_1607314350204_Screenshot+2020-12-06+at+8.12.25+PM.png" width="800" style="margin: 2px;" />
<br />
<i>
Figure 3: The only change introduced in CQL is a modified training objective for the Q-function as highlighted above. This makes it simple to use CQL directly on top of any standard deep Q-learning or actor-critic implementations.
</i>
</p>
<p>Once a conservative estimate of the policy value $Q^\pi_{\text{CQL}}$ is obtained, CQL simply plugs this estimate into an actor-critic or Q-learning method, as shown above, and updates $\pi$ towards maximizing the conservative Q-function.</p>
<h2 id="so-how-well-does-cql-perform">So, how well does CQL perform?</h2>
<p>We evaluate CQL on a number of domains including image-based <a href="https://arxiv.org/abs/1207.4708">Atari games</a> and also several tasks from the <a href="https://github.com/rail-berkeley/d4rl">D4RL benchmark</a>. Here we present results on the Ant Maze domain from the D4RL benchmark. The goal in these tasks is to navigate the ant from a start state to a goal state. The offline dataset consists of random motions of the ant, but no single trajectory that solves the task. Any successful algorithm needs to “stitch” together different sub-trajectories to achieve success. While prior methods (BC, <a href="https://arxiv.org/abs/1801.01290">SAC</a>, <a href="https://arxiv.org/abs/1812.02900">BCQ</a>, <a href="https://arxiv.org/abs/1906.00949">BEAR</a>, <a href="https://arxiv.org/abs/1911.11361">BRAC</a>, <a href="https://arxiv.org/abs/1910.00177">AWR</a>, <a href="https://arxiv.org/abs/1912.02074">AlgaeDICE</a>) perform reasonably in the easy U-maze, they are unable to stitch trajectories in the harder mazes. In fact, CQL is the only algorithm to make non-trivial progress and obtains <strong>>50%</strong> and <strong>>14%</strong> success rates on medium and large mazes. This is because constraining the learned policy to the dataset explicitly as done in prior methods tends to be overly conservative: we need not constrain actions to the data if unseen actions have low learned Q-values. Since CQL imposes a “value-aware” regularizer, it avoids this over-conservatism.</p>
<p style="text-align:center;">
<img src="https://paper-attachments.dropbox.com/s_4404BB4C16A00FE79B3562C0E462BFD3BB7C036DBA26A6D769F81657A2FC37D8_1607319277151_Screenshot+2020-12-06+at+9.34.27+PM.png" width="800" style="margin: 2px;" />
<br />
<i>
Figure 4: Performance of CQL and other offline RL algorithms measured in terms of success rate (range [0, 100]) on the ant-maze navigation task from D4RL. Observe that CQL outperforms prior methods on the harder maze domains by non-trivial margins.
</i>
</p>
<p>On image-based Atari games, we observe that CQL outperforms prior methods (<a href="https://arxiv.org/abs/1710.10044">QR-DQN</a>, <a href="https://offline-rl.github.io/">REM</a>), in some cases by huge margins: by a factor of <strong>5x</strong> and <strong>36x</strong> on Breakout and Q$^*$bert, respectively. This indicates that CQL is a promising algorithm for both continuous-control and discrete-action tasks, and that it works not just from low-dimensional states, but also from raw image observations.</p>
<p style="text-align:center;">
<img src="https://paper-attachments.dropbox.com/s_4404BB4C16A00FE79B3562C0E462BFD3BB7C036DBA26A6D769F81657A2FC37D8_1607319801879_Screenshot+2020-12-06+at+9.43.17+PM.png" width="480" style="margin: 2px;" />
<br />
<i>
Figure 5: Performance of CQL on five Atari games. Note that CQL outperforms prior methods: QR-DQN and REM that have been applied in this setting by 36x on Q*bert and 5x on Breakout.
</i>
</p>
<h1 id="what-new-capabilities-can-effective-offline-rl-methods-enable">What new capabilities can effective offline RL methods enable?</h1>
<p>Most advances in offline RL have been evaluated on standard RL benchmarks (including CQL, as discussed above), but are these algorithms ready to tackle the kind of real-world problems that motivate research in offline RL in the first place? One important ability that offline RL promises over other approaches for decision-making is the ability to ingest large, diverse datasets and produce solutions that generalize broadly to new scenarios. For example, policies that are effective at recommending videos to a <em>new</em> user or policies that can execute robotic tasks in <em>new</em> scenarios. The ability to generalize is essential in almost any machine learning system that we might build, but typical RL benchmark tasks do not test this property. We take a step towards addressing this issue and show that simple, domain-agnostic principles applied on top of effective data-driven offline RL methods can be highly effective in enabling <em>“<strong>common-sense</strong>”</em> generalization in AI systems.</p>
<h1 id="cog-learning-skills-that-generalize-via-offline-rl">COG: Learning Skills That Generalize via Offline RL</h1>
<p>COG is an algorithmic framework for utilizing large, unlabeled datasets of diverse behavior to learn generalizable policies via offline RL. As a motivating example, consider a robot that has been trained to take an object out of an open drawer (shown below). This robot is likely to fail when placed in a scene where the drawer is instead closed, since it has not seen this scenario (or initial condition) before.</p>
<p style="text-align:center;">
<img src="https://paper-attachments.dropbox.com/s_6BD0395EE3D344A76D8A0841F64E4B1971F5F3F28F4F365C727D97F534CDB53D_1606775242428_grasp_open_drawer.gif" width="330" style="margin: 2px;" />
<img src="https://paper-attachments.dropbox.com/s_6BD0395EE3D344A76D8A0841F64E4B1971F5F3F28F4F365C727D97F534CDB53D_1607289273270_fail_2.gif" width="330" style="margin: 2px;" />
<br />
<i>
Figure 6: Left: We see a robot that has learned how to take an object out of an open drawer. Right: However, the same robot fails to perform the task if the drawer is closed at the beginning of the episode.
</i>
</p>
<p>However, we would like to enable our learned policy to execute the task from as many different initial conditions as possible. A simple new condition might consist of a closed drawer, while more complicated new conditions in which the drawer is blocked by an object, or by another drawer are also possible. Can we learn policies that can perform tasks from varied initial conditions?</p>
<p style="text-align:center;">
<img src="https://paper-attachments.dropbox.com/s_6BD0395EE3D344A76D8A0841F64E4B1971F5F3F28F4F365C727D97F534CDB53D_1606781738830_grasp_closed_drawer.gif" width="250" style="margin: 2px;" />
<img src="https://paper-attachments.dropbox.com/s_6BD0395EE3D344A76D8A0841F64E4B1971F5F3F28F4F365C727D97F534CDB53D_1606781718828_blocking_object_grasp.gif" width="250" style="margin: 2px;" />
<img src="https://paper-attachments.dropbox.com/s_6BD0395EE3D344A76D8A0841F64E4B1971F5F3F28F4F365C727D97F534CDB53D_1606781704898_blocked_by_drawer_grasp.gif" width="250" style="margin: 2px;" />
<br />
<i>
Figure 7: From L to R: closed drawer, drawer blocked by an object, drawer blocked by another drawer.
</i>
</p>
<h2 id="formalizing-the-setting-leverage-task-agnostic-past-experience">Formalizing the Setting: Leverage Task-Agnostic Past Experience</h2>
<p>Similar to real-world scenarios where large unlabeled datasets are available alongside limited task-specific data, our agent is provided with two types of datasets. The task-specific dataset consists of behavior relevant for the task, but the prior dataset can consist of a number of random or scripted behaviors being executed in the same environment/setting. If a subset of this prior dataset is useful for extending our skill (shown in blue below), we can leverage it for learning a policy that can solve the task from new initial conditions. Note that not all prior data has to be useful for the downstream task (shown in red below), and we don’t need this prior dataset to have any explicit labels or rewards either. Our goal is to utilize both prior data and task-specific data to learn a policy that can execute the task from initial conditions that were unseen in the task data.</p>
<p style="text-align:center;">
<img src="https://paper-attachments.dropbox.com/s_6BD0395EE3D344A76D8A0841F64E4B1971F5F3F28F4F365C727D97F534CDB53D_1606785765415_prior_data_gif__1_720p_cropped_test.gif" width="800" style="margin: 2px;" />
<br />
<i>
Figure 8: COG utilizes prior data to learn a policy that can solve the task from initial conditions that were unseen in the task data, as long as a subset of the prior data contains behavior that helps extend the skill (shown in blue). Note that not all prior data needs to be in support of the downstream skill (shown in red), and we don’t need any reward labels for this dataset either.
</i>
</p>
<h2 id="connecting-skills-via-offline-rl">Connecting skills via Offline RL</h2>
<p style="text-align:center;">
<img src="https://paper-attachments.dropbox.com/s_6BD0395EE3D344A76D8A0841F64E4B1971F5F3F28F4F365C727D97F534CDB53D_1606786802731_corl2020_bellman_backup_arxiv_v2.png" width="800" style="margin: 2px;" />
<br />
<i>
Figure 9: The black arrows denote the dynamics of the MDP. The green arrows denote the propagation of Q-values from high reward states to states that are further back from the goal.
</i>
</p>
<p>We start by running offline Q-learning (CQL) on the task data, which allows Q-values to propagate from high-reward states to states that are further back from the goal. We then add the prior dataset to the training buffer, assigning all of its transitions a zero reward. Further (offline) dynamic programming on this expanded dataset allows Q-values to propagate to initial conditions that were unseen in the task data, giving rise to policies that succeed from new initial conditions. Note that no single trajectory in our dataset solves the entire task from these new starting conditions, but offline Q-learning allows us to “stitch” together relevant sub-trajectories from the prior and task data, without any additional supervision. We found that effective offline RL methods (e.g., CQL) are essential for obtaining good performance; prior off-policy or offline methods (e.g., <a href="https://arxiv.org/abs/1906.00949">BEAR</a>, <a href="https://arxiv.org/abs/1910.00177">AWR</a>) did not perform well on these tasks. Rollouts from our learned policy for the drawer grasping task are shown below. Our method is able to stitch together several behaviors to solve the downstream task. For example, in the second video below, the policy is able to pick up a blocking object, put it away, open the drawer, and take an object out. Note that the agent performs this task from image observations (shown in the top right corner) and receives a +1 reward only after it finishes the final step (rewards are zero everywhere else).</p>
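<p>The stitching effect can be reproduced in a toy tabular setting (a deterministic chain with value iteration standing in for Q-learning, not the image-based robot setup): task data reaches the goal only from one state, prior data only connects the earlier states with zero reward, yet offline dynamic programming over the union of both datasets propagates value all the way back:</p>

```python
# Deterministic 4-state chain: 0 -> 1 -> 2 -> 3 (goal).
# Task data reaches the goal only from state 2; prior data covers the
# earlier transitions but carries zero reward, as in COG.
gamma = 0.9
task_data = [(2, 3, 1.0)]                 # (s, s', reward)
prior_data = [(0, 1, 0.0), (1, 2, 0.0)]  # task-agnostic, zero reward

V = {s: 0.0 for s in range(4)}
for _ in range(10):  # offline value iteration over all logged transitions
    for s, s_next, r in task_data + prior_data:
        V[s] = r + gamma * V[s_next]

# Value has propagated back to state 0 even though no single logged
# trajectory goes from state 0 all the way to the goal.
print(round(V[0], 2))  # 0.81 = gamma^2 * 1.0
```

Without the prior data (or with the prior data but a non-conservative backup over image observations), state 0 would retain its initial value of zero and the policy would have no signal to act on from the new initial condition.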
<p style="text-align:center;">
<img src="https://paper-attachments.dropbox.com/s_6BD0395EE3D344A76D8A0841F64E4B1971F5F3F28F4F365C727D97F534CDB53D_1606781738830_grasp_closed_drawer.gif" width="250" style="margin: 2px;" />
<img src="https://paper-attachments.dropbox.com/s_6BD0395EE3D344A76D8A0841F64E4B1971F5F3F28F4F365C727D97F534CDB53D_1606781718828_blocking_object_grasp.gif" width="250" style="margin: 2px;" />
<img src="https://paper-attachments.dropbox.com/s_6BD0395EE3D344A76D8A0841F64E4B1971F5F3F28F4F365C727D97F534CDB53D_1606781704898_blocked_by_drawer_grasp.gif" width="250" style="margin: 2px;" />
<br />
<i>
Figure 10: The performance of our learned policy for novel initial conditions.
</i>
</p>
<h2 id="a-real-robot-result">A Real Robot Result</h2>
<p>We also evaluate our method on a real robot, where we see that our learned policy is able to open a drawer and take an object out, even though it never saw a single trajectory executing the entire task during training. Our method succeeds on <strong>7 out of 8</strong> trials, while our strongest baseline based on behavior cloning was unable to solve the task even for a single trial. Here are some example rollouts from our learned policy.</p>
<p style="text-align:center;">
<img src="https://paper-attachments.dropbox.com/s_6BD0395EE3D344A76D8A0841F64E4B1971F5F3F28F4F365C727D97F534CDB53D_1606788468558_unnamed+5.gif" width="250" style="margin: 2px;" />
<img src="https://paper-attachments.dropbox.com/s_6BD0395EE3D344A76D8A0841F64E4B1971F5F3F28F4F365C727D97F534CDB53D_1606788436610_unnamed+4.gif" width="250" style="margin: 2px;" />
<img src="https://paper-attachments.dropbox.com/s_6BD0395EE3D344A76D8A0841F64E4B1971F5F3F28F4F365C727D97F534CDB53D_1606788411198_unnamed+3.gif" width="250" style="margin: 2px;" />
<br />
</p>
<h1 id="discussion-future-work-and-takeaways">Discussion, Future Work and Takeaways</h1>
<p>In the past year, we have taken steps towards developing offline RL algorithms that can better handle real-world complexities such as multi-modal data distributions, raw image observations, and diverse, task-agnostic prior datasets. However, several challenging problems remain open. Like supervised learning methods, offline RL algorithms can also “overfit” as a result of excessive training on the dataset. The nature of this “overfitting” is complex: it can manifest as both overly conservative and overly optimistic solutions. In a number of cases, this “overfitting” phenomenon gives rise to poorly-conditioned neural networks (e.g., networks that <a href="https://arxiv.org/abs/2010.14498">over-alias predictions</a>), and an exact understanding of this phenomenon is currently missing. Thus, one interesting avenue for future work is to devise model-selection methods that can be used for policy checkpoint selection or early stopping, thereby mitigating this issue. Another avenue is to understand the origins of this “overfitting” issue and use those insights to directly improve the stability of offline RL algorithms.</p>
<p>Finally, as we gradually move towards real-world settings, related areas such as self-supervised learning, representation learning, transfer learning, and meta-learning will be essential to apply in conjunction with offline RL algorithms, especially in settings with limited data. This naturally motivates several theoretical and empirical questions: Which representation learning schemes are optimal for offline RL methods? How well do offline RL methods work when using reward functions learned from data? What constitutes a set of tasks that is amenable to transfer in offline RL? We eagerly look forward to progress in this area over the coming year.</p>
<hr />
<p>We thank Sergey Levine, George Tucker, Glen Berseth, Marvin Zhang, Dhruv Shah and Gaoyoue Zhou for their valuable feedback on earlier versions of this post.</p>
<p>This blog post is based on two papers appearing at the NeurIPS conference and workshops this year. We invite you to come and discuss these topics with us at NeurIPS.</p>
<ul>
<li>
<p><strong>Conservative Q-Learning for Offline Reinforcement Learning</strong><br />
<strong>Aviral Kumar</strong>, Aurick Zhou, George Tucker, Sergey Levine.<br />
In Advances in Neural Information Processing Systems (NeurIPS), 2020.<br />
<a href="https://arxiv.org/abs/2006.04779">[paper]</a> <a href="https://github.com/aviralkumar2907/CQL">[code]</a> <a href="https://slideslive.com/38936555/conservative-qlearning-for-offline-reinforcement-learning?ref=speaker-19077-latest">[video]</a> <a href="https://sites.google.com/view/cql-offline-rl">[project page]</a></p>
</li>
<li>
<p><strong>COG: Connecting New Skills to Past Experience with Offline Reinforcement Learning</strong><br />
<strong>Avi Singh</strong>, Albert Yu, Jonathan Yang, Jesse Zhang, <strong>Aviral Kumar</strong>, Sergey Levine.<br />
In Conference on Robotic Learning (CoRL) 2020.<br />
Contributed Talk at the Offline RL Workshop, NeurIPS 2020.<br />
<a href="https://arxiv.org/abs/2010.14500">[paper]</a> <a href="https://github.com/avisingh599/cog">[code]</a> <a href="https://www.youtube.com/watch?v=6sb31PtpI_s&feature=youtu.be&ab_channel=AviSingh">[video]</a> <a href="https://sites.google.com/view/cog-rl">[project page]</a></p>
</li>
</ul>
Mon, 07 Dec 2020 01:00:00 -0800
https://bairblog.github.io/2020/12/07/offline/
https://bairblog.github.io/2020/12/07/offline/
Learning State Abstractions for Long-Horizon Planning
<meta name="twitter:title" content="Learning State Abstractions for Long-Horizon Planning" />
<meta name="twitter:card" content="summary_image" />
<meta name="twitter:image" content="https://bair.berkeley.edu/static/blog/sgm/fig1.png" />
<p>Many tasks that we do on a regular basis, such as navigating a city, cooking a
meal, or loading a dishwasher, require planning over extended periods of time.
Accomplishing these tasks may seem simple to us; however, reasoning over long
time horizons remains a major challenge for today’s Reinforcement Learning (RL)
algorithms. While unable to plan over long horizons, deep RL algorithms excel
at learning policies for short-horizon tasks, such as robotic grasping,
directly from pixels. At the same time, classical planning methods such as
Dijkstra’s algorithm and A$^*$ search can plan over long time horizons, but
they require hand-specified or task-specific abstract representations of the
environment as input.</p>
<p>To achieve the best of both worlds, state-of-the-art visual navigation methods
have applied classical search methods to learned graphs. In particular, SPTM [2]
and SoRB [3] use a replay buffer of observations as nodes in a graph and learn
a parametric distance function to draw edges in the graph. These methods have
been successfully applied to long-horizon simulated navigation tasks that were
too challenging for previous methods to solve.</p>
<!--more-->
<p>Nevertheless, these methods are still limited because they are highly sensitive
to errors in the learned graph. Even a single faulty edge acts like a wormhole
in the graph topology that planning algorithms try to exploit, which makes
existing methods that combine graph search and RL extremely brittle. For
example, if an artificial agent navigating a maze thinks that two observations
on either side of a wall are nearby, its plans will involve transitions that
collide into the wall. Adopting a simple model that assumes a constant
probability $p$ of each edge being faulty, we see that the expected number of
faulty edges is $p|E| = O(|V|^2)$. In other words, <em>errors in the graph scale
quadratically with the number of nodes in the graph</em>.</p>
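<p>As a quick numerical check of this quadratic scaling (the node counts and error rate below are arbitrary):</p>

```python
# Expected number of faulty edges under a constant per-edge error rate p,
# for a complete graph where |E| = |V|(|V| - 1) / 2.
def expected_faulty_edges(num_nodes, p):
    num_edges = num_nodes * (num_nodes - 1) // 2
    return p * num_edges

# Doubling the number of nodes roughly quadruples the expected error count.
small = expected_faulty_edges(100, 0.01)   # 49.5 expected faulty edges
large = expected_faulty_edges(200, 0.01)   # 199.0 expected faulty edges
```

<p>Even a 1% per-edge error rate already produces dozens of wormhole edges in a modest graph, which is why sparsification matters.</p>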
<p style="text-align:center;">
<img src="https://bair.berkeley.edu/static/blog/sgm/fig1.png" width="70%" />
<br />
</p>
<p>We could do a lot better if we could minimize the errors in the graph. But how?
Graphs over observations in both simulated and real-world environments can be
prohibitively large, making it challenging to even identify which edges are
faulty. To minimize errors in the graph, then, we desire sparsity; we want to
keep a minimal set of nodes that is sufficient for planning. If we have a way
to aggregate similar observations into a single node in the graph, we can
reduce the number of errors and improve the accuracy of our plans. The key
challenge is to aggregate observations in a way that respects temporal
constraints. If observations are similar in appearance but actually far away,
then they should be aggregated into different nodes.</p>
<p style="text-align:center;">
<img src="https://bair.berkeley.edu/static/blog/sgm/fig2.png" width="70%" />
<br />
</p>
<p>So how can we sparsify our graph while guaranteeing that the graph remains
useful for planning? Our key insight is a novel merging criterion called
<em>two-way consistency</em>. Two-way consistency can be viewed as a generalization of
value irrelevance to the goal-conditioned setting. Intuitively, two-way consistency
merges nodes (i) that can be interchanged as starting states and (ii) that can be
interchanged as goal states.</p>
<p style="text-align:center;">
<img src="https://bair.berkeley.edu/static/blog/sgm/fig3.png" width="70%" />
<br />
</p>
<p>For an example of two-way consistency, consider the above figure. Suppose
during our node merging procedure we ask: can we merge the nodes with pink and
orange bottles according to two-way consistency? First, we note that moving
from the blue bottle to the pink bottle requires roughly the same work as
moving from the blue bottle to the orange bottle. So the nodes with pink and
orange bottles satisfy criterion (ii) because they can be interchanged as goal
states. However, while it is possible to start from the pink bottle and move to
the blue bottle, if we instead start at the orange bottle, the orange bottle
will fall to the floor and crash! So the nodes with pink and orange bottles
fail criterion (i) because they cannot be interchanged as starting states.</p>
<p>In practice, we can’t expect to encounter two nodes that can be perfectly
interchanged. Instead, we merge nodes that can be interchanged up to a
threshold parameter $\tau$. By increasing $\tau$, we can make the resulting
graph as sparse as we’d like. Crucially, <em>we prove in the paper that merging
according to two-way consistency preserves the graph’s quality up to an error
term that scales only linearly with the merging threshold $\tau$</em>.</p>
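<p>A minimal sketch of merging under two-way consistency, assuming a learned (and here simply tabulated) directed distance estimate <code>d[(a, b)]</code> between observations; the names and the greedy single-pass merging strategy are illustrative, not the paper’s exact algorithm:</p>

```python
def two_way_consistent(i, j, nodes, d, tau):
    """Nodes i and j may be merged if, relative to every other node s, they
    are interchangeable as start states (i) and as goal states (ii), up to tau."""
    others = [s for s in nodes if s not in (i, j)]
    starts_ok = all(abs(d[(i, s)] - d[(j, s)]) <= tau for s in others)  # (i)
    goals_ok = all(abs(d[(s, i)] - d[(s, j)]) <= tau for s in others)   # (ii)
    return starts_ok and goals_ok

def sparsify(nodes, d, tau):
    """Single greedy pass: drop each node that is two-way consistent
    with a node we have already kept."""
    kept = []
    for n in nodes:
        if not any(two_way_consistent(k, n, nodes, d, tau) for k in kept):
            kept.append(n)
    return kept

# Toy distance table: A and B are nearly interchangeable; C is not.
d = {
    ('A', 'B'): 0.10, ('B', 'A'): 0.10,
    ('A', 'C'): 1.00, ('C', 'A'): 1.00,
    ('B', 'C'): 1.05, ('C', 'B'): 1.02,
}
merged = sparsify(['A', 'B', 'C'], d, tau=0.1)   # B is merged into A
```

<p>Increasing <code>tau</code> merges more aggressively and yields a sparser graph, at the cost of the linear error term mentioned above.</p>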
<p>Our motivation for sparsity, discussed above, is robustness: we expect smaller
graphs to have fewer errors. Furthermore, our main theorem tells us that we can
merge nodes according to two-way consistency while preserving the graph’s
quality. Experimentally, though, are the resulting sparse graphs more robust?</p>
<p>To test the robustness of Sparse Graphical Memory to errors in learned distance
metrics, we thinned the walls in the PointEnv mazes of [3]. While PointEnv is a
simple environment with $(x, y)$ coordinate observations, thinning the walls is
a major challenge for parametric distance functions; any error in the learned
distance function will cause faulty edges across the walls that destroy the
feasibility of plans. For this reason, simply thinning the maze walls is enough
to break the previous state-of-the-art [3], resulting in a 0% success rate.</p>
<p>How does Sparse Graphical Memory fare? With many fewer edges, it becomes
tractable to perform self-supervised cleanup: the agent can step through the
environment to detect and remove faulty edges from its graph. The below figure
illustrates the results of this process. While the dense graph shown in red has
many faulty edges, sparsity and self-supervised cleanup, shown in green,
overcome errors in the learned distance metric, leading to a 100% success rate.</p>
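<p>The self-supervised cleanup step can be sketched as follows; the <code>can_traverse</code> environment check is a hypothetical stand-in for actually executing the low-level policy along an edge:</p>

```python
def self_supervised_cleanup(edges, can_traverse):
    """Attempt each edge in the environment and keep only the ones the agent
    can actually traverse; this is tractable once the graph is sparse."""
    return [(u, v) for (u, v) in edges if can_traverse(u, v)]

# Toy example: a wormhole edge through a wall fails the traversal check.
edges = [('start', 'hallway'), ('hallway', 'goal'), ('start', 'goal')]
reachable = {('start', 'hallway'), ('hallway', 'goal')}
cleaned = self_supervised_cleanup(edges, lambda u, v: (u, v) in reachable)
```

<p>On a dense graph this per-edge verification would be prohibitively expensive, which is exactly why sparsity makes cleanup practical.</p>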
<p style="text-align:center;">
<img src="https://bair.berkeley.edu/static/blog/sgm/fig4.png" width="100%" />
<br />
</p>
<p>We see a similar trend in experiments with visual input. In both ViZDoom [4]
and SafetyGym [5] – maze navigation tasks that require planning from raw
images – Sparse Graphical Memory consistently improves the success of baseline
methods including SoRB [3] and SPTM [2].</p>
<p>In addition to containing fewer errors, Sparse Graphical Memory also results in
more optimal plans. On a ViZDoom maze navigation task [4], we find that SGM
requires significantly fewer steps to reach the final goal across easy, medium,
and hard maze tasks, meaning that the agent follows a shorter path to the final
destination.</p>
<p style="text-align:center;">
<img src="https://bair.berkeley.edu/static/blog/sgm/fig5.png" width="50%" />
<br />
</p>
<p>Overall, we found that state aggregation with two-way consistency resulted in
substantially more robust plans over the prior state-of-the-art. While
promising, many open questions and challenges remain for combining classical
planning with learning-based control. Some of the questions we’re thinking
about are: How can we extend these methods beyond navigation to manipulation
domains? As the world is not static, how should we build graphs over changing
environments? How can two-way consistency be utilized beyond the scope of
graphical-based planning methods? We are excited about these future directions
and hope our theoretical and experimental findings prove useful to other
researchers investigating control over extended time horizons.</p>
<p><strong>References</strong></p>
<font size="-1">
</font>
<ol><font size="-1">
<li>Emmons*, Jain*, Laskin* et al. <a href="https://arxiv.org/abs/2003.06417">Sparse Graphical Memory for Robust Planning</a>. NeurIPS 2020.</li>
<li>Savinov et al. <a href="https://arxiv.org/abs/1803.00653">Semi-parametric Topological Memory for Navigation</a>. ICLR 2018.</li>
<li>Eysenbach et al. <a href="https://arxiv.org/abs/1906.05253">Search on the Replay Buffer: Bridging Planning and Reinforcement Learning</a>. NeurIPS 2019.</li>
<li>Wydmuch et al. <a href="https://arxiv.org/abs/1809.03470">ViZDoom Competitions: Playing Doom from Pixels</a>. IEEE Transactions on Games, 2018.</li>
<li>Ray et al. <a href="https://cdn.openai.com/safexp-short.pdf">Benchmarking Safe Exploration in Deep Reinforcement Learning</a>. Preprint, 2019.</li>
</font></ol>
<font size="-1">
</font>
Fri, 20 Nov 2020 01:00:00 -0800
https://bairblog.github.io/2020/11/20/sgm/
https://bairblog.github.io/2020/11/20/sgm/
EvolveGraph: Dynamic Neural Relational Reasoning for Interacting Systems
<meta name="twitter:title" content="EvolveGraph: Dynamic Neural Relational Reasoning for Interacting Systems" />
<meta name="twitter:card" content="summary_image" />
<meta name="twitter:image" content="https://bair.berkeley.edu/static/blog/evolvegraph/figure2.png" />
<p>Multi-agent interacting systems are prevalent in the world, from purely physical systems to complicated social dynamic systems. The interactions between entities/components can give rise to very complex behavior patterns at the level of both individuals and the multi-agent system as a whole. Since usually only the trajectories of individual entities are observed, without any knowledge of the underlying interaction patterns, and since each agent usually has multiple possible future modalities with uncertainty, it is challenging to model their dynamics and forecast their future behaviors.</p>
<p style="text-align:center;">
<img src="https://bair.berkeley.edu/static/blog/evolvegraph/figure1.png" width="80%" />
<br />
<i>Figure 1. Typical multi-agent interacting systems.</i>
</p>
<p>In many real-world applications (e.g. autonomous vehicles, mobile robots), an effective understanding of the situation and accurate trajectory prediction of interactive agents play a significant role in downstream tasks, such as decision making and planning. We introduce a generic trajectory forecasting framework (named EvolveGraph) with explicit relational structure recognition and prediction via latent interaction graphs among multiple heterogeneous, interactive agents. Considering the uncertainty of future behaviors, the model is designed to provide multi-modal prediction hypotheses. Since the underlying interactions may evolve even with abrupt changes over time, and different modalities of evolution may lead to different outcomes, we address the necessity of dynamic relational reasoning and adaptively evolving the interaction graphs.</p>
<!--more-->
<h2 id="challenges-of-multi-agent-behavior-prediction">Challenges of Multi-Agent Behavior Prediction</h2>
<p style="text-align:center;">
<img src="https://bair.berkeley.edu/static/blog/evolvegraph/figure2.png" width="80%" />
<br />
<i>Figure 2. An illustration of a typical urban intersection scenario.</i>
</p>
<p>We use an urban intersection scenario with multiple interacting traffic participants as an illustrative example to elaborate on the major challenges of the multi-agent behavior prediction task.</p>
<ul>
<li>
<p>First, there may be heterogeneous agents that have distinct behavior patterns, so a homogeneous dynamics/behavior model may not be sufficient. For example, there are different constraints and traffic rules for vehicles and pedestrians. More specifically, vehicle trajectories are strictly constrained by road geometry and their own kinematic models, while pedestrian behaviors are much more flexible.</p>
</li>
<li>
<p>Second, there may be various types of interaction patterns in a multi-agent system. For example, the inter-vehicle interaction, inter-pedestrian interaction, and vehicle-pedestrian interaction in the same scenario present very different patterns.</p>
</li>
<li>
<p>Third, the interaction patterns may evolve over time as the situation changes. For example, when a vehicle is going straight, it only needs to consider the behavior of the leading vehicle; however, when the vehicle plans to change lanes, the vehicles in the target lane must also be taken into account, which leads to a change in the interaction patterns.</p>
</li>
<li>
<p>Last but not least, there may be uncertainties and multi-modalities in the future behaviors of each agent, which leads to various outcomes. For example, in an intersection, the vehicle may either go straight or take a turn.</p>
</li>
</ul>
<p>In this work, we took a step forward to handle these challenges and provided a generic framework for trajectory prediction with dynamic relational reasoning for multi-agent systems. More specifically, we address the problem of</p>
<ul>
<li>extracting the underlying interaction patterns with a latent graph structure, which is able to handle different types of agents in a unified way,</li>
<li>capturing the dynamics of interaction graph evolution for dynamic relational reasoning,</li>
<li>predicting future trajectories (state sequences) based on the historical observations and the latent interaction graph, and</li>
<li>capturing the uncertainty and multi-modality of future system behaviors.</li>
</ul>
<h2 id="relational-reasoning-with-graph-representation">Relational Reasoning with Graph Representation</h2>
<h3 id="observation-graph-and-interaction-graph">Observation Graph and Interaction Graph</h3>
<p style="text-align:center;">
<img src="https://bair.berkeley.edu/static/blog/evolvegraph/figure3.png" width="60%" />
<br />
<i>Figure 3. An illustration of the observation graph and interaction graph.</i>
</p>
<p>The multi-agent interacting system is naturally represented by a graph, where agents are considered as nodes and their relations are considered as edges. We have two types of graphs for different purposes, which are introduced below:</p>
<ul>
<li><strong>Observation Graph</strong>: The observation graph aims to extract feature embeddings from raw observations, which consists of N agent nodes and one context node. Agent nodes are bidirectionally connected to each other, and the context node only has outgoing edges to each agent node. Each agent node has two types of attributes: self-attribute and social-attribute. The former only contains the node’s own state information, while the latter only contains other nodes’ state information.</li>
<li><strong>Interaction Graph</strong>: We use different edge types to represent distinct interaction patterns. No edge between a pair of nodes means that the two nodes have no relation. The interaction graph represents interaction patterns with a distribution of edge types for each edge, which is built on top of the observation graph.</li>
</ul>
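<p>The connectivity just described can be sketched as a directed adjacency matrix. This is a minimal illustration of the wiring only; the actual model attaches feature embeddings and attributes to nodes and edges, not just 0/1 connectivity:</p>

```python
import numpy as np

# Observation-graph connectivity as described above: N agent nodes that are
# bidirectionally connected, plus one context node with outgoing edges only.
# Entry A[i, j] = 1 means a directed edge from node i to node j.
def observation_graph_adjacency(num_agents):
    n = num_agents + 1                # last index is the context node
    A = np.ones((n, n), dtype=int)
    np.fill_diagonal(A, 0)            # no self-loops
    A[:, num_agents] = 0              # context node has no incoming edges
    return A

A = observation_graph_adjacency(3)
```

<p>For three agents, the context node’s row is all ones (it sends to every agent) and its column is all zeros (it receives nothing).</p>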
<p style="text-align:center;">
<img src="https://bair.berkeley.edu/static/blog/evolvegraph/figure4.png" width="80%" />
<br />
<i>Figure 4. A high-level graphical illustration of EvolveGraph.</i>
</p>
<h3 id="dynamic-interaction-graph-learning">Dynamic Interaction Graph Learning</h3>
<p>In many situations, the interaction patterns recognized from the past time steps are unlikely to remain static in the future. Moreover, many interacting systems are multi-modal in nature, and different future modalities are likely to result in different interaction patterns and outcomes. Therefore, we designed a process that dynamically evolves the interaction patterns.</p>
<p>As illustrated in Figure 4, the encoding process is repeated every \(\tau\) (re-encoding gap) time steps to obtain the latent interaction graph based on the latest observation graph. A recurrent unit (GRU) is utilized to maintain and propagate the history information, as well as to adjust the prior interaction graphs. More details can be found in <a href="https://papers.nips.cc/paper/2020/hash/e4d8163c7a068b65a64c89bd745ec360-Abstract.html">our paper</a>.</p>
<h3 id="uncertainty-and-multi-modality">Uncertainty and Multi-Modality</h3>
<p>Here we emphasize the efforts to encourage diverse and multi-modal trajectory prediction and generation. In our framework, the uncertainty and multi-modality mainly come from three aspects:</p>
<ul>
<li>First, in the decoding process, we output Gaussian mixture distributions, indicating that there are several possible modalities at the next step. We sample only a single Gaussian component at each step, based on the component weights, which indicate the probability of each modality.</li>
<li>Second, different sampled trajectories will lead to different interaction graph evolution. Evolution of interaction graphs contributes to the multi-modality of future behaviors, since different underlying relational structures enforce different regulations on the system behavior and lead to various outcomes.</li>
<li>Third, directly training such a model tends to collapse to a single mode. Therefore, we employ an effective mechanism to mitigate mode collapse and encourage multi-modality. During training, we run the decoding process \(d\) times, which generates \(d\) trajectories for each agent in a given scenario. We choose only the prediction hypothesis with the minimal loss for backpropagation, which is the one most likely to be in the same mode as the ground truth. The other prediction hypotheses may have much higher loss, but that doesn’t necessarily imply they are implausible; they may represent other potentially reasonable modalities.</li>
</ul>
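<p>The third mechanism above, decoding \(d\) hypotheses but backpropagating only the one closest to the ground truth, can be sketched as follows (the function name, shapes, and squared-error loss are illustrative):</p>

```python
import numpy as np

def best_of_many_loss(hypotheses, ground_truth):
    """hypotheses: (d, T, 2) array of predicted trajectories;
    ground_truth: (T, 2). Returns the loss of the closest hypothesis,
    which is the only one that would be backpropagated."""
    # Mean over time of the per-step squared Euclidean error.
    errors = np.mean(np.sum((hypotheses - ground_truth) ** 2, axis=-1), axis=-1)
    best = int(np.argmin(errors))
    return errors[best], best

# Toy check: the second hypothesis matches the ground truth exactly.
gt = np.array([[0.0, 0.0], [1.0, 1.0]])
hyps = np.stack([gt + 1.0, gt, gt - 2.0])
loss, best = best_of_many_loss(hyps, gt)
```

<p>Because gradients flow only through the selected hypothesis, the other decoder samples remain free to cover alternative modes rather than being pulled toward the single observed outcome.</p>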
<h2 id="experiments">Experiments</h2>
<p>We highlight the results of two case studies on a synthetic physics system and an urban driving scenario. More experimental details and case studies on pedestrians and sports players can be found in <a href="https://papers.nips.cc/paper/2020/hash/e4d8163c7a068b65a64c89bd745ec360-Abstract.html">our paper</a>.</p>
<h3 id="case-study-1-particle-physics-system">Case Study 1: Particle Physics System</h3>
<p>We experimented with a simulated particle system in which the relations change. Multiple particles are initially linked and move together. The links disappear as soon as a certain criterion on the particle state is satisfied, and the particles move independently thereafter. The model is expected to learn the criterion by itself and perform both edge type prediction and trajectory prediction. Since the system is deterministic in nature, we do not consider multi-modality in this task.</p>
<p>We predicted the particle states for the future 50 time steps based on observations of 20 time steps. We set two edge types in this task, corresponding to “with link” and “without link”. The results of edge type prediction are summarized in Table 1, averaged over 3 independent runs. “No Change” means the underlying interaction structure stays the same over the whole horizon, while “Change” means the interaction patterns change at some point. The supervised learning baseline, which directly trains the encoding functions with ground truth labels, performs best in both setups and serves as a “gold standard”. Under the “No Change” setup, <a href="https://arxiv.org/abs/1802.04687">NRI (dynamic)</a> is comparable to EvolveGraph (RNN re-encoding), while EvolveGraph (static) achieves the best performance. The reason is that the dynamic evolution of the interaction graph leads to higher flexibility but may result in larger uncertainty, which affects edge prediction in systems with static relational structures. Under the “Change” setup, NRI (dynamic) re-evaluates the latent graph at every time step during the testing phase, but it is hard for it to capture the dependency between consecutive graphs, and the encoding functions may not be flexible enough to capture the evolution. EvolveGraph (RNN re-encoding) performs better since it considers the dependency between consecutive steps during the training phase, but it still captures the evolution only at the feature level instead of the graph level. EvolveGraph (dynamic) achieves significantly higher accuracy than the other baselines (except Supervised), due to the explicit evolution of interaction graphs.</p>
<p style="text-align:center;">
<img src="https://bair.berkeley.edu/static/blog/evolvegraph/table1.png" width="80%" />
<br />
<i></i>
</p>
<p style="text-align:center;">
<img src="https://bair.berkeley.edu/static/blog/evolvegraph/figure5.png" width="80%" />
<br />
<i>Figure 5. Visualization of latent interaction graph evolution and particle trajectories. (a) The top two figures show the probability of the first edge type ("with link") at each time step. Each row corresponds to a certain edge (shown on the right). The actual times of graph evolution are 54 and 62, respectively. The model is able to capture the underlying criterion of relation change and further predict the change of edge types with nearly no delay. (b) The figures in the last row show trajectory prediction results, where semi-transparent dots are historical observations.</i>
</p>
<h3 id="case-study-2-traffic-scenarios">Case Study 2: Traffic Scenarios</h3>
<p style="text-align:center;">
<img src="https://bair.berkeley.edu/static/blog/evolvegraph/table2.png" width="80%" />
<br />
<i></i>
</p>
<p style="text-align:center;">
<img src="https://bair.berkeley.edu/static/blog/evolvegraph/figure6.png" width="80%" />
<br />
<i>Figure 6. Visualization of testing cases in traffic scenarios. Dashed lines are historical trajectories, solid lines are ground truth, and dash-dotted lines are prediction hypotheses. White areas represent drivable areas and gray areas represent sidewalks. We plotted the prediction hypothesis with the minimal average prediction error, and the heatmap to represent the distributions.</i>
</p>
<p>We predicted the future 10 time steps (4.0s) based on the historical 5 time steps (2.0s). The comparison of quantitative results is shown in Table 2, where the unit of reported \(minADE_{20}\) and \(minFDE_{20}\) is meters in the world coordinates. All the baseline methods consider the relations and interactions among agents. The <a href="https://arxiv.org/abs/1710.04689">Social-Attention</a> employs spatial attention mechanisms, while the <a href="https://openaccess.thecvf.com/content_cvpr_2018/papers/Gupta_Social_GAN_Socially_CVPR_2018_paper.pdf">Social-GAN</a> demonstrates a deep generative model which learns the data distribution to generate human-like trajectories. The <a href="https://openaccess.thecvf.com/content_ICCV_2019/html/Choi_Looking_to_Relations_for_Future_Trajectory_Forecast_ICCV_2019_paper.html">Gated-RN</a> and <a href="https://arxiv.org/abs/2001.03093">Trajectron++</a> both leverage spatio-temporal information to involve relational reasoning, which leads to smaller prediction error. The <a href="https://arxiv.org/abs/1802.04687">NRI</a> infers a latent interaction graph and learns the dynamics of agents, which achieves similar performance to Trajectron++. The <a href="https://openaccess.thecvf.com/content_ICCV_2019/papers/Huang_STGAT_Modeling_Spatial-Temporal_Interactions_for_Human_Trajectory_Prediction_ICCV_2019_paper.pdf">STGAT</a> and <a href="https://openaccess.thecvf.com/content_CVPR_2020/papers/Mohamed_Social-STGCNN_A_Social_Spatio-Temporal_Graph_Convolutional_Neural_Network_for_Human_CVPR_2020_paper.pdf">Social-STGCNN</a> further take advantage of the graph neural network to extract relational features in the multi-agent setting. Our proposed method achieves the best performance, which implies the advantages of explicit interaction modeling via evolving interaction graphs. The 4.0s \(minADE_{20}\) / \(minFDE_{20}\) are significantly reduced by 20.0% / 27.1% compared to the best baseline approach (STGAT).</p>
<p>The visualization of some testing cases is provided in Figure 6. Our framework can generate accurate and plausible trajectories. More specifically, in the top left case, for the blue prediction hypothesis at the left bottom, there is an abrupt change at the fifth prediction step. This is because the interaction graph evolved at this step. Moreover, in the heatmap, there are multiple possible trajectories starting from this point, which represent multiple potential modalities. These results show that the evolving interaction graph can reinforce the multi-modal property of our model since different samples of trajectories at the previous steps lead to different directions of graph evolution, which significantly influences the prediction afterwards. In the top right case, each car may leave the roundabout at any exit. Our model can successfully show the modalities of exiting the roundabout and staying in it. Moreover, if exiting the roundabout, the cars are predicted to exit on their right, which implies that the modalities predicted by our model are plausible and reasonable.</p>
<h2 id="summary-and-broader-applications">Summary and Broader Applications</h2>
<p>We introduce EvolveGraph, a generic trajectory prediction framework with dynamic relational reasoning, which can handle evolving interacting systems involving multiple heterogeneous, interactive agents. The proposed framework could be applied to a wide range of applications, from purely physical systems to complex social dynamics systems. In this blog, we demonstrate some illustrative applications to physical objects and traffic participants. The framework could also be applied to analyze and predict the evolution of larger interacting systems, such as complex physical systems with a large number of interacting components, social networks, and macroscopic traffic flows. Although there are existing works using graph neural networks to handle trajectory prediction tasks, here we emphasize the impact of using our framework to recognize and predict the evolution of the underlying relations. With accurate and reasonable relational structures, we can forecast or generate plausible system behaviors, which greatly helps optimal decision making and other downstream tasks.</p>
<p><strong>Acknowledgements</strong>: We thank all the co-authors of the paper “EvolveGraph: Multi-Agent Trajectory Prediction with Dynamic Relational Reasoning” for their contributions and discussions in preparing this blog. The views and opinions expressed in this blog are solely of the authors.</p>
<p>This blog post is mainly based on the following paper:</p>
<p>EvolveGraph: Multi-Agent Trajectory Prediction with Dynamic Relational Reasoning<br />
Jiachen Li*, Fan Yang*, Masayoshi Tomizuka, and Chiho Choi<br />
Advances in Neural Information Processing Systems (NeurIPS), 2020<br />
<a href="https://papers.nips.cc/paper/2020/hash/e4d8163c7a068b65a64c89bd745ec360-Abstract.html">Proceedings</a>, <a href="https://arxiv.org/abs/2003.13924">Preprint</a>, <a href="https://jiachenli94.github.io/publications/Evolvegraph/">Project Website</a>, <a href="https://github.com/jiachenli94/Awesome-Interaction-aware-Trajectory-Prediction">Resources</a></p>
<p>Some other related works are listed as follows:</p>
<p>Conditional Generative Neural System for Probabilistic Trajectory Prediction<br />
Jiachen Li, Hengbo Ma, and Masayoshi Tomizuka<br />
IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), 2019<br />
<a href="https://ieeexplore.ieee.org/abstract/document/8967822">Proceedings</a>,
<a href="https://arxiv.org/abs/1905.01631">Preprint</a></p>
<p>Interaction-aware Multi-agent Tracking and Probabilistic Behavior Prediction via Adversarial Learning<br />
Jiachen Li*, Hengbo Ma*, and Masayoshi Tomizuka<br />
IEEE International Conference on Robotics and Automation (ICRA), 2019<br />
<a href="https://ieeexplore.ieee.org/abstract/document/8793661">Proceedings</a>,
<a href="https://arxiv.org/abs/1904.02390">Preprint</a></p>
<p>Generic Tracking and Probabilistic Prediction Framework and Its Application in Autonomous Driving<br />
Jiachen Li, Wei Zhan, Yeping Hu, and Masayoshi Tomizuka<br />
IEEE Transactions on Intelligent Transportation Systems, 2020<br />
<a href="https://ieeexplore.ieee.org/document/8789525">Article</a>,
<a href="https://arxiv.org/abs/1908.09031">Preprint</a></p>
<p><i>Published Wed, 18 Nov 2020 at <a href="https://bairblog.github.io/2020/11/18/evolvegraph/">https://bairblog.github.io/2020/11/18/evolvegraph/</a>.</i></p>
<h1 id="training-on-test-inputs-with-amortized-conditional-normalized-maximum-likelihood">Training on Test Inputs with Amortized Conditional Normalized Maximum Likelihood</h1>
<p>Current machine learning methods provide unprecedented accuracy across a range
of domains, from computer vision to natural language processing. However, in
many important high-stakes applications, such as medical diagnosis or
autonomous driving, rare mistakes can be extremely costly, and thus effective
deployment of learned models requires not only high accuracy, but also a way to
measure the certainty in a model’s predictions. Reliable uncertainty
quantification is especially important when faced with out-of-distribution
inputs, as model accuracy tends to degrade heavily on inputs that differ
significantly from those seen during training. In this blog post, we will
discuss how we can get reliable uncertainty estimation with a strategy that
does not simply rely on a learned model to extrapolate to out-of-distribution
inputs, but instead asks: “given my training data, which labels would make
sense for this input?”.</p>
<!--more-->
<p>To illustrate how this can allow for more reasonable predictions on
out-of-distribution data, consider the following example of classifying
automobiles, where all the class 1 training examples are sedans and all the
class 2 examples are large buses.</p>
<p style="text-align:center;">
<img src="https://bair.berkeley.edu/static/blog/acnml/fig1_ambiguous_classification_example.JPG" />
<br />
<i>
Figure 1: Given previously seen examples, it is uncertain what the label for
the new query point should be. Different classifiers that work well on the
training set can give different predictions on the query point.
</i>
</p>
<p>A classifier could potentially fit the training labels correctly based on
several different explanations; for example, it could notice that buses are all
longer than sedans and classify accordingly, or it could perhaps pay attention
to the height of the vehicle instead. However, if we try to simply extrapolate
to an out-of-distribution image of a limousine, the classifier’s output could
be unpredictable and arbitrary. A classifier based on length could note that
the limousine is similar to the buses in its length and confidently predict
class 2, while a classifier utilizing the height could confidently predict
class 1. Based only on the training set, there is not enough information to
accurately decide which class a limousine should fit into, so we would ideally
want our classifier to indicate uncertainty instead of providing arbitrary
confident predictions for either class. On the other hand, if we explicitly try
to find models that explain each potential label, we would find reasonable
explanations for either label, suggesting that we should be uncertain about
predicting which class the limousine belongs to.</p>
<p>We can instantiate this reasoning with an algorithm that, for every possible
label, explicitly updates the model to try to explain that label for the query
point and combines the different models to obtain well-calibrated predictions
for out-of-distribution inputs. In this blog post, we will motivate and
introduce
<a href="https://arxiv.org/abs/2011.02696">amortized conditional normalized maximum likelihood</a>
(ACNML), a practical instantiation of this idea that enables reliable
uncertainty estimation with deep neural networks.</p>
<h1 id="conditional-normalized-maximum-likelihood">Conditional Normalized Maximum Likelihood</h1>
<p>Our method extends a prediction strategy from the minimum description length
(MDL) literature known as <em>conditional normalized maximum likelihood</em> (CNML),<sup id="fnref:cnml" role="doc-noteref"><a href="#fn:cnml" class="footnote">1</a></sup>
which has been studied for its theoretical properties, but is computationally
intractable for all but the simplest problems. We will first review CNML and
discuss how its predictions can lead to conservative uncertainty estimates. We
will then describe our method, which allows for a practical application of CNML
to obtain uncertainty estimates for deep neural networks.</p>
<p>The CNML distribution is derived from a minimax optimality criterion:
we define a regret for each label and choose the distribution
that minimizes the worst-case regret over labels. Given a training set
$D_{\rm train}$, a query input $x$, and a set of potential models $\Theta$, we
define the regret for each label to be the difference between the negative
log-likelihood loss for our distribution and the loss under the model that best
fits the training dataset together with the query point and label.</p>
<p style="text-align:center;">
<img src="https://bair.berkeley.edu/static/blog/acnml/eq1_equation_cnml_defs.JPG" />
<br />
</p>
<p>Intuitively, minimizing the worst case regret over labels then ensures our
predictive distribution is conservative, as it cannot assign low probabilities
to any labels that appear consistent with our training data, where consistency
is determined by the model class.</p>
<p>The minimax optimal distribution given a particular input $x$ and training set
$\mathcal D$ can be explicitly computed as follows:</p>
<ol>
<li>
<p>For each label $y$, we append $(x,y)$ to our training set and compute the
new optimal parameters $\hat \theta_y$ for this modified training set.</p>
</li>
<li>
<p>Use $\hat \theta_y$ to assign probability for that label.</p>
</li>
<li>
<p>Since these probabilities will now sum to a number greater than 1, we
normalize to obtain a valid distribution over labels.</p>
</li>
</ol>
\[p_{\text{CNML}}(y \vert x) = \frac{p_{\hat \theta_y}(y\vert x)}{\sum_{y'} p_{\hat \theta_{y'}}(y'\vert x)}\]
<p>CNML has the interesting property that it explicitly optimizes the model to
make predictions on the query input, which can lead to more reasonable
predictions than simply extrapolating using a model obtained only from the
training set. It can also lead to more conservative predictions on
out-of-distribution inputs, since it would be easier to fit different labels
for out-of-distribution points without affecting performance on the training
set.</p>
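<p>To make steps 1–3 concrete, here is a minimal sketch of naive CNML for binary logistic regression. This is not the authors' code: the function name <code>cnml_predict</code> and its parameters are our own choices, and scikit-learn's <code>LogisticRegression</code> stands in for the model class (its inverse-regularization parameter <code>C</code> effectively makes this the regularized CNMAP variant discussed later).</p>

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def cnml_predict(X_train, y_train, x_query, labels=(0, 1), C=1000.0):
    """Naive CNML for binary logistic regression (illustrative sketch).

    For each candidate label y: append (x_query, y) to the training set,
    refit the model, and record the probability the refit model assigns
    to y at x_query. Normalizing these scores gives the CNML distribution.
    """
    scores = []
    for y in labels:
        X_aug = np.vstack([X_train, x_query[None, :]])
        y_aug = np.append(y_train, y)
        theta_y = LogisticRegression(C=C, max_iter=1000).fit(X_aug, y_aug)
        # p_{theta_y}(y | x_query): probability of the appended label y
        class_idx = list(theta_y.classes_).index(y)
        scores.append(theta_y.predict_proba(x_query[None, :])[0, class_idx])
    scores = np.asarray(scores)
    return scores / scores.sum()
```

<p>Because each query requires refitting the model once per label, this construction already hints at the computational cost discussed below.</p>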
<p>We illustrate CNML with a 2-dimensional logistic regression example. We compare
heatmaps of CNML probabilities with the maximum likelihood classifier to
illustrate how CNML provides conservative uncertainty estimates on points away
from the data. With this model class, CNML expresses uncertainty and assigns a
uniform distribution to any query point for which the dataset remains linearly
separable (meaning there exists a linear decision boundary that correctly
classifies all data points) regardless of which label is assigned to the query
point.</p>
<p style="text-align:center;">
<img src="https://bair.berkeley.edu/static/blog/acnml/fig2_logistic_example_cnml_vs_mle.JPG" />
<br />
<i>
Figure 2: Here, we show the heatmap of CNML predictions (left) and the
predictions of the training set MLE $\hat \theta_{\text{train}}$ (right). The
training inputs are shown with blue (class 0) and orange (class 1) dots. Blue
shading indicates higher probability for class 0 on that input and red
shading indicates higher probability for class 1, with darker colors indicating
more confident predictions. We note that while the original classifier assigns
confident predictions for most inputs, CNML assigns close to uniform for most
points between the two clusters of training points, indicating high uncertainty
about these ambiguous inputs.
</i>
</p>
<p>We illustrate how CNML computes these probabilities by showing the base
classifier predictions under parameters $\hat \theta_{\text{train}}$ (the
training set MLE), as well as $\hat \theta_{\text{0}}$, $\hat
\theta_{\text{1}}$, the parameters computed by CNML after assigning the
labels 0 and 1 respectively to a query point.</p>
<p>We first consider an out-of-distribution query point far away from any of the
training inputs (shown in pink in the bottom of the leftmost image). In the
left image, we see the original decision boundary for $\hat
\theta_{\text{train}}$ confidently classifies the query point as class 0. In
the middle, we see the decision boundary of $\hat \theta_0$ similarly
classifies the query point as class 0. However, we see in the rightmost image
that $\hat \theta_1$ is able to confidently classify the query point as class 1.
Since $\hat \theta_0$ confidently predicts class 0 for the query point and
$\hat \theta_1$ confidently predicts class 1, CNML normalizes the two
predictions to assign roughly equal probability to either label.</p>
<p style="text-align:center;">
<img src="https://bair.berkeley.edu/static/blog/acnml/fig3_cnml_ood_mles.JPG" />
<br />
<i>
Figure 3: Query point shown in pink. Both $\hat \theta_{\text{train}}$ and
$\hat \theta_0$ classify the query point as class 0, but $\hat \theta_1$ is
able to classify it as class 1.
</i>
</p>
<p>On the other hand, for an in-distribution query point (again shown in pink) in
the middle of the class 0 training inputs, no linear classifier can fit a label
of 1 to the query point while still accurately fitting the rest of the training
data, so the CNML distribution still confidently predicts class 0 on this query
point.</p>
<p style="text-align:center;">
<img src="https://bair.berkeley.edu/static/blog/acnml/fig4_cnml_indist_mles.JPG" />
<br />
<i>
Figure 4: Query point shown in pink. All parameters are forced to classify the
query point as class 0 since it is in the middle of the class 0 training
points.
</i>
</p>
<h2 id="controlling-conservativeness-via-regularization">Controlling Conservativeness via Regularization</h2>
<p>We see in Figure 2 that CNML probabilities are uniform on most of the input
space, arguably being <em>too conservative</em>. In this case, the model class is in
some sense too expressive, as linear predictors with large coefficients
can assign arbitrarily high probabilities to each label as long as the data
remains linearly separable. This problem is exacerbated with even more
expressive model classes like deep neural networks, which can potentially fit
arbitrary labels. In order to have CNML give more useful predictions, we would
need to constrain the allowed set of models to better reflect our notion of
reasonable models.</p>
<p>To accomplish this, we generalize CNML to incorporate regularization via a
prior term, resulting in conditional
<a href=" https://papers.nips.cc/paper/2005/hash/952c3ff98a6acdc36497d839e31aa57c-Abstract.html">normalized maximum a posteriori</a>
(CNMAP). Instead of computing maximum likelihood
parameters for the training dataset and the new input and label, we compute
maximum a posteriori (MAP) solutions, with the prior term $p(\theta)$
serving as a regularizer to limit the complexity of the selected model.</p>
\[\hat \theta_y = \arg \max_\theta \log p_{\theta}(y \vert x) + \log
p_{\theta}(D_{\text{train}}) + \log p(\theta)\]
<p>Going back to the logistic regression example, we add different levels of L2
regularization to the parameters (corresponding to Gaussian priors) and plot
CNMAP probabilities in Figure 5 below. As regularization increases, CNMAP
becomes less conservative, with the assigned probabilities transitioning much
more smoothly as one moves away from the training points.</p>
<p style="text-align:center;">
<img src="https://bair.berkeley.edu/static/blog/acnml/fig5_cnmap_logistic.jpg" />
<br />
<i>
Figure 5: Heatmaps of CNMAP probabilities under varying amounts of
regularization $\lambda \|w\|_2^2$. Increasing regularization leads to less
conservative predictions.
</i>
</p>
<h2 id="computational-intractability">Computational Intractability</h2>
<p>While we see that CNML is able to provide conservative predictions for OOD
inputs, computing CNML predictions requires retraining the model using the
entire training set multiple times <strong>for each test input</strong>, which can be very
computationally expensive. Although explicitly computing CNML distributions was
feasible in our logistic regression example with small datasets, it would be
computationally intractable to compute CNML naively with datasets consisting of
thousands of images and using deep convolutional neural networks, as retraining
the model just once could already take many hours. Even initializing from the
solution to the training set and finetuning for several epochs after adding the
query input and label could still take several minutes per input, rendering it
impractical to use with deep neural networks and large datasets.</p>
<h1 id="amortized-conditional-normalized-maximum-likelihood">Amortized Conditional Normalized Maximum Likelihood</h1>
<p>Since exactly computing CNML or CNMAP distributions is computationally
infeasible in deep learning settings due to the need to optimize over large
datasets for each new input and label, we need a tractable approximation. In
our method, <em>amortized conditional normalized maximum likelihood</em> (ACNML), we
utilize approximate Bayesian posteriors to capture necessary information about
the training data in order to efficiently compute the MAP/MLE solutions for
each datapoint. ACNML <em>amortizes</em> the costs of repeatedly optimizing over the
training set by first computing an approximate Bayesian posterior, which serves
as a compact approximation to the training losses.</p>
<h2 id="cnmap-and-bayesian-posteriors">CNMAP and Bayesian Posteriors</h2>
<p>We note that the main computational bottleneck is the need to optimize over the
entire training set for each query point. In order to sidestep this issue, we
first show a relationship between the MAP parameters needed in CNMAP and
Bayesian posterior densities:</p>
\[\hat \theta_y = \arg \max_\theta \log p_{\theta}(y \vert x) + \underbrace{\log p_{\theta}(D_{\text{train}}) + \log p(\theta)}_{\text{equal to }\log p(\theta \vert D_{\text{train}})}\]
<p>Rather than computing optimal parameters for the new query point and the
training set, we can reformulate CNMAP as optimizing over just the query point
and a posterior density. With a uniform prior (equivalent to having no
regularizer), we can recover the maximum likelihood parameters to perform CNML
if desired.<sup id="fnref:acnml" role="doc-noteref"><a href="#fn:acnml" class="footnote">2</a></sup></p>
<p>ACNML now utilizes approximate Bayesian inference to replace the exact Bayesian
posterior density with a tractable density $q(\theta)$. As many methods have
been proposed for approximate Bayesian inference in deep learning, we can
simply utilize any approximate posterior that provides tractable densities for
ACNML, though we focus on Gaussian approximate posteriors for simplicity and
computational efficiency. After computing the approximate posterior once during
training, the test-time optimization procedure becomes much simpler, as we only
need to optimize over our approximate posterior instead of the training set.
When we instantiate ACNML and initialize from the MAP solution, we find that it
typically takes only a handful of gradient updates to compute new (approximate)
optimal parameters for each label, resulting in much faster test-time inference
than a naive CNML instantiation that fine-tunes using the whole training set.</p>
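<p>As a concrete sketch of this test-time procedure (not the authors' implementation), consider binary logistic regression with a diagonal Gaussian approximate posterior $q(\theta) = \mathcal{N}(\mu, \mathrm{diag}(\sigma^2))$. The name <code>acnml_predict</code>, the step size, and the step count below are illustrative choices; for each label, gradient ascent initialized at the posterior mode maximizes $\log p_\theta(y \vert x) + \log q(\theta)$, and the resulting per-label probabilities are normalized.</p>

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def acnml_predict(x, mu, sigma2, labels=(0, 1), lr=0.05, steps=200):
    """ACNML sketch for binary logistic regression with a diagonal
    Gaussian approximate posterior q = N(mu, diag(sigma2)).

    For each candidate label y, gradient ascent initialized at mu
    maximizes log p_theta(y | x) + log q(theta); the probabilities the
    resulting per-label parameters assign to their labels are normalized.
    """
    scores = []
    for y in labels:
        theta = mu.copy()
        for _ in range(steps):
            p1 = sigmoid(theta @ x)
            grad_lik = (y - p1) * x              # d log p(y|x) / d theta
            grad_prior = -(theta - mu) / sigma2  # d log q(theta) / d theta
            theta = theta + lr * (grad_lik + grad_prior)
        p1 = sigmoid(theta @ x)
        scores.append(p1 if y == 1 else 1.0 - p1)
    scores = np.asarray(scores)
    return scores / scores.sum()
```

<p>A tight posterior (small $\sigma^2$) pins the per-label parameters near $\mu$ and yields confident predictions, while a loose posterior lets either label be fit well, pushing the prediction toward uniform.</p>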
<p style="text-align:center;">
<img src="https://bair.berkeley.edu/static/blog/acnml/fig6_alg_summary.JPG" />
<br />
<i>
Figure 6: A summary of the ACNML algorithm.
</i>
</p>
<p>In our <a href="https://arxiv.org/abs/2011.02696">paper</a>, we analyze the
approximation error incurred by using a particular Gaussian posterior in place
of the exact training data likelihoods, and show that under certain
assumptions, the approximation is accurate when the training set is large.</p>
<h2 id="experiments">Experiments</h2>
<p>We instantiate ACNML with two different Gaussian posterior approximations,
<a href="https://arxiv.org/abs/1902.02476">SWAG-Diagonal</a> and
<a href="https://openreview.net/forum?id=Skdvd2xAZ">KFAC-Laplace</a>
and train models on the CIFAR-10 image classification dataset. To evaluate
out-of-distribution performance, we then evaluate on the CIFAR-10 Corrupted
datasets, which apply a diverse range of common image corruptions at different
intensities, allowing us to see how well methods perform under varying levels
of distribution shift. We compare against methods using Bayesian
marginalization, which average predictions across different models sampled from
the approximate posterior. We note that all methods provide very similar
accuracy both in-distribution and out-of-distribution, so we focus on comparing
uncertainty estimates.</p>
<p style="text-align:center;">
<img src="https://bair.berkeley.edu/static/blog/acnml/fig7_reliability_diagrams.JPG" />
<br />
<i>
Figure 7: Reliability Diagrams comparing ACNML against the corresponding
Bayesian model averaging method (SWAGD) and the MAP solution (SWA). ACNML
generally predicts with lower confidence than other methods, leading to
comparatively better uncertainty estimation as the data becomes more
out-of-distribution.
</i>
</p>
<p>We first examine ACNML’s predictions using reliability diagrams, which
aggregate the test data points into buckets based on how confident the model’s
predictions are, then plot the average confidence in a bucket against the
actual accuracy of the predictions. These diagrams show the distribution of
predicted confidences and can capture how effectively a model’s confidence
reflects the actual uncertainty over the prediction.</p>
<p>As we would expect from our earlier discussion about CNML, we find that ACNML
reliably gives more conservative (less confident) predictions than other
methods, to the point where its predictions are actually <em>underconfident</em> on
the in-distribution CIFAR10 test set where all methods provide very accurate
predictions. However, on the out-of-distribution CIFAR10-C tasks where
classifier accuracy degrades, ACNML’s conservativeness provides much more
reliable confidence estimates, while other methods tend to be severely
overconfident.</p>
<p style="text-align:center;">
<img src="https://bair.berkeley.edu/static/blog/acnml/fig8_ece_curves.JPG" width="70%" />
<br />
<i>
Figure 8: ECE comparisons: We compare instantiations of ACNML with two
different approximate posteriors against their Bayesian counterparts.
</i>
</p>
<p>We quantitatively measure calibration using the
<a href="https://people.cs.pitt.edu/~milos/research/AAAI_Calibration.pdf">Expected Calibration Error</a>,
which uses the same buckets as the reliability diagrams and
computes the average calibration error (absolute difference between average
confidence and accuracy within the bucket) over all buckets. We see that ACNML
instantiations provide much better calibration than their Bayesian counterparts
and the deterministic baseline as the corruption intensities increase and the
data becomes more out-of-distribution.</p>
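<p>The bucketing computation described above fits in a few lines; the sketch below is generic (the function name and default bin count are our choices, not from the paper).</p>

```python
import numpy as np

def expected_calibration_error(confidences, correct, n_bins=10):
    """ECE sketch: bucket predictions by confidence, then average the
    absolute gap between mean confidence and accuracy per bucket,
    weighted by the fraction of points falling in the bucket."""
    confidences = np.asarray(confidences, dtype=float)
    correct = np.asarray(correct, dtype=float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(edges[:-1], edges[1:]):
        mask = (confidences > lo) & (confidences <= hi)
        if mask.any():
            gap = abs(confidences[mask].mean() - correct[mask].mean())
            ece += mask.mean() * gap
    return ece
```

<p>A perfectly calibrated model (e.g., 75% confidence with 75% accuracy) scores zero, while a confidently wrong model scores close to its confidence.</p>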
<h1 id="discussion">Discussion</h1>
<p>In this post, we discussed how we can obtain reliable uncertainty estimates on
out-of-distribution data by explicitly optimizing on the data we wish to make
predictions on instead of relying on trained models to extrapolate from the
training data. We then showed that this can be done concretely with the CNML
prediction strategy, a scheme that has been studied theoretically but is
computationally intractable to apply in practice. Finally we presented our
method, ACNML, a practical approximation to CNML that enables reliable
uncertainty estimation with deep neural networks. We hope that this line of
work will help enable broader applicability of large scale machine learning
systems, especially in safety-critical domains where uncertainty estimation is
a necessity.</p>
<hr />
<p>We thank Sergey Levine and Dibya Ghosh for providing valuable feedback on this post.</p>
<p>This post is based on the following paper:</p>
<ul>
<li>Aurick Zhou, Sergey Levine.<br />
<a href="https://arxiv.org/abs/2011.02696">Amortized Conditional Normalized Maximum Likelihood</a>.</li>
</ul>
<hr />
<div class="footnotes" role="doc-endnotes">
<ol>
<li id="fn:cnml" role="doc-endnote">
<p>For additional background on MDL and CNML, we refer to the following
<a href="https://mitpress.mit.edu/books/minimum-description-length-principle">textbook</a>.
The particular variant of CNML used here is referred to as CNML-3 in the text, and as sNML-1 in
<a href="https://ieeexplore.ieee.org/document/4601061">Roos et al</a>. <a href="#fnref:cnml" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
<li id="fn:acnml" role="doc-endnote">
<p>Despite referring to maximum likelihood in the name, ACNML actually
approximates the CNMAP procedure, with CNML being a special case of CNMAP
by using a uniform prior as regularizer. In practice, due to the
over-conservativeness of CNML mentioned before, we will use non-uniform
priors (as used by the Bayesian deep learning methods we build off of). <a href="#fnref:acnml" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
</ol>
</div>
<p><i>Published Mon, 16 Nov 2020 at <a href="https://bairblog.github.io/2020/11/16/acnml/">https://bairblog.github.io/2020/11/16/acnml/</a>.</i></p>
<h1 id="goodharts-law-diversity-and-a-series-of-seemingly-unrelated-toy-problems">Goodhart’s Law, Diversity and a Series of Seemingly Unrelated Toy Problems</h1>
<p>Goodhart’s Law is an adage which states the following:</p>
<blockquote>
<p>“When a measure becomes a target, it ceases to be a good measure.”</p>
</blockquote>
<p>This is particularly pertinent in machine learning, where the source of many of
our greatest achievements comes from optimizing a target in the form of a loss
function. The most prominent way to do so is with stochastic gradient descent
(SGD), which applies a simple rule: follow the gradient:
\[\theta_{t+1} = \theta_t - \alpha \nabla_\theta \mathcal{L}(\theta_t)\]
<p>for some step size $\alpha$. Updates of this form have led to a series of
breakthroughs from computer vision to reinforcement learning, and it is easy to
see why SGD is so popular: 1) it is relatively cheap to compute using backprop,
2) it is guaranteed to locally reduce the loss at every step, and finally 3) it
has an amazing empirical track record.</p>
<!--more-->
<p>However, we wouldn’t be writing this if SGD were perfect! In fact there are some
negatives. Most importantly, there is an intrinsic bias towards ‘easy’
solutions (typically associated with high negative curvature). In some cases,
two solutions with the same loss may be qualitatively different, and if one is
easier to find then it is likely to be the only solution found by SGD. This has
recently been referred to as a “shortcut” solution [1], examples of which are
below:</p>
<p style="text-align:center;">
<img src="https://bair.berkeley.edu/static/blog/ridge-rider/fig01.png" width="" />
<br />
</p>
<p>As we see, when classifying sheep, the network learns to use the green
background to identify the sheep present. When instead it is provided with an
image of sheep on a beach (an interesting prospect), it fails
altogether. Thus, the key question motivating our work is the following:</p>
<blockquote>
<p>Question: How can we find a diverse set of different solutions?</p>
</blockquote>
<p>Our answer to this is to follow eigenvectors of the Hessian (‘ridges’) with
negative eigenvalues from a saddle, in what we call <u>Ridge Rider</u> (RR).
There is a lot to unpack in that statement, so we will go into more detail in
the following section.</p>
<p style="text-align:center;">
<img src="https://bair.berkeley.edu/static/blog/ridge-rider/fig02.png" width="" />
<br />
</p>
<p>First, we assume we start at a saddle (green), where the norm of the gradient
is zero. We compute the eigenvectors \(\{ e_i(\theta) \}_{i=1}^d\) and
eigenvalues \(\{\lambda_i(\theta) \}_{i=1}^d\) of the $d$ dimensional Hessian,
which solve the following:</p>
\[\mathcal{H}(\theta)\, e_i(\theta) = \lambda_i(\theta)\, e_i(\theta), \quad \|e_i\| = 1\]
<p>And we follow the eigenvectors with negative eigenvalues, which we call ridges.
We can follow these in both directions. As you see in the diagram, when we take
a step along the ridge (in red) we reach a new point, where the gradient is
(to first order) the step size multiplied by the eigenvalue times the
eigenvector, since we stepped along an eigenvector of the Hessian. We then re-compute the spectrum and select the
new ridge as the one with the highest inner product with the previous, to
preserve the direction. We then take a step along a new ridge, to
$\theta_{t+2}$.</p>
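<p>A single RR update can be sketched as follows, assuming access to the full Hessian, which is feasible only in low dimensions; the approximate version discussed below relies on Hessian-vector products instead. The function name and signature are illustrative.</p>

```python
import numpy as np

def ridge_rider_step(hessian_fn, theta, e_prev, alpha=0.1):
    """One Ridge Rider update (illustrative full-Hessian sketch).

    Step along the previous ridge, recompute the Hessian spectrum, and
    select the new ridge (eigenvector with negative eigenvalue) having
    the largest inner product with the previous one.
    """
    theta_new = theta - alpha * e_prev
    eigvals, eigvecs = np.linalg.eigh(hessian_fn(theta_new))
    neg = eigvals < 0                          # ridges: negative curvature
    cands, cand_vals = eigvecs[:, neg], eigvals[neg]
    sims = cands.T @ e_prev                    # alignment with previous ridge
    idx = int(np.argmax(np.abs(sims)))
    e_new = cands[:, idx] * np.sign(sims[idx])  # preserve the direction
    return theta_new, e_new, cand_vals[idx]
```

<p>Note the sketch assumes at least one negative eigenvalue remains; when none does, the ridge has bottomed out and the branch terminates.</p>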
<p>So why does this matter? Well, in the paper we show that if the inner
product between the new and the old ridge is greater than zero, then we are
theoretically guaranteed to reduce the loss. What this means is that RR provides
us with an <strong>orthogonal set of loss-reducing directions</strong>, as opposed to
SGD, which will almost always follow just one.</p>
<h1 id="the-full-picture">The full picture</h1>
<p>In the next diagram we show the full Ridge Rider algorithm.</p>
<p style="text-align:center;">
<img src="https://bair.berkeley.edu/static/blog/ridge-rider/MIS.png" width="" />
<br />
</p>
<p>Now, clearly there are a lot of possible eigenvectors that we may have to
explore and, what is worse, due to the symmetries inherent in many optimization
problems, a large number of these eigenvectors will ultimately produce
equivalent solutions when followed. Luckily there is a way around this: If we
start at a saddle that is invariant under all of the symmetry transforms of the
loss function, then at this special saddle the equivalent solutions have
<strong>identical</strong> eigenvalues. As such, at this saddle, which we refer to as the
<em>Maximally Invariant Saddle (MIS)</em>, the eigenvalues provide a natural
<strong>grouping</strong> for the possible solutions and we can start by exploring one
solution from each group.</p>
<p>From the MIS we branch by computing the spectrum of the Hessian via GetRidges,
and select a ridge to follow, which we update using the UpdateRidge method
until we reach a breaking point where we branch again via GetRidges. At this
point we can choose whether to continue along the current path or select
another ridge from the buffer. This is equivalent to choosing between
breadth-first search and depth-first search. Finally, the leaves of the tree are the
solutions to our problem, each uniquely defined by its fingerprint.</p>
<p style="text-align:center;">
<img src="https://bair.berkeley.edu/static/blog/ridge-rider/gif-01.gif" width="" />
<br />
</p>
<p>On the positive side, RR provides us with a set of orthogonal locally
loss-reducing directions that can be used to span a tree of solutions. It
essentially turns optimization into a search problem, which allows us to
introduce new methods to use for finding solutions. We also benefit from the
natural ordering and grouping scheme provided by the eigenvalues (Fingerprint).</p>
<p>However, of course, there are many obvious questions that naturally arise with
this approach. Here we try to answer the FAQs:</p>
<blockquote>
<p>Q: This seems expensive! Don’t you need loads of samples due to the high variance of the Hessian?</p>
</blockquote>
<p>A: Yes, that is fair! :( Nevertheless, we can use Hessian Vector Products to
make the computations tractable.</p>
<blockquote>
<p>Q: This seems expensive! Do you need to re-evaluate the full spectrum of the Hessian each timestep?</p>
</blockquote>
<p>A: Actually no! We present an approximate version of RR using Hessian Vector
Products. We will go into this next.</p>
<p>We use the Power/Lanczos method in GetRidges. In UpdateRidge, after each step
along the ridge, we find the new $e_i, \lambda_i$ pair by minimizing:</p>
\[L(e_i, \lambda_i; \theta) = \left\| \frac{1}{\lambda_i} \frac{\mathcal{H}(\theta)\, e_i}{\| e_i \|} - \frac{e_i}{\| e_i \|} \right\|^2\]
<p>We warm-start with the first-order approximation to $\lambda_i(\theta)$, where
$\theta', \lambda_i', e_i'$ are the previous values:</p>

\[\lambda_i(\theta) \approx \lambda_i' + e_i'^\top \delta \mathcal{H}\, e_i' =
\lambda_i' + e_i'^\top (\mathcal{H}(\theta) - \mathcal{H}(\theta'))\, e_i'\]
<p>These terms only rely on Hessian vector products!</p>
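<p>As a small illustration of why Hessian-vector products suffice, the sketch below estimates an extreme eigenpair without ever forming the Hessian, combining a finite-difference HVP with plain power iteration. Note that vanilla power iteration finds the eigenvalue of largest magnitude, so targeting negative eigenvalues specifically requires a shift or deflation, as in the Lanczos-style methods mentioned above; all names here are our own.</p>

```python
import numpy as np

def hvp(grad_fn, theta, v, eps=1e-5):
    """Hessian-vector product via central differences of the gradient:
    H(theta) v ~= (g(theta + eps*v) - g(theta - eps*v)) / (2*eps)."""
    return (grad_fn(theta + eps * v) - grad_fn(theta - eps * v)) / (2.0 * eps)

def power_iteration(grad_fn, theta, n_iter=100, seed=0):
    """Estimate the largest-magnitude Hessian eigenpair using only HVPs."""
    rng = np.random.default_rng(seed)
    v = rng.normal(size=theta.shape)
    v /= np.linalg.norm(v)
    lam = 0.0
    for _ in range(n_iter):
        hv = hvp(grad_fn, theta, v)
        lam = v @ hv                  # Rayleigh quotient estimate
        v = hv / np.linalg.norm(hv)
    return lam, v
```

<p>Each iteration costs two gradient evaluations, regardless of the parameter dimension, which is what makes spectrum estimation tractable for neural networks.</p>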
<blockquote>
<p>Q: This seems expensive! Don’t you need to evaluate hundreds or thousands of branches?</p>
</blockquote>
<p>A: We actually don’t. We show in the paper that symmetries lead to repeated
eigenvalues, which reduces the number of branches we need to explore.</p>
<p>A symmetry, $\phi$, of the loss function is a bijection on the parameter
space such that</p>
\[\mathcal{L}_\theta = \mathcal{L}_{\phi(\theta)}, \quad \mbox{for all} \quad \theta \in \Theta\]
<p>We show that in the presence of symmetries, the Hessian has repeated
eigenvalues. This means we only have to explore one from each set!</p>
<h1 id="rr-in-action-an-illustrative-example">RR in action: An illustrative example</h1>
<p style="text-align:center;">
<img src="https://bair.berkeley.edu/static/blog/ridge-rider/fig03.png" width="" />
<br />
</p>
<p>The figure above shows a 2D cost surface, where we begin in the middle and want
to reach the blue regions. SGD always gets stuck in the valleys corresponding
to the locally steepest descent direction, as shown by the circles. When
running RR, the first ridge also follows this direction, as we see in blue and
green. However, the second, orthogonal direction (brown and orange) avoids the
local optima and reaches the high-value regions.</p>
<h1 id="ridge-rider-for-exploration-in-reinforcement-learning">Ridge Rider for Exploration in Reinforcement Learning</h1>
<p style="text-align:center;">
<img src="https://bair.berkeley.edu/static/blog/ridge-rider/fig04.png" width="" />
<br />
</p>
<p>We tested RR in the tabular RL setting, where we sought to find diverse
solutions to a tree-based exploration task. We generated trees like the one
above, which has positive or negative rewards at the leaves. In this case we
see it is much easier to find the positive reward on the left, corresponding to
a policy which goes left at $s_1$ and left at $s_2$. To find the solution at
the bottom (going left from $s_6$) requires avoiding several negative rewards.</p>
<p>To rigorously evaluate RR, we generated 20 trees at each of four different
depths and ran the algorithm on each, comparing against gradient descent
starting either from random initializations or from the MIS, as well as random
vectors from the MIS. The results show that RR on average finds almost all the
solutions, while the other methods fail to find even half.</p>
<p style="text-align:center;">
<img src="https://bair.berkeley.edu/static/blog/ridge-rider/fig05.png" width="" />
<br />
</p>
<p>In the paper we include additional ablations, and a first foray into
sample-based RL. We encourage you to check it out.</p>
<h1 id="ridge-rider-for-supervised-learning">Ridge Rider for Supervised Learning</h1>
<p>We wanted to test the approximate RR algorithm in the simplest possible
setting, which naturally brought us to MNIST, the canonical ML dataset! We used
the approximate version to train a neural network with two 128-unit hidden
layers and, surprisingly, were able to reach 98% accuracy. This is clearly not
a new SoTA for computer vision, but we think it is a nice result that
demonstrates the potential scalability of our algorithm.</p>
<p>Interestingly, it seems the individual ridges correspond to learning different
features. In the next Figure, we show the performance for a classifier trained
by following each ridge individually.</p>
<p style="text-align:center;">
<img src="https://bair.berkeley.edu/static/blog/ridge-rider/fig06.png" width="" />
<br />
</p>
<p>As we see, the earlier ridges correspond to learning 0 and 1, while the later
ones learn to classify the digit 8.</p>
<p>This provides further evidence that the Hessian contains structure which may
relate to causal information about the problem. Next we further develop this by
looking at out-of-distribution generalization.</p>
<h1 id="ridge-rider-for-out-of-distribution-generalization">Ridge Rider for Out of Distribution Generalization</h1>
<p>We tested RR on the colored MNIST dataset, from [2]. Colored MNIST was
specifically designed to test causality, as each image is colored either red or
green in a way that correlates strongly (but spuriously) with the class label.
By construction, the label is more strongly correlated with the color than with
the digit. Any algorithm purely minimizing training error will tend to exploit
the color.</p>
<p>In the next Figure, we see that ERM (greedily optimizing the loss at training
time) massively overfits the problem, and does poorly at test time. By
contrast, RR achieves a respectable 58%, not too far from the 66% achieved by
the state-of-the-art causal approach.</p>
<p style="text-align:center;">
<img src="https://bair.berkeley.edu/static/blog/ridge-rider/fig07.png" width="" />
<br />
</p>
<h1 id="ridge-rider-for-zero-shot-co-ordination">Ridge Rider for Zero-Shot Co-ordination</h1>
<p>Finally, we consider the zero-shot coordination (ZSC) problem [3], in which the
goal is to find a training strategy that allows the two halves of independently
trained joint policies to coordinate successfully on their first encounter
(i.e. in <em>cross-play</em>). Zero-shot coordination is a great proxy for human-AI
coordination, since it naturally regularizes policies to those that an agent
can expect an <em>independently</em> trained agent to also discover, without prior
coordination. Crucially though, it is formulated such that it does not require
human data during the training process.</p>
<p>The challenge is that while it is easy to assess the failure of coordination
after training has finished, we are not allowed to directly optimize for
coordination success, since the policies need to be trained entirely
independently. Consequently, there currently is no general algorithm that
achieves perfect zero-shot coordination. Other-Play [3] is a step in this
direction, but requires that the symmetries of the underlying problem are known
ahead of time.</p>
<p>In contrast, since RR can take advantage of the underlying symmetries of an
optimization problem, it can naturally be used to solve ZSC, as we illustrate
with an adapted version of the lever game from [3], shown below:</p>
<p style="text-align:center;">
<img src="https://bair.berkeley.edu/static/blog/ridge-rider/fig08.png" width="" />
<br />
</p>
<p>Recall from before that symmetries lead to repeated eigenvalues at the MIS.
Clearly, shuffling all the 1.0 levers (and exchanging the two 0.8 levers)
leaves the game unchanged. Therefore, at the MIS, there is an entire
<em>eigenspace</em> for the eigenvalue associated with all of the 1.0 solutions.
Crucially, the <em>ordering</em> of the eigenvectors within this eigenspace is
arbitrary across different optimization runs. Each of these different
eigenvectors typically leads to an <em>equivalent</em> but <em>mutually
incompatible</em> solution.</p>
<p>In contrast, there is a unique eigenvalue associated with the 0.6 solution
which is ranked consistently across different optimization runs. This
illustrates how RR can be used for ZSC: We need to first run the RR algorithm
independently a large number of times, and then compute the <em>cross-play score</em>
across the different runs for all of the ridges. Finally, at test time we play
the strategy from the ridges with the highest average cross-play score.</p>
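This selection step can be sketched in a few lines. The following is an illustrative mock-up, not the paper's code: the lever payoffs and the per-run ridge orderings are assumed by hand to mirror the game above (a consistent first ridge on the unique 0.6 lever, arbitrary orderings within each repeated eigenspace).

```python
import numpy as np

# Hypothetical lever payoffs: several interchangeable 1.0 levers, two
# interchangeable 0.8 levers, and one unique 0.6 lever.
payoffs = np.array([1.0, 1.0, 1.0, 1.0, 0.8, 0.8, 0.6])

def cross_play(a, b):
    """Both players pull a lever; they only get paid if the levers match."""
    return payoffs[a] if a == b else 0.0

# Mocked ridge -> lever choices from independent RR runs.  The unique 0.6
# lever (index 6) is consistently the first ridge; levers within a repeated
# eigenspace come out in arbitrary order on each run.
rng = np.random.default_rng(0)
runs = []
for _ in range(20):
    ones = rng.permutation([0, 1, 2, 3])     # arbitrary order of 1.0 levers
    eights = rng.permutation([4, 5])         # arbitrary order of 0.8 levers
    runs.append([6] + list(eights) + list(ones))

n_ridges = len(runs[0])
scores = np.zeros(n_ridges)
for k in range(n_ridges):
    # Average cross-play of ridge k across all pairs of independent runs.
    pair_scores = [cross_play(r1[k], r2[k])
                   for i, r1 in enumerate(runs)
                   for j, r2 in enumerate(runs) if i != j]
    scores[k] = np.mean(pair_scores)

best = int(np.argmax(scores))   # ridge 0: the consistently-found 0.6 solution
```

Because the 0.6 lever is found on the same ridge in every run, its cross-play score stays at 0.6, while the arbitrarily ordered 1.0 and 0.8 ridges frequently mismatch and score far lower.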
<p>This process is illustrated below for the lever game, where indeed the 0.6
solution is consistently found with the first ridge across all runs and thus
leads to the highest cross-play score of 0.6. This happens to be the optimal
ZSC solution for the lever game. We also show the result for three independent
runs below for illustration purposes.</p>
<p style="text-align:center;">
<img src="https://bair.berkeley.edu/static/blog/ridge-rider/fig09.png" width="" />
<br />
</p>
<p>On the right, we see that the first ridge is always the same action, which
corresponds to the optimal zero-shot solution. The next two are a 50-50 bet on
the two 0.8 ridges. The remaining ridges are largely a jumbled-up mess,
corresponding to the symmetry-related levers. As a reminder, self-play will
almost certainly converge on one of the arbitrary 1.0 solutions, leading to
poor cross-play.</p>
<p>Now, you might be concerned that to find the MIS in the first place we needed
to know the symmetries of the problem, an assumption we have been working to
avoid. Once again, we got lucky: As we show in the paper, in RL problems the
MIS can be learned by simply minimizing the gradient norm while maximizing the
entropy. This will force the policy to place the same probability mass on all
equivalent actions, thus making it invariant under all symmetries, while
getting close to a saddle!</p>
<h1 id="summary-and-future-work">Summary and Future Work</h1>
<p>A gif speaks a thousand words:</p>
<p style="text-align:center;">
<img src="https://bair.berkeley.edu/static/blog/ridge-rider/RR_FutureWork.gif" width="" />
<br />
</p>
<p><strong>Paper</strong>:</p>
<ul>
<li><strong>Ridge Rider: Finding Diverse Solutions by Following Eigenvectors of the Hessian</strong>.<br />
Jack Parker-Holder, Luke Metz, Cinjon Resnick, Hengyuan Hu, Adam Lerer, Alistair Letcher, Alex Peysakhovich, Aldo Pacchiano, Jakob Foerster. NeurIPS 2020.<br />
<a href="https://proceedings.neurips.cc/paper/2020/file/08425b881bcde94a383cd258cea331be-Paper.pdf">Paper Link</a></li>
</ul>
<p><strong>Code</strong>:</p>
<ul>
<li>RL: <a href="https://bit.ly/2XvEmZy">https://bit.ly/2XvEmZy</a></li>
<li>ZS Co-ordination: <a href="https://bit.ly/308j2uQ">https://bit.ly/308j2uQ</a></li>
<li>OOD Generalization: <a href="https://bit.ly/3gWeFsH">https://bit.ly/3gWeFsH</a></li>
</ul>
<p><strong>References</strong>:</p>
<ul>
<li>[1] Robert Geirhos, Jörn-Henrik Jacobsen, Claudio Michaelis, Richard Zemel, Wieland Brendel, Matthias Bethge, Felix A. Wichmann (2020) <strong>Shortcut Learning in Deep Neural Networks.</strong></li>
<li>[2] Martin Arjovsky, Léon Bottou, Ishaan Gulrajani, David Lopez-Paz (2019) <strong>Invariant Risk Minimization.</strong> Arxiv pre-print.</li>
<li>[3] Hengyuan Hu, Adam Lerer, Alex Peysakhovich, and Jakob Foerster. <strong>“Other-Play” for zero-shot coordination.</strong> <em>International Conference on Machine Learning (ICML)</em>. 2020</li>
</ul>
Fri, 13 Nov 2020 01:00:00 -0800
https://bairblog.github.io/2020/11/13/ridge-rider/
https://bairblog.github.io/2020/11/13/ridge-rider/

Adapting on the Fly to Test Time Distribution Shift
<meta name="twitter:title" content="Adapting on the Fly to Test Time Distribution Shift" />
<meta name="twitter:card" content="summary_image" />
<meta name="twitter:image" content="https://bair.berkeley.edu/static/blog/arm/methods.png" />
<p>Imagine that you are building the next generation machine learning model for handwriting transcription. Based on previous iterations of your product, you have identified a key challenge for this rollout: after deployment, new end users often have different and unseen handwriting styles, leading to <em>distribution shift</em>. One solution for this challenge is to learn an <em>adaptive</em> model that can specialize and adjust to each user’s handwriting style over time. This solution seems promising, but it must be balanced against concerns about ease of use: requiring users to provide feedback to the model may be cumbersome and hinder adoption. Is it possible instead to learn a model that can adapt to new users <em>without labels</em>?</p>
<!--more-->
<p>In many scenarios, including this example, the answer is “yes”. Consider the ambiguous example shown enlarged in the figure below. Is this character a “2” with a loop or a <a href="https://en.wikipedia.org/wiki/A#English">double-storey “a”</a>? For a non-adaptive model that pays attention to the biases in the training data, the reasonable prediction would be “2”. However, even without labels, we can extract useful information from the user’s other examples: an adaptive model, for example, can observe that this user has written “2”s without loops and conclude that this character is thus more likely to be “a”.</p>
<p style="text-align:center;">
<img src="https://bair.berkeley.edu/static/blog/arm/intro.gif" width="" />
<br />
</p>
<p>Handling the distribution shift that arises from deploying a model to new users is an important motivating example for unlabeled adaptation. But, this is far from the only example. In an ever-changing world, autonomous cars need to adapt to new weather conditions and locations, image classifiers need to adapt to new cameras with different intrinsics, and recommender systems need to adapt to users’ evolving preferences. Humans have demonstrated the ability to <a href="http://pages.cs.wisc.edu/~jerryzhu/pub/tie.pdf">adapt without labels</a> by inferring information from the distribution of test examples. Can we develop methods that can allow machine learning models to do the same?</p>
<p>This question has enjoyed growing attention from researchers, with a number of recent works proposing methods for unlabeled test time adaptation. In this post, I will survey these works as well as other prominent frameworks for handling distribution shift. With this broader context in mind, I will then discuss our recent work (see the paper <a href="https://arxiv.org/abs/2007.02931">here</a> and the code <a href="https://github.com/henrikmarklund/arm">here</a>), in which we propose a problem formulation that we term <strong>adaptive risk minimization</strong>, or ARM.</p>
<h2 id="diving-into-distribution-shift">Diving into Distribution Shift</h2>
<p>The vast majority of work in machine learning follows the canonical framework of <strong>empirical risk minimization</strong>, or ERM. ERM methods assume that there is no distribution shift, so the test distribution exactly matches the training distribution. This assumption simplifies the development and analysis of powerful machine learning methods but, as discussed above, is routinely violated in real-world applications. To move beyond ERM and learn models that generalize in the face of distribution shift, we must introduce additional assumptions. However, we must carefully choose these assumptions such that they are still realistic and broadly applicable.</p>
<p>How do we maintain realism and applicability? One answer is to model the assumptions on the conditions that machine learning systems face in the real world. For example, in the ERM setting, models are evaluated on each test point one at a time, but in the real world, these test points are often available sequentially or in <em>batches</em>. For handwriting transcription, for example, we can imagine collecting entire sentences and paragraphs from new users. If there is distribution shift, observing multiple test points can be useful either to infer the test distribution or otherwise adapt the model to this new distribution, even in the absence of labels.</p>
<p>Many recent methods that use this assumption can be classified as <strong>test time adaptation</strong>, including <a href="https://arxiv.org/abs/1603.04779">batch normalization</a>, <a href="https://arxiv.org/abs/1802.03916">label shift estimation</a>, <a href="https://arxiv.org/abs/1909.13231">rotation prediction</a>, <a href="https://arxiv.org/abs/2006.10726">entropy minimization</a>, and more. Oftentimes, these methods build in strong inductive biases that enable useful adaptation; for example, rotation prediction is well aligned with many image classification tasks. But these methods generally either propose heuristic training procedures or do not consider the training procedure at all, relying instead on pretrained models.<sup id="fnref:pretrained" role="doc-noteref"><a href="#fn:pretrained" class="footnote">1</a></sup> This raises the question: can test time adaptation be further enhanced by improved training, such that the model can make better use of the adaptation procedure?</p>
<p>We can gain insight into this question by investigating other prominent frameworks for handling distribution shift and, in particular, the assumptions these frameworks make. In real-world applications, the training data generally does not consist only of input label pairs; instead, there are additional <em>meta-data</em> associated with each example, such as time and location, or the particular user in the handwriting example. These meta-data can be used to organize the training data into <em>groups</em>,<sup id="fnref:groups" role="doc-noteref"><a href="#fn:groups" class="footnote">2</a></sup> and a common assumption in a number of frameworks is that the test time distribution shifts represent either new group distributions or new groups altogether. This assumption still allows for a wide range of realistic distribution shifts and has driven the development of numerous practical methods.</p>
<p>For example, <strong>domain adaptation</strong> methods typically assume access to two training groups: source and target data, with the latter being drawn from the test distribution. Thus, these methods augment training to focus on the target distribution, such as through <a href="https://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.370.4921&rep=rep1&type=pdf">importance</a> <a href="http://sifaka.cs.uiuc.edu/czhai/pub/acl07.pdf">weighting</a> or learning <a href="https://arxiv.org/abs/1505.07818">invariant</a> <a href="https://arxiv.org/abs/1702.05464">representations</a>. Methods for <a href="http://papers.neurips.cc/paper/3019-mixture-regression-for-covariate-shift.pdf"><strong>group</strong></a> <a href="https://arxiv.org/abs/1611.02041"><strong>distributionally robust</strong></a> <a href="https://arxiv.org/abs/1911.08731"><strong>optimization</strong></a> and <a href="https://papers.nips.cc/paper/4312-generalizing-from-several-related-classification-tasks-to-a-new-unlabeled-sample"><strong>domain</strong></a> <a href="https://arxiv.org/abs/2007.01434"><strong>generalization</strong></a> do not directly assume access to data from the test distribution, but instead use data drawn from multiple training groups in order to learn a model that generalizes at test time to new groups (or new group distributions). So, these prior works have largely focused on the training procedure and generally do not adapt at test time (despite the name “domain adaptation”).</p>
<h2 id="combining-training-and-test-assumptions">Combining Training and Test Assumptions</h2>
<p>Prior frameworks for distribution shift have assumed either training groups or test batches, but we are not aware of any prior work that uses both assumptions. In our work, we demonstrate that it is precisely this conjunction that allows us to <em>learn to adapt</em> to test time distribution shift, by simulating both the shift and the adaptation procedure at training time. In this way, our framework can be understood as a <strong>meta-learning</strong> framework, and we refer interested readers to this <a href="https://bair.berkeley.edu/blog/2017/07/18/learning-to-learn/">blog post</a> for a detailed overview of meta-learning.</p>
<h3 id="adaptive-risk-minimization">Adaptive Risk Minimization</h3>
<p>Our work proposes <a href="https://arxiv.org/abs/2007.02931">adaptive risk minimization</a>, or ARM, which is a problem setting and objective that makes use of both groups at training time and batches at test time. This synthesis provides a general and principled answer, through the lens of meta-learning, to the question of how to train for test time adaptation. In particular, we <em>meta-train</em> the model using simulated distribution shifts, which is enabled by the training groups, such that it exhibits strong <em>post-adaptation</em> performance on each shift. The model therefore directly learns how to best leverage the adaptation procedure, which it then executes in the exact same way at test time. If we can identify which test distribution shifts are likely, such as seeing data from new end users, then we can better construct simulated training shifts, such as sampling data from only one particular training user.</p>
<p style="text-align:center;">
<img src="https://bair.berkeley.edu/static/blog/arm/arm.gif" width="" />
<br />
</p>
<p>The training procedure for optimizing the ARM objective is illustrated in the graphic above. From the training data, we sample different batches that simulate different group distribution shifts. An <em>adaptation model</em> then has the opportunity to adapt the model parameters using the unlabeled examples. This allows us to meta-train the model for post-adaptation performance by directly performing gradient updates on both the model and the adaptation model.</p>
<p style="text-align:center;">
<img src="https://bair.berkeley.edu/static/blog/arm/methods.png" width="" />
<br />
<i>
We draw inspiration from contextual meta-learning (left) and gradient based meta-learning (right) in order to devise methods for ARM. For contextual meta-learning, we investigate two different methods that fall under this category. These methods are described in detail in <a href="https://arxiv.org/abs/2007.02931">our paper</a>.
</i>
</p>
<p>The connection to meta-learning is one key advantage of the ARM framework, as we are not starting from scratch when devising methods for solving ARM. In our work in particular, we draw inspiration from both <a href="https://arxiv.org/abs/1807.01613">contextual meta-learning</a> and <a href="https://arxiv.org/abs/1703.03400">gradient based meta-learning</a> to develop three methods for solving ARM, which we name ARM-CML, ARM-BN, and ARM-LL. We omit the details of these methods here, but they are illustrated in the figure above and described in full in <a href="https://arxiv.org/abs/2007.02931">our paper</a>.</p>
<p>The diversity of methods that we construct demonstrates the versatility and generality of the ARM problem formulation. But do we actually observe empirical gains using these methods? We investigate this question next.</p>
<h3 id="experiments">Experiments</h3>
<p>In our experiments, we first conducted a thorough study of the proposed ARM methods compared to various baselines, prior methods, and ablations, on four different image classification benchmarks exhibiting group distribution shift. <a href="https://arxiv.org/abs/2007.02931">Our paper</a> provides full details on the benchmarks and comparisons.</p>
<p style="text-align:center;">
<img src="https://bair.berkeley.edu/static/blog/arm/results.png" width="" />
<br />
<i>
We found that ARM methods empirically resulted in both better worst case (WC) and average (Avg) performance across groups compared to prior methods, indicating both better robustness and performance from the final trained models.
</i>
</p>
<p>In our main study, we found that ARM methods do better across the board both in terms of worst case and average test performance across groups, compared to a number of prior methods along with other baselines and ablations. The simplest method of ARM-BN, which can be implemented in just a few lines of additional code, often performed the best. This empirically shows the benefits of meta-learning, in that the model can be meta-trained to take greater advantage of the adaptation procedure.</p>
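To illustrate why batch-statistics adaptation helps, here is a self-contained toy of our own construction, not the paper's ARM-BN implementation: groups differ by a per-group feature shift, and normalizing each batch with its own statistics (the ARM-BN idea of using the current batch's statistics at both training and test time) removes the shift, so a single linear rule transfers to an unseen group. The linear weights are hand-set for brevity rather than meta-trained.

```python
import numpy as np

rng = np.random.default_rng(0)

def make_group(shift, n=200):
    """Two classes separated along x[0]; the whole group is shifted by `shift`."""
    y = rng.integers(0, 2, n)
    x = rng.normal(0.0, 0.5, (n, 2))
    x[:, 0] += 2.0 * y - 1.0          # class signal
    return x + shift, y

def adapt_normalize(x):
    """ARM-BN-style adaptation: normalize with the *current batch's* statistics."""
    return (x - x.mean(axis=0)) / (x.std(axis=0) + 1e-8)

# A linear rule along the signal dimension; hand-set stand-in for the
# classifier that training on batch-normalized group data would produce.
w = np.array([1.0, 0.0])

# Unseen test group with a large distribution shift.
x_test, y_test = make_group(shift=np.array([5.0, -3.0]))

pred_adapted = (adapt_normalize(x_test) @ w > 0).astype(int)
pred_raw = (x_test @ w > 0).astype(int)

acc_adapted = (pred_adapted == y_test).mean()
acc_raw = (pred_raw == y_test).mean()
```

Without adaptation the shifted group lands entirely on one side of the decision boundary, so accuracy collapses to chance; normalizing with the test batch's own statistics restores the class separation.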
<p style="text-align:center;">
<img src="https://bair.berkeley.edu/static/blog/arm/femnist.gif" width="60%" />
<br />
</p>
<p>We also conducted some qualitative analyses, in which we investigated a test situation similar to the motivating example described at the beginning with a user that wrote double-storey a’s. We empirically found that models trained with ARM methods did in fact successfully adapt and predict “a” in this situation, when given enough examples of the user’s handwriting that included other “a”s and “2”s. Thus, this confirms our original hypothesis that training adaptive models is an effective way to deal with distribution shift.</p>
<p>We believe that the motivating example from the beginning as well as the empirical results in our paper convincingly argue for further study into general techniques for <em>adaptive models</em>. We have presented a general scheme for meta-training these models to better harness their adaptation capabilities, but a number of open questions remain, such as devising better adaptation procedures themselves. This broad research direction will be crucial for machine learning models to truly realize their potential in complex, real-world environments.</p>
<hr />
<p>Thanks to Chelsea Finn and Sergey Levine for providing valuable feedback on this post.</p>
<p>Part of this post is based on the following paper:</p>
<ul>
<li>Marvin Zhang*, Henrik Marklund*, Nikita Dhawan*, Abhishek Gupta, Sergey Levine, Chelsea Finn.<br />
<a href="https://arxiv.org/abs/2007.02931"><strong>Adaptive Risk Minimization: A Meta-Learning Approach for Tackling Group Shift.</strong></a><br />
<a href="https://sites.google.com/view/adaptive-risk-minimization">Project webpage</a><br />
<a href="https://github.com/henrikmarklund/arm">Open source code</a></li>
</ul>
<hr />
<div class="footnotes" role="doc-endnotes">
<ol>
<li id="fn:pretrained" role="doc-endnote">
<p>On the flip side, applicability to even pretrained models can be seen as a strength of these methods. <a href="#fnref:pretrained" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
<li id="fn:groups" role="doc-endnote">
<p>Alternatively referred to as domains, subpopulations, tasks, users, and more. <a href="#fnref:groups" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
</ol>
</div>
Thu, 05 Nov 2020 01:00:00 -0800
https://bairblog.github.io/2020/11/05/arm/
https://bairblog.github.io/2020/11/05/arm/

Reinforcement learning is supervised learning on optimized data
<meta name="twitter:title" content="Reinforcement learning is supervised learning on optimized data" />
<meta name="twitter:card" content="summary_image" />
<meta name="twitter:image" content="https://bair.berkeley.edu/static/blog/supervised_rl/hipi.png" />
<p>The two most common perspectives on reinforcement learning (RL) are <strong>optimization</strong> and <strong>dynamic programming</strong>. Methods that compute gradients of the non-differentiable expected-reward objective, such as the REINFORCE trick, are commonly grouped into the optimization perspective, whereas methods that employ TD-learning or Q-learning are dynamic programming methods. While these methods have shown considerable success in recent years, they remain quite challenging to apply to new problems. In contrast, deep supervised learning has been extremely successful, and we may hence ask: <em>Can we use supervised learning to perform RL?</em></p>
<p>In this blog post we discuss a mental model for RL, based on the idea that RL can be viewed as doing supervised learning on the “good data”. What makes RL challenging is that, unless you’re doing imitation learning, actually acquiring that “good data” is quite challenging. Therefore, RL might be viewed as a <em>joint optimization</em> problem over both the policy and the data. Seen from this <strong>supervised learning</strong> perspective, many RL algorithms can be viewed as alternating between finding good data and doing supervised learning on that data. It turns out that finding “good data” is much easier in the multi-task setting, or settings that can be converted to a different problem for which obtaining “good data” is easy. In fact, we will discuss how techniques such as hindsight relabeling and inverse RL can be viewed as optimizing data.</p>
<!--more-->
<p>We’ll start by reviewing the two common perspectives on RL, optimization and dynamic programming. We’ll then delve into a formal definition of the supervised learning perspective on RL.</p>
<h2 id="common-perspectives-on-rl">Common Perspectives on RL</h2>
<p>In this section, we will describe the two predominant perspectives on RL.</p>
<h3 id="optimization-perspective">Optimization Perspective</h3>
<p>The optimization perspective views RL as a special case of optimizing non-differentiable functions. Recall that the expected reward is a function of the parameters $\theta$ of a policy $\pi_\theta$:</p>
\[J(\theta) = \mathbb{E}_{a_t \sim \pi_\theta(a_t \mid s_t),\, s_{t+1} \sim p(s_{t+1} \mid s_t, a_t)} \left[\sum_t \gamma^t r(s_t, a_t) \right].\]
<p>This function is complex, usually unknown, and non-differentiable, as it depends on both the actions chosen by the policy and the dynamics of the environment. While we can estimate the gradient using the REINFORCE trick, this gradient depends on the policy parameters and on-policy data, which is generated from the simulator by running the current policy.</p>
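As a concrete illustration (our toy, not from the post), the REINFORCE estimator on a two-armed bandit fits in a few lines of numpy: for a softmax policy, the gradient of $\log \pi_\theta(a)$ with respect to the logits is the one-hot action minus the action probabilities.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

# Two-armed bandit: arm 1 pays 1.0, arm 0 pays nothing.
rewards = np.array([0.0, 1.0])
theta = np.zeros(2)               # policy logits

for _ in range(500):
    probs = softmax(theta)
    a = rng.choice(2, p=probs)
    r = rewards[a]
    # Score-function (REINFORCE) estimate: grad log pi(a) times the reward.
    grad_log_pi = -probs
    grad_log_pi[a] += 1.0
    theta += 0.5 * r * grad_log_pi

final_probs = softmax(theta)      # mass concentrates on the rewarding arm
```

Note that every update uses an action sampled from the current policy, which is exactly the on-policy-data requirement discussed above.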
<h3 id="dynamic-programming-perspective">Dynamic Programming Perspective</h3>
<p>The dynamic programming perspective says that optimal control is a problem of choosing the right action at each step. In discrete settings with known dynamics, we can solve this dynamic programming problem exactly. For example, Q-learning estimates the state-action values, $Q(s, a)$ by iterating the following updates:</p>
\[Q(s, a) \gets r(s, a) + \gamma \, \mathbb{E}_{s' \sim p(s' \mid s, a)}\left[\max_{a'} Q(s', a')\right].\]
<p>In continuous spaces or settings with large state and action spaces, we can <em>approximate</em> dynamic programming by representing the Q-function with a function approximator (e.g., a neural network) and minimizing the TD error, the squared difference between the LHS and the RHS of the equation above:</p>
\[TD(\theta) = \frac{1}{2}(Q_\theta(s, a) - y(s, a))^2,\]
<p>where the <em>TD target</em> $y(s, a) = r(s, a) + \gamma \max_{a'} Q_\theta(s', a')$. Note that this is a loss function for the Q-function, instead of the policy.</p>
<p>This approach allows us to use any kind of data to optimize the Q-function, removing the need for “good” data. However, it suffers from major optimization issues: it can diverge or converge to poor solutions, and it can be hard to apply to new problems.
<!-- Unlike the optimization perspective, the dynamic programming perspective is straightforward to apply to data collected from a different policy. However, a major limitation is that it's unclear whether reducing the TD error actually corresponds to increasing reward. In reality, the TD error is often a poor proxy for expected reward. For example, during training, the TD error often *increases* when the expected reward increases. As another example, though residual TD learning (i.e., taking gradients through the TD target) [^Baird95] is much more effective at minimizing the TD error, it results in substantially worse policies. --></p>
<h2 id="supervised-learning-perspective">Supervised Learning Perspective</h2>
<p>We now discuss another mental model for RL. The main idea is to view RL as a <em>joint</em> optimization problem over the policy and experience: we simultaneously want to find both “good data” and a “good policy.” Intuitively, we expect that “good” data will (1) get high reward, (2) sufficiently explore the environment, and (3) be at least somewhat representative of our policy. We define a good policy as simply a policy that is likely to produce good data.</p>
<p style="text-align:center;">
<img src="https://bair.berkeley.edu/static/blog/supervised_rl/supervised_perspective.png" width="90%" />
<br />
<i>Figure 1: Many old and new reinforcement learning algorithms can be viewed as doing
behavior cloning (a.k.a. supervised learning) on optimized data. This blog post
discusses recent work that extends this idea to the multi-task perspective,
where it actually becomes <em>easier</em> to optimize data.</i>
</p>
<p>Converting “good data” into a “good policy” is easy: just do supervised learning! The reverse direction, converting a “good policy” into “good data” is slightly more challenging, and we’ll discuss a few approaches in the next section. It turns out that in the multi-task setting or by artificially modifying the problem definition slightly, converting a “good policy” into “good data” is substantially easier. The penultimate section will discuss how goal relabeling, a modified problem definition, and inverse RL extract “good data” in the multi-task setting.</p>
<h4 id="an-rl-objective-that-decouples-the-policy-from-data">An RL objective that decouples the policy from data</h4>
<p>We now formalize the supervised learning perspective using the lens of expectation maximization, a lens used in many prior works [<a href="http://www.cs.toronto.edu/~fritz/absps/dh97.pdf">Dayan 1997</a>, <a href="http://is.tuebingen.mpg.de/fileadmin/user_upload/files/publications/ICML2007-Peters_4493[0].pdf">Williams 2007</a>, <a href="https://www.ias.informatik.tu-darmstadt.de/uploads/Team/JanPeters/Peters2010_REPS.pdf">Peters 2010</a>, <a href="http://eprints.lincoln.ac.uk/25793/1/441_icmlpaper.pdf">Neumann 2011</a>, <a href="https://papers.nips.cc/paper/5178-variational-policy-search-via-trajectory-optimization.pdf">Levine 2013</a>]. To simplify notation, we will use $\pi_\theta(\tau)$ as the probability that policy $\pi_\theta$ produces trajectory $\tau$, and will use $q(\tau)$ to denote the data distribution that we will optimize. Consider the log of the expected reward objective, $\log J(\theta)$. Since the logarithm is monotonically increasing, maximizing this is equivalent to maximizing the expected reward. We then rewrite the expectation under $q(\tau)$ via importance sampling and apply Jensen’s inequality to move the logarithm inside the expectation:</p>
\[\begin{aligned}
\log J(\theta) &= \log \mathbb{E}_{\pi_\theta(\tau)} \left[R(\tau) \right] \\
& \ge \mathbb{E}_{q(\tau)} \left[ \log R(\tau) + \log \pi_\theta(\tau) - \log q(\tau) \right] := F(\theta, q)
\end{aligned}\]
<p>What’s useful about this lower bound is that it allows us to optimize a policy using data sampled from a different policy. This lower bound makes explicit the fact that RL is a joint optimization problem over the policy and experience. The table below compares the supervised learning perspective to the optimization and dynamic programming perspectives:</p>
<table>
<thead>
<tr>
<th> </th>
<th>Optimization Perspective</th>
<th>Dynamic Programming Perspective</th>
<th>Supervised Learning Perspective</th>
</tr>
</thead>
<tbody>
<tr>
<td>What are we optimizing?</td>
<td>policy ($\pi_\theta$)</td>
<td>Q-function ($Q_\theta$)</td>
<td>policy ($\pi_\theta$) and data ($q(\tau)$)</td>
</tr>
<tr>
<td>Loss</td>
<td>Surrogate loss <br /> $\tilde{L}(\theta, \tau \sim \pi_\theta)$</td>
<td>TD error</td>
<td>Lower bound <br /> $F(\theta, q)$</td>
</tr>
<tr>
<td>Data used in loss</td>
<td>collected from current policy</td>
<td>arbitrary</td>
<td>optimized data</td>
</tr>
</tbody>
</table>
<p>Finding good data and a good policy correspond to optimizing the lower bound, $F(\theta, q)$, with respect to the policy parameters and the experience. One common approach for maximizing the lower bound is to perform coordinate ascent on its arguments, alternating between optimizing the data distribution and the policy.<sup id="fnref:1" role="doc-noteref"><a href="#fn:1" class="footnote">1</a></sup></p>
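As a sanity check, the lower bound can be verified numerically on a toy problem with a small discrete trajectory space. All numbers below are illustrative:

```python
import math

# Hypothetical toy setup: three possible trajectories with fixed rewards.
rewards = [1.0, 2.0, 4.0]          # R(tau) > 0 so log R(tau) is defined
pi = [0.5, 0.3, 0.2]               # current policy's trajectory distribution
q = [0.2, 0.3, 0.5]                # a different data distribution q(tau)

# Exact objective: log J(theta) = log E_pi[R(tau)]
log_J = math.log(sum(p * r for p, r in zip(pi, rewards)))

# Lower bound: F(theta, q) = E_q[log R(tau) + log pi(tau) - log q(tau)]
F = sum(qt * (math.log(r) + math.log(pt) - math.log(qt))
        for qt, pt, r in zip(q, pi, rewards))

# Jensen's inequality guarantees F <= log J for any choice of q;
# the bound is tight when q(tau) is proportional to pi(tau) * R(tau).
```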
<h4 id="optimizing-the-policy">Optimizing the Policy</h4>
<p>When optimizing the lower bound with respect to the policy, the objective is (up to a constant) exactly equivalent to supervised learning (a.k.a. behavior cloning)!</p>
\[\max_\theta F(\theta, q) = \max_\theta \mathbb{E}_{\tau \sim q(\tau)} \left[\sum_{s_t, a_t \in \tau} \log \pi_\theta(a_t \mid s_t) \right] + \text{const.}\]
<p>This observation is exciting because supervised learning is generally much more stable than RL algorithms<sup id="fnref:stable" role="doc-noteref"><a href="#fn:stable" class="footnote">2</a></sup>. Moreover, this observation suggests that prior RL methods that use supervised learning as a subroutine [<a href="https://arxiv.org/pdf/1806.05635">Oh 2018</a>, <a href="http://papers.nips.cc/paper/9667-goal-conditioned-imitation-learning.pdf">Ding 2019</a>] might actually be optimizing a lower bound on expected reward.</p>
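For discrete states and actions, this policy update is just maximum-likelihood behavior cloning: the empirical action frequencies in the dataset. A minimal sketch with hypothetical state and action names:

```python
from collections import Counter, defaultdict

# Hypothetical dataset of (state, action) pairs sampled from the
# optimized data distribution q(tau).
data = [("s0", "a1"), ("s0", "a1"), ("s0", "a0"), ("s1", "a0")]

# Maximizing F w.r.t. theta is maximum likelihood on this data; in the
# tabular case, pi(a|s) is the empirical action frequency in each state.
counts = defaultdict(Counter)
for s, a in data:
    counts[s][a] += 1

pi = {s: {a: c / sum(ctr.values()) for a, c in ctr.items()}
      for s, ctr in counts.items()}
# e.g. pi["s0"]["a1"] is 2/3: a1 was taken in 2 of the 3 visits to s0
```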
<h4 id="optimizing-the-data-distribution">Optimizing the Data Distribution</h4>
<p>The objective for the data distribution is to maximize reward while not deviating too far from the current policy.</p>
\[\max_q F(\theta, q) = \max_q \mathbb{E}_{q(\tau)} \left[ \log R(\tau) \right] - D_\text{KL}\left(q(\tau) \; \| \; \pi(\tau) \right).\]
<p>The KL constraint above makes the optimization of the data distribution conservative, preferring to stay close to the current policy at the cost of slightly lower reward. Optimizing the expected <em>log</em> reward, rather than the expected reward, further makes this optimization problem risk-averse (the $\log(\cdot)$ function is a concave utility function [Ingersoll 2019]).</p>
<p>There are a number of ways we might optimize the data distribution. One straightforward (if inefficient) strategy is to collect experience with a noisy version of the current policy and keep the 10% of experience that receives the highest reward [<a href="https://arxiv.org/pdf/1806.05635">Oh 2018</a>]. An alternative is trajectory optimization, optimizing the states along a single trajectory [<a href="http://eprints.lincoln.ac.uk/25793/1/441_icmlpaper.pdf">Neumann 2011</a>, <a href="https://papers.nips.cc/paper/5178-variational-policy-search-via-trajectory-optimization.pdf">Levine 2013</a>]. A third approach is to <em>not</em> collect more data, but rather reweight previously collected trajectories by their reward [<a href="http://www.cs.toronto.edu/~fritz/absps/dh97.pdf">Dayan 1997</a>]. Moreover, the data distribution $q(\tau)$ can be represented in multiple ways: as a non-parametric discrete distribution over previously-observed trajectories [<a href="https://arxiv.org/pdf/1806.05635">Oh 2018</a>], as a factored distribution over individual state-action pairs [<a href="http://eprints.lincoln.ac.uk/25793/1/441_icmlpaper.pdf">Neumann 2011</a>, <a href="https://papers.nips.cc/paper/5178-variational-policy-search-via-trajectory-optimization.pdf">Levine 2013</a>], or as a semi-parametric model that extends observed experience with extra hallucinated experience generated from a parametric model [<a href="https://arxiv.org/pdf/1912.13464">Kumar 2019</a>].</p>
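The first strategy above can be sketched in a few lines. The rollout function, reward, and elite fraction here are all toy stand-ins, not any particular paper's setup:

```python
import random

random.seed(0)

# One straightforward (if inefficient) way to optimize q(tau): roll out
# a noisy version of the current policy and keep only the top 10% of
# trajectories by reward.
def rollout(noise=1.0):
    # Hypothetical noisy rollout: 5 perturbed actions around a mean of 0.5.
    actions = [0.5 + random.gauss(0, noise) for _ in range(5)]
    reward = -sum(a * a for a in actions)   # toy reward: stay near zero
    return actions, reward

trajectories = [rollout() for _ in range(100)]
k = max(1, len(trajectories) // 10)         # elite fraction: top 10%
elite = sorted(trajectories, key=lambda tr: tr[1], reverse=True)[:k]
# `elite` is the non-parametric optimized data distribution q(tau)
```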
<h4 id="viewing-prior-work-through-the-lens-of-supervised-learning">Viewing Prior Work through the Lens of Supervised Learning</h4>
<p>A number of algorithms perform these steps in disguise. For example, reward-weighted regression [<a href="http://is.tuebingen.mpg.de/fileadmin/user_upload/files/publications/ICML2007-Peters_4493[0].pdf">Williams 2007</a>] and advantage-weighted regression [<a href="http://papers.nips.cc/paper/3501-fitted-q-iteration-by-advantage-weighted-regression.pdf">Neumann 2009</a>, <a href="https://arxiv.org/pdf/1910.00177">Peng 2019</a>] combine the two steps by doing behavior cloning on reward-weighted data. Self-imitation learning [<a href="https://arxiv.org/pdf/1806.05635">Oh 2018</a>] forms the data distribution by ranking observed trajectories according to their reward and choosing a uniform distribution over the top-k. MPO [<a href="https://arxiv.org/pdf/1806.06920">Abdolmaleki 2018</a>] constructs a dataset by sampling actions from the policy, reweights those actions that are expected to lead to high reward (i.e., have high reward plus value), and then performs behavior cloning on those reweighted actions.</p>
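The reward-weighting idea these methods share can be sketched as follows, assuming an exponential weighting with temperature eta. The weighting form and names here are illustrative, not any one paper's exact algorithm:

```python
import math

# Reward-weighted regression sketch (assumed form): weight each observed
# (state, action) pair by exp(R / eta), then do weighted behavior cloning.
eta = 1.0
batch = [("s0", "a0", 1.0), ("s0", "a1", 3.0)]   # (state, action, return)

weights = [math.exp(R / eta) for _, _, R in batch]
Z = sum(weights)
weights = [w / Z for w in weights]
# Weighted maximum likelihood then fits pi(a|s) to these samples;
# higher-return actions dominate the fit.
```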
<h3 id="multi-task-versions-of-the-supervised-learning-perspective">Multi-Task Versions of the Supervised Learning Perspective</h3>
<p style="text-align:center;">
<img src="https://bair.berkeley.edu/static/blog/supervised_rl/hipi.png" width="90%" />
<br />
<i>Figure 2: A number of recent multi-task RL algorithms organize experience
based on what task each piece of experience solved. This process of post-hoc
organization is closely related to hindsight relabeling and inverse RL, and lies
at the core of recent multi-task RL algorithms that are based on supervised
learning.</i>
</p>
<p>A number of recent algorithms can be viewed as reincarnations of this idea, with a twist. The twist is that finding good data becomes much easier in the multi-task setting. These works typically either operate directly in a multi-task setting or modify the single-task setting to look like one. As we increase the number of tasks, all experience becomes optimal for some task. We now view three recent papers through this lens:</p>
<p><strong>Goal-conditioned imitation learning</strong>:[<a href="https://arxiv.org/pdf/1803.00653">Savinov 2018</a>, <a href="https://arxiv.org/pdf/1912.06088.pdf">Ghosh 2019</a>, <a href="http://papers.nips.cc/paper/9667-goal-conditioned-imitation-learning.pdf">Ding 2019</a>, <a href="http://proceedings.mlr.press/v100/lynch20a/lynch20a.pdf">Lynch 2020</a>] In a goal-reaching task, the data distribution consists of the states and actions as well as the attempted goal. Since a robot’s failure to reach a commanded goal is nonetheless a success for reaching the goal it actually reached, we can optimize the data distribution by replacing the originally commanded goals with the goals actually reached. Thus, the hindsight relabeling performed by goal-conditioned imitation learning [<a href="https://arxiv.org/pdf/1803.00653">Savinov 2018</a>, <a href="https://arxiv.org/pdf/1912.06088.pdf">Ghosh 2019</a>, <a href="http://papers.nips.cc/paper/9667-goal-conditioned-imitation-learning.pdf">Ding 2019</a>, <a href="http://proceedings.mlr.press/v100/lynch20a/lynch20a.pdf">Lynch 2020</a>] and hindsight experience replay [<a href="https://papers.nips.cc/paper/7090-hindsight-experience-replay.pdf">Andrychowicz 2017</a>] can be viewed as optimizing a non-parametric data distribution. Moreover, goal-conditioned imitation can be viewed as simply doing supervised learning (a.k.a. behavior cloning) on optimized data. Interestingly, when this goal-conditioned imitation procedure with relabeling is repeated iteratively, it can be shown to be a convergent procedure for learning policies from scratch, even if no expert data is provided at all! [<a href="https://arxiv.org/pdf/1912.06088.pdf">Ghosh 2019</a>] This is particularly promising because it provides us with a technique for off-policy RL that does not require bootstrapping or value function learning, significantly simplifying the algorithm and the tuning process.</p>
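The relabeling step at the core of this idea can be sketched in a few lines, with hypothetical state and goal names:

```python
# Hindsight goal relabeling sketch: replace the commanded goal with the
# state the trajectory actually reached, turning a failed attempt into
# optimal supervision for the goal it did reach.
def relabel(trajectory, commanded_goal):
    achieved_goal = trajectory[-1]   # final state actually reached
    # The commanded goal is discarded; the relabeled example pairs the
    # trajectory with the goal for which it *was* successful.
    return {"states": trajectory, "goal": achieved_goal}

traj = ["s0", "s1", "s2"]
example = relabel(traj, commanded_goal="g_far")
# Behavior cloning on such relabeled examples is goal-conditioned imitation.
```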
<p><strong>Reward-Conditioned Policies</strong>:[<a href="https://arxiv.org/pdf/1912.13465">Kumar 2019</a>, <a href="https://arxiv.org/pdf/1912.02877">Srivastava 2019</a>] Interestingly, we can extend the insight discussed above to single-task RL if we view non-expert trajectories collected from sub-optimal policies as optimal supervision for some family of tasks. These sub-optimal trajectories may not maximize reward, but they are optimal for matching the reward of the given trajectory. Thus, we can condition the policy on a desired value of long-term reward (i.e., the return) and follow a strategy similar to goal-conditioned imitation learning: execute rollouts using this reward-conditioned policy by commanding a desired return, relabel the commanded returns with the observed returns (which gives us optimized data non-parametrically), and finally run supervised learning on this optimized data. We show [<a href="https://arxiv.org/pdf/1912.13465">Kumar 2019</a>] that by simply optimizing the data in a non-parametric fashion via simple re-weighting schemes, we obtain an RL method that is guaranteed to converge to the optimal policy and is simpler than most RL methods in that it does not require parametric return estimators, which can be hard to tune.</p>
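The return-relabeling step can be sketched analogously to goal relabeling, with hypothetical names throughout:

```python
# Reward-conditioned policy sketch: relabel the commanded return with the
# return the trajectory actually achieved, then behavior-clone on
# (state, observed_return) -> action pairs.
def relabel_returns(trajectory, rewards, commanded_return):
    observed_return = sum(rewards)
    # commanded_return is discarded: the trajectory is optimal supervision
    # for achieving the return it actually obtained.
    return [(s, observed_return, a) for s, a in trajectory]

traj = [("s0", "a0"), ("s1", "a1")]
dataset = relabel_returns(traj, rewards=[1.0, 2.0], commanded_return=10.0)
# Every tuple is now labeled with the return actually obtained (3.0).
```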
<p><strong>Hindsight Inference for Policy Improvement</strong>:[<a href="https://arxiv.org/abs/2002.11089">Eysenbach 2020</a>] While the connections between goal-reaching algorithms and dataset optimization are neat, until recently it was unclear how to apply similar ideas to more general multi-task settings, such as a discrete set of reward functions or sets of rewards defined by varying (linear) combinations of bonus and penalty terms. To resolve this open question, we started with the intuition that optimizing the data distribution corresponds to answering the following question: “if you assume that your experience was optimal, what tasks were you trying to solve?” Intriguingly, this is precisely the question that <em>inverse RL</em> answers. This suggests that we can simply use inverse RL to relabel data in <em>arbitrary</em> multi-task settings: inverse RL provides a theoretically grounded mechanism for sharing experience across tasks. This result is exciting for two reasons:</p>
<ol>
<li>This result tells us how to apply similar relabeling ideas to more general multi-task settings. Our experiments showed that relabeling experience using inverse RL accelerates learning across a wide range of multi-task settings, and even outperformed prior goal-relabelling methods on goal-reaching tasks.</li>
<li>It turns out that relabeling with the goal actually reached is exactly equivalent to doing inverse RL with a certain sparse reward function. This result allows us to interpret previous goal-relabeling techniques as inverse RL, thus providing a stronger theoretical foundation for these methods.</li>
</ol>
<h2 id="future-directions">Future Directions</h2>
<p>In this article, we discussed how RL can be viewed as solving a sequence of standard supervised learning problems on optimized (relabeled) data. The success of deep supervised learning over the past decade suggests that such approaches to RL may be easier to use in practice. While the progress so far is promising, several questions remain open. Firstly, what could be other (better) ways of obtaining optimized data? Does re-weighting or recombining existing experience induce bias in the learning process? How should the RL algorithm explore to obtain better data? Methods and analyses that make progress on this front are likely to also provide insights for algorithms derived from alternate perspectives on RL. Secondly, these methods might provide an easy way to carry over practical techniques as well as theoretical analyses from deep learning to RL, which are otherwise hard to transfer due to non-convex objectives (e.g., policy gradients) or a mismatch between the optimization and test-time objectives (e.g., Bellman error and policy return). We are excited about several prospects these methods offer: improved practical RL algorithms, improved understanding of RL methods, and more.</p>
<hr />
<p>We thank Allen Zhu, Shreyas Chaudhari, Sergey Levine, and Daniel Seita for feedback on this
post.</p>
<p>This post is based on the following papers:</p>
<ul>
<li>Ghosh, D., Gupta, A., Fu, J., Reddy, A., Devin, C., Eysenbach, B., & Levine,
S. (2019). Learning to Reach Goals via Iterated Supervised Learning
<a href="https://arxiv.org/abs/1912.06088">arXiv:1912.06088</a>.</li>
<li>Eysenbach, B., Geng, X., Levine, S., & Salakhutdinov, R. (2020). Rewriting
History with Inverse RL: Hindsight Inference for Policy Improvement. <a href="https://arxiv.org/abs/2002.11089">NeurIPS 2020 (oral)</a>.</li>
<li>Kumar, A., Peng, X. B., & Levine, S. (2019). Reward-Conditioned Policies.
<a href="https://arxiv.org/abs/1912.13465">arXiv:1912.13465</a>.</li>
</ul>
<hr />
<div class="footnotes" role="doc-endnotes">
<ol>
<li id="fn:1" role="doc-endnote">
<p>Our lower bound is technically an <em>evidence lower bound</em>, so coordinate ascent on it is equivalent to expectation maximization. <a href="#fnref:1" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
<li id="fn:stable" role="doc-endnote">
<p>While supervised learning is generally more stable than RL, <em>iterated</em> supervised learning may be less stable than supervised learning on a fixed dataset. <a href="#fnref:stable" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
</ol>
</div>
Tue, 13 Oct 2020 02:00:00 -0700
https://bairblog.github.io/2020/10/13/supervised-rl/
https://bairblog.github.io/2020/10/13/supervised-rl/
Plan2Explore: Active Model-Building for Self-Supervised Visual Reinforcement Learning
<meta name="twitter:title" content="Plan2Explore: Active Model-Building for Self-Supervised Visual Reinforcement Learning" />
<meta name="twitter:card" content="summary_large_image" />
<meta name="twitter:image" content="https://bair.berkeley.edu/static/blog/plan2explore/figure1_teaser.gif" />
<p style="text-align:center;">
<img src="https://bair.berkeley.edu/static/blog/plan2explore/figure1_teaser.gif" height="" width="90%" />
<br />
</p>
<p><em>This post is cross-listed <a href="https://blog.ml.cmu.edu/2020/10/06/plan2explore/">on the CMU ML blog</a></em>.</p>
<p>To operate successfully in unstructured open-world environments, autonomous intelligent agents need to solve many different tasks and learn new tasks quickly. Reinforcement learning has enabled artificial agents to solve complex tasks in both <a href="https://deepmind.com/research/case-studies/alphago-the-story-so-far">simulation</a> and the <a href="https://ai.googleblog.com/2018/06/scalable-deep-reinforcement-learning.html">real world</a>. However, it requires collecting large amounts of experience in the environment, and the agent learns only that particular task, much like a student memorizing a lecture without understanding. Self-supervised reinforcement learning has emerged <a href="https://pathak22.github.io/noreward-rl/">as</a> <a href="https://arxiv.org/abs/1903.03698">an</a> <a href="https://arxiv.org/abs/1907.01657">alternative</a>, where the agent follows only an intrinsic objective that is independent of any individual task, analogously to <a href="https://www.youtube.com/watch?v=SaJL4SLfrcY&ab_channel=InriaChannel">unsupervised representation learning</a>. After experimenting with the environment without supervision, the agent builds an understanding of the environment, which enables it to adapt to specific downstream tasks more efficiently.</p>
<p>In this post, we explain our recent publication that develops <a href="https://ramanans1.github.io/plan2explore/">Plan2Explore</a>. While many recent papers on self-supervised reinforcement learning have focused on <a href="https://spinningup.openai.com/en/latest/spinningup/rl_intro2.html">model-free</a> agents that can only capture knowledge by remembering behaviors practiced during self-supervision, our agent learns an internal <a href="https://bair.berkeley.edu/blog/2019/12/12/mbpo/">world model</a> that lets it extrapolate beyond memorized facts by predicting what will happen as a consequence of different potential actions. The world model captures general knowledge, allowing Plan2Explore to quickly solve new tasks through planning in its own imagination. In contrast to the model-free prior work, the world model further enables the agent to explore what it expects to be novel, rather than repeating what it found novel in the past. Plan2Explore obtains state-of-the-art zero-shot and few-shot performance on continuous control benchmarks with high-dimensional input images. To make it easy to experiment with our agent, we are open-sourcing the complete <a href="https://github.com/ramanans1/plan2explore">source code</a>.</p>
<!--more-->
<h1 id="how-does-plan2explore-work">How does Plan2Explore work?</h1>
<p>At a high level, Plan2Explore works by training a world model, exploring to
maximize the information gain for the world model, and using the world model at
test time to solve new tasks (see figure above). Thanks to effective
exploration, the learned world model is general and captures information that
can be used to solve multiple new tasks with no or few additional environment
interactions. We discuss each part of the Plan2Explore algorithm individually
below. We assume a basic understanding of <a href="https://en.wikipedia.org/wiki/Reinforcement_learning">reinforcement
learning</a> in this post.</p>
<h1 id="learning-the-world-model">Learning the world model</h1>
<p>Plan2Explore learns a world model that predicts future outcomes given past
observations $o_{1:t}$ and actions $a_{1:t}$. To handle high-dimensional image
observations, we encode them into lower-dimensional features $h$ and use an <a href="https://ai.googleblog.com/2019/02/introducing-planet-deep-planning.html">RSSM</a>
model that predicts forward in a compact latent state-space $s$. The latent
state aggregates information from past observations and is trained for future
prediction, using a variational objective that reconstructs future
observations. Since the latent state learns to represent the observations,
during planning we can predict entirely in the latent state without decoding
the images themselves. The figure below shows our latent prediction
architecture.</p>
<p style="text-align:center;">
<img src="https://bair.berkeley.edu/static/blog/plan2explore/figure2_model.gif" height="" width="90%" />
<br />
</p>
<h1 id="a-novelty-metric-for-active-model-building">A novelty metric for active model-building</h1>
<p>To learn an accurate and general world model we need an exploration strategy
that collects new and informative data. To achieve this, Plan2Explore uses a
novelty metric derived from the model itself. The novelty metric measures the
expected information gained about the environment upon observing the new data.
As the figure below shows, this is approximated by the disagreement <a href="https://arxiv.org/abs/1612.01474">of</a>
<a href="https://pathak22.github.io/exploration-by-disagreement/">an</a> <a href="https://arxiv.org/abs/2002.08791">ensemble</a> of $K$ latent models.
Intuitively, large latent disagreement reflects high model uncertainty, and
obtaining the data point would reduce this uncertainty. By maximizing latent
disagreement, Plan2Explore selects actions that lead to the largest information
gain, therefore improving the model as quickly as possible.</p>
<p style="text-align:center;">
<img src="https://bair.berkeley.edu/static/blog/plan2explore/figure3_disagreement.gif" height="" width="65%" />
<br />
</p>
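The disagreement computation can be sketched as follows. This is a simplified stand-in for the paper's ensemble of latent models, scoring novelty as the per-dimension variance across the ensemble's predictions:

```python
import statistics

# Ensemble-disagreement novelty sketch: K hypothetical latent models each
# predict the next latent state; the mean per-dimension variance across
# their predictions approximates the expected information gain.
def disagreement(predictions):
    # predictions: list of K predicted latent vectors of equal length
    dims = zip(*predictions)
    return sum(statistics.pvariance(d) for d in dims) / len(predictions[0])

confident = [[1.0, 2.0]] * 5                      # all models agree
uncertain = [[1.0, 2.0], [0.0, 3.0], [2.0, 1.0],
             [1.5, 2.5], [0.5, 1.5]]              # models disagree

# Zero disagreement means the data point is already well modeled;
# high disagreement marks it as informative to collect.
```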
<h1 id="planning-for-future-novelty">Planning for future novelty</h1>
<p>To effectively maximize novelty, we need to know which parts of the environment
are still unexplored. Most prior work on self-supervised exploration used
model-free methods that reinforce past behavior that resulted in novel
experience. This makes these methods slow to explore: since they can only
repeat exploration behavior that was successful in the past, they are unlikely
to stumble onto something novel. In contrast, Plan2Explore plans for expected
novelty by measuring model uncertainty of imagined future outcomes. By seeking
trajectories that have the highest uncertainty, Plan2Explore explores exactly
the parts of the environments that were previously unknown.</p>
<p>To choose actions $a$ that optimize the exploration objective, Plan2Explore
leverages the learned world model as shown in the figure below. The actions are
selected to maximize the expected novelty of the entire future sequence
$s_{t:T}$, using imaginary rollouts of the world model to estimate the novelty.
To solve this optimization problem, we use the <a href="https://ai.googleblog.com/2020/03/introducing-dreamer-scalable.html">Dreamer</a>
agent, which learns a policy $\pi_\phi$ using a value function and analytic
gradients through the model. The policy is learned completely inside the
imagination of the world model. During exploration, this imagination training
ensures that our exploration policy is always up-to-date with the current world
model and collects data that are still novel. The figure below shows the
imagination training process.</p>
<p style="text-align:center;">
<img src="https://bair.berkeley.edu/static/blog/plan2explore/figure4_policy.gif" height="" width="90%" />
<br />
</p>
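The planning loop can be sketched as follows, with stand-in dynamics and novelty functions in place of the learned world model and ensemble disagreement. Plan2Explore itself optimizes the policy with Dreamer's analytic gradients rather than enumerating candidates as done here:

```python
# Planning-for-novelty sketch: imagine rollouts for candidate action
# sequences inside a (hypothetical) world model and pick the sequence
# whose imagined future states score the highest total novelty.
def imagined_novelty(action_seq, model_step, novelty, state=0.0):
    total = 0.0
    for a in action_seq:
        state = model_step(state, a)   # imagine the next latent state
        total += novelty(state)        # accumulate expected novelty
    return total

model_step = lambda s, a: s + a        # stand-in latent dynamics
novelty = lambda s: abs(s)             # stand-in disagreement score

candidates = [[0.1] * 5, [1.0] * 5, [-0.5] * 5]
best = max(candidates,
           key=lambda seq: imagined_novelty(seq, model_step, novelty))
# The sequence driving the state farthest into unexplored territory wins.
```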
<h1 id="evaluation-of-curiosity-driven-exploration-behavior">Evaluation of curiosity-driven exploration behavior</h1>
<p>We evaluate Plan2Explore on the <a href="https://github.com/deepmind/dm_control">DeepMind Control Suite</a>, which
features 20 tasks requiring different control skills, such as locomotion,
balancing, and simple object manipulation. The agent only has access to image
observations and no proprioceptive information. Instead of random exploration,
which fails to take the agent far from the initial position, Plan2Explore leads
to diverse movement strategies like jumping, running, and flipping, as shown in
the figure below. Later, we will see that these are effective practice episodes
that enable the agent to quickly learn to solve various continuous control
tasks.</p>
<p style="text-align:center;">
<img src="https://bair.berkeley.edu/static/blog/plan2explore/figure5_gif1.gif" height="190" width="" />
<img src="https://bair.berkeley.edu/static/blog/plan2explore/figure5_gif2.gif" height="190" width="" />
<img src="https://bair.berkeley.edu/static/blog/plan2explore/figure5_gif3.gif" height="190" width="" /><br />
<img src="https://bair.berkeley.edu/static/blog/plan2explore/figure5_gif4.gif" height="190" width="" />
<img src="https://bair.berkeley.edu/static/blog/plan2explore/figure5_gif5.gif" height="190" width="" />
<img src="https://bair.berkeley.edu/static/blog/plan2explore/figure5_gif6.gif" height="190" width="" />
<br />
</p>
<h1 id="evaluation-of-downstream-task-performance">Evaluation of downstream task performance</h1>
<p>Once an accurate and general world model is learned, we test Plan2Explore on
previously unseen tasks. Given a task specified with a reward function, we use
the model to optimize a policy for that task. Similar to our exploration
procedure, we optimize a new value function and a new policy head for the
downstream task. This optimization uses only predictions imagined by the model,
enabling Plan2Explore to solve new downstream tasks in a zero-shot manner
without any additional interaction with the world.</p>
<p>The following plot shows the performance of Plan2Explore on tasks from DM
Control Suite. Before 1 million environment steps, the agent doesn’t know the
task and simply explores. The agent solves the task as soon as it is provided
at 1 million steps, and continues to improve quickly in the few-shot regime thereafter.</p>
<p style="text-align:center;">
<img src="https://bair.berkeley.edu/static/blog/plan2explore/figure6_plot.png" height="" width="" />
<br />
</p>
<p>Plan2Explore (<font color="green"><strong>—</strong></font>) is able to solve most of the tasks we benchmarked. Since prior
work on self-supervised reinforcement learning used model-free agents that are
not able to adapt in a zero-shot manner (<a href="https://pathak22.github.io/noreward-rl/">ICM</a>, <font color="blue"><strong>—</strong></font>), or did not use
image observations, we compare by adapting this prior work to our model-based
Plan2Explore setup. Our latent disagreement objective outperforms other
previously proposed objectives. More interestingly, the final performance of
Plan2Explore is comparable to the state-of-the-art <a href="https://ai.googleblog.com/2020/03/introducing-dreamer-scalable.html">oracle</a>
agent that requires task rewards throughout training (<font color="yellow"><strong>—</strong></font>). In our <a href="https://arxiv.org/abs/2005.05960">paper</a>, we further report
performance of Plan2Explore in the zero-shot setting where the agent needs to
solve the task before any task-oriented practice.</p>
<h1 id="future-directions">Future directions</h1>
<p>Plan2Explore demonstrates that effective behavior can be learned through
self-supervised exploration only. This opens multiple avenues for future
research:</p>
<ul>
<li>
<p>First, to apply self-supervised RL to a variety of settings, future work will
investigate different ways of specifying the task and deriving behavior from
the world model. For example, the task could be specified with a
demonstration, description of the desired goal state, or communicated to the
agent in natural language.</p>
</li>
<li>
<p>Second, while Plan2Explore is completely self-supervised, in many cases a
weak supervision signal is available, such as in hard exploration games,
human-in-the-loop learning, or real life. In such a semi-supervised setting,
it is interesting to investigate how weak supervision can be used to steer
exploration towards the relevant parts of the environment.</p>
</li>
<li>
<p>Finally, Plan2Explore has the potential to improve the data efficiency of
real-world robotic systems, where exploration is costly and time-consuming,
and the final task is often unknown in advance.</p>
</li>
</ul>
<p>By designing a scalable way of planning to explore in unstructured environments
with visual observations, Plan2Explore provides an important step toward
self-supervised intelligent machines.</p>
<hr />
<p>We would like to thank Georgios Georgakis and the editors of CMU and BAIR blogs for the useful feedback.</p>
<p>This post is based on the following paper:</p>
<ul>
<li>Planning to Explore via Self-Supervised World Models<br />
Ramanan Sekar*, Oleh Rybkin*, Kostas Daniilidis, Pieter Abbeel, Danijar Hafner, Deepak Pathak<br />
Thirty-seventh International Conference on Machine Learning (ICML), 2020.<br />
<a href="https://arxiv.org/abs/2005.05960">arXiv</a>, <a href="https://ramanans1.github.io/plan2explore/">Project Website</a></li>
</ul>
Tue, 06 Oct 2020 02:00:00 -0700
https://bairblog.github.io/2020/10/06/plan2explore/
https://bairblog.github.io/2020/10/06/plan2explore/