Sequence Modeling Solutions for Reinforcement Learning Problems

Long-horizon predictions of (top) the Trajectory Transformer compared to those of (bottom) a single-step dynamics model.

Modern machinelearningsuccessstories often have one thing in common: they use methods that scale gracefully with ever-increasing amounts of data.
This is particularly clear from recent advances in sequence modeling, where simply increasing the size of a stable architecture and its training set leads to qualitativelydifferentcapabilities.^{1}

Meanwhile, the situation in reinforcement learning has proven more complicated.
While it has been possible to apply reinforcement learning algorithms to large-scaleproblems, generally there has been much more friction in doing so.
In this post, we explore whether we can alleviate these difficulties by tackling the reinforcement learning problem with the toolbox of sequence modeling.
The end result is a generative model of trajectories that looks like a large language model and a planning algorithm that looks like beam search.
Code for the approach can be found here.

Fig. 1: The BRIDGE dataset contains 7200 demonstrations of kitchen-themed manipulation tasks across 71 tasks in 10 domains. Note that any GIF compression artifacts in this animation are not present in the dataset itself.

When we apply robot learning methods to real-world systems, we must usually collect new datasets for every task, every robot, and every environment. This is not only costly and time-consuming, but it also limits the size of the datasets that we can use, and this, in turn, limits generalization: if we train a robot to clean one plate in one kitchen, it is unlikely to succeed at cleaning any plate in any kitchen. In other fields, such as computer vision (e.g., ImageNet) and natural language processing (e.g., BERT), the standard approach to generalization is to utilize large, diverse datasets, which are collected once and then reused repeatedly. Since the dataset is reused for many models, tasks, and domains, the up-front cost of collecting such large reusable datasets is worth the benefits. Thus, to obtain truly generalizable robotic behaviors, we may need large and diverse datasets, and the only way to make this practical is to reuse data across many different tasks, environments, and labs (i.e. different background lighting conditions, etc.).

To understand neural networks, researchers often use similarity metrics to measure how similar or different two neural networks are to each other. For instance, they are used to compare vision transformers to convnets [1], to understand transfer learning [2], and to explain the success of standard training practices for deep models [3]. Below is an example visualization using similarity metrics; specifically we use the popular CKA similarity metric (introduced in [4]) to compare two transformer models across different layers:

Figure 1. CKA (Centered Kernel Alignment) similarity between two networks trained identically except for random initialization. Lower values (darker colors) are more similar. CKA suggests that the two networks have similar representations.

An example of our method deployed on a Clearpath Jackal ground robot (left) exploring a suburban environment to find a visual target (inset). (Right) Egocentric observations of the robot.

Imagine you’re in an unfamiliar neighborhood with no house numbers and I give you a photo that I took a few days ago of my house, which is not too far away. If you tried to find my house, you might follow the streets and go around the block looking for it. You might take a few wrong turns at first, but eventually you would locate my house. In the process, you would end up with a mental map of my neighborhood. The next time you’re visiting, you will likely be able to navigate to my house right away, without taking any wrong turns.

Such exploration and navigation behavior is easy for humans. What would it take for a robotic learning algorithm to enable this kind of intuitive navigation capability? To build a robot capable of exploring and navigating like this, we need to learn from diverse prior datasets in the real world. While it’s possible to collect a large amount of data from demonstrations, or even with randomized exploration, learning meaningful exploration and navigation behavior from this data can be challenging – the robot needs to generalize to unseen neighborhoods, recognize visual and dynamical similarities across scenes, and learn a representation of visual observations that is robust to distractors like weather conditions and obstacles. Since such factors can be hard to model and transfer from simulated environments, we tackle these problems by teaching the robot to explore using only real-world data.

Many experimental works have observed that generalization in deep RL appears to be difficult: although RL agents can learn to perform very complex tasks, they don’t seem to generalize over diverse task distributions as well as the excellent generalization of supervised deep nets might lead us to expect. In this blog post, we will aim to explain why generalization in RL is fundamentally harder, and indeed more difficult even in theory.

We will show that attempting to generalize in RL induces implicit partial observability, even when the RL problem we are trying to solve is a standard fully-observed MDP. This induced partial observability can significantly complicate the types of policies needed to generalize well, potentially requiring counterintuitive strategies like information-gathering actions, recurrent non-Markovian behavior, or randomized strategies. Ordinarily, this is not necessary in fully observed MDPs but surprisingly becomes necessary when we consider generalization from a finite training set in a fully observed MDP. This blog post will walk through why partial observability can implicitly arise, what it means for the generalization performance of RL algorithms, and how methods can account for partial observability to generalize well.

Figure 1: Offline Model-Based Optimization (MBO): The goal of offline MBO is to optimize an unknown objective function $f(x)$ with respect to $x$, provided access to only as static, previously-collected dataset of designs.

Machine learning methods have shown tremendous promise on prediction problems: predicting the efficacy of a drug, predicting how a protein will fold, or predicting the strength of a composite material. But can we use machine learning for design? Conventionally, such problems have been tackled with black-box optimization procedures that repeatedly query an objective function. For instance, if designing a drug, the algorithm will iteratively modify the drug, test it, then modify it again. But when evaluating the efficacy of a candidate design involves conducting a real-world experiment, this can quickly become prohibitive. An appealing alternative is to create designs from data. Instead of requiring active synthesis and querying, can we devise a method that simply examines a large dataset of previously tested designs (e.g., drugs that have been evaluated before), and comes up with a new design that is better? We call this offline model-based optimization (offline MBO), and in this post, we discuss offline MBO methods and some recent advances.

Fig 1. Measures of generalization performance for neural networks trained on four different boolean functions (colors) with varying training set size. For both MSE (left) and learnability (right), theoretical predictions (curves) closely match true performance (dots).

Deep learning has proven a stunning success for countless problems of interest, but this success belies the fact that, at a fundamental level, we do not understand why it works so well. Many empirical phenomena, well-known to deep learning practitioners, remain mysteries to theoreticians. Perhaps the greatest of these mysteries has been the question of generalization: why do the functions learned by neural networks generalize so well to unseen data? From the perspective of classical ML, neural nets’ high performance is a surprise given that they are so overparameterized that they could easily represent countless poorly-generalizing functions.

Diagram of MURAL, our method for learning uncertainty-aware rewards for RL. After the user provides a few examples of desired outcomes, MURAL automatically infers a reward function that takes into account these examples and the agent’s uncertainty for each state.

Although reinforcement learning has shown success in domains suchasrobotics, chip placement and playingvideogames, it is usually intractable in its most general form. In particular, deciding when and how to visit new states in the hopes of learning more about the environment can be challenging, especially when the reward signal is uninformative. These questions of reward specification and exploration are closely connected — the more directed and “well shaped” a reward function is, the easier the problem of exploration becomes. The answer to the question of how to explore most effectively is likely to be closely informed by the particular choice of how we specify rewards.

For unstructured problem settings such as robotic manipulation and navigation — areas where RL holds substantial promise for enabling better real-world intelligent agents — reward specification is often the key factor preventing us from tackling more difficult tasks. The challenge of effective reward specification is two-fold: we require reward functions that can be specified in the real world without significantly instrumenting the environment, but also effectively guide the agent to solve difficult exploration problems. In our recent work, we address this challenge by designing a reward specification technique that naturally incentivizes exploration and enables agents to explore environments in a directed way.

Earlier this year, my research group commissioned 6 questions for professional forecasters to predict about AI. Broadly speaking, 2 were on geopolitical aspects of AI and 4 were on future capabilities:

Geopolitical:

How much larger or smaller will the largest Chinese ML experiment be compared to the largest U.S. ML experiment, as measured by amount of compute used?

How much computing power will have been used by the largest non-incumbent (OpenAI, Google, DeepMind, FB, Microsoft), non-Chinese organization?

Future capabilities:

What will SOTA (state-of-the-art accuracy) be on the MATH dataset?

What will SOTA be on the Massive Multitask dataset (a broad measure of specialized subject knowledge, based on high school, college, and professional exams)?

What will be the best adversarially robust accuracy on CIFAR-10?

What will SOTA be on Something Something v2? (A video recognition dataset)

Forecasters output a probability distribution over outcomes for 2022, 2023, 2024, and 2025. They have financial incentives to produce accurate forecasts; the rewards total \$5k per question (\$30k total) and payoffs are (close to) a proper scoring rule, meaning forecasters are rewarded for outputting calibrated probabilities.

Processing raw sensory inputs is crucial for applying deep RL algorithms to real-world problems.
For example, autonomous vehicles must make decisions about how to drive safely given information flowing from cameras, radar, and microphones about the conditions of the road, traffic signals, and other cars and pedestrians.
However, direct “end-to-end” RL that maps sensor data to actions (Figure 1, left) can be very difficult because the inputs are high-dimensional, noisy, and contain redundant information.
Instead, the challenge is often broken down into two problems (Figure 1, right): (1) extract a representation of the sensory inputs that retains only the relevant information, and (2) perform RL with these representations of the inputs as the system state.

Figure 1. Representation learning can extract compact representations of states for RL.

A wide variety of algorithms have been proposed to learn lossy state representations in an unsupervised fashion (see this recent tutorial for an overview).
Recently, contrastive learning methods have proven effective on RL benchmarks such as Atari and DMControl (Oord et al. 2018, Stooke et al. 2020, Schwarzer et al. 2021), as well as for real-world robotic learning (Zhan et al.).
While we could ask which objectives are better in which circumstances, there is an even more basic question at hand: are the representations learned via these methods guaranteed to be sufficient for control?
In other words, do they suffice to learn the optimal policy, or might they discard some important information, making it impossible to solve the control problem?
For example, in the self-driving car scenario, if the representation discards the state of stoplights, the vehicle would be unable to drive safely.
Surprisingly, we find that some widely used objectives are not sufficient, and in fact do discard information that may be needed for downstream tasks.