Quadruped robot learning locomotion skills by imitating a dog.
Whether it’s a dog chasing after a ball, or a monkey swinging through the
trees, animals can effortlessly perform an incredibly rich repertoire of agile
locomotion skills. But designing controllers that enable legged robots to
replicate these agile behaviors can be a very challenging task. The superior
agility seen in animals, as compared to robots, might lead one to wonder: can
we create more agile robotic controllers with less effort by directly imitating
animals?
In this work, we present a framework for learning robotic locomotion skills by
imitating animals. Given a reference motion clip recorded from an animal (e.g.
a dog), our framework uses reinforcement learning to train a control policy
that enables a robot to imitate the motion in the real world. Then, by simply
providing the system with different reference motions, we are able to train a
quadruped robot to perform a diverse set of agile behaviors, ranging from fast
walking gaits to dynamic hops and turns. The policies are trained primarily in
simulation, and then transferred to the real world using a latent space
adaptation technique, which is able to efficiently adapt a policy using only a
few minutes of data from the real robot.
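One way to picture how reinforcement learning can reward imitation of a reference motion is a pose-tracking reward that is highest when the robot's joint angles match the reference frame. The sketch below is only an illustration under stated assumptions: the function name `pose_tracking_reward`, the exponential form, and the `scale` parameter are hypothetical choices, not details given in the text above.

```python
import math

def pose_tracking_reward(robot_pose, reference_pose, scale=5.0):
    """Hypothetical imitation reward: 1.0 for a perfect match, decaying toward 0.

    robot_pose and reference_pose are equal-length sequences of joint angles
    (radians). `scale` is an assumed sensitivity constant.
    """
    squared_error = sum((q - q_ref) ** 2
                        for q, q_ref in zip(robot_pose, reference_pose))
    return math.exp(-scale * squared_error)

# A pose that matches the reference frame earns the maximum reward.
reference = [0.1, -0.4, 0.8]
perfect = pose_tracking_reward(reference, reference)
# A pose that deviates from the reference earns strictly less.
off = pose_tracking_reward([0.3, -0.4, 0.8], reference)
```

Summing a reward of this kind over the frames of a motion clip gives the policy an incentive to follow the whole clip, and swapping in a different clip changes the skill being learned without changing the reward code.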
Consequently, it is critical that RL policies are robust: both to naturally
occurring distribution shift, and to malicious attacks by adversaries.
Unfortunately, we find that RL policies which perform at a high level in normal
situations can harbor serious vulnerabilities which can be exploited by an
adversary.
Reinforcement learning has seen a great deal of success in solving complex decision-making problems ranging from robotics to games to supply chain management to recommender systems. Despite this success, deep reinforcement learning algorithms can be exceptionally difficult to use, due to unstable training, sensitivity to hyperparameters, and generally unpredictable and poorly understood convergence properties. Multiple explanations, and corresponding solutions, have been proposed for improving the stability of such methods, and we have seen good progress over the last few years on these algorithms. In this blog post, we will dive deep into analyzing a central and underexplored reason behind some of the problems with the class of deep RL algorithms based on dynamic programming, which includes the popular DQN and soft actor-critic (SAC) algorithms: the detrimental connection between data distributions and learned models.
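To make "based on dynamic programming" concrete, here is a minimal tabular sketch of the Bellman backup that Q-learning-style methods repeat over sampled transitions. It is a simplification under stated assumptions: DQN and SAC replace the table with neural networks, and the names `bellman_backup`, `gamma`, and `lr` are illustrative, not taken from any particular library.

```python
def bellman_backup(q, transitions, gamma=0.9, lr=0.5):
    """Apply one Q-learning update per (state, action, reward, next_state, done).

    q maps state -> {action: value}. Only state-action pairs that appear in
    `transitions` are ever updated.
    """
    for s, a, r, s_next, done in transitions:
        # Bootstrapped target: immediate reward plus discounted best next value.
        target = r if done else r + gamma * max(q[s_next].values())
        q[s][a] += lr * (target - q[s][a])
    return q

# Toy two-state problem with actions "L" and "R" (hypothetical data).
q = {0: {"L": 0.0, "R": 0.0}, 1: {"L": 0.0, "R": 0.0}}
data = [(1, "R", 1.0, 1, True),    # terminal transition with reward 1
        (0, "R", 0.0, 1, False)]   # bootstraps from the value at state 1
q = bellman_backup(q, data)
```

Note that the entries for action "L" never change because no transition in the data uses it, yet the `max` in the target still reads those entries. This is one simple way to see that the learned values depend on which transitions the data happens to contain.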
Look at the images above. If I asked you to bring me a picnic blanket in the
grassy field, would you be able to? Of course. If I asked you to bring over a
cart full of food for a party, would you push the cart along the paved path or
on the grass? Obviously the paved path.
In deep learning, using more compute (e.g., increasing model size, dataset
size, or training steps) often leads to higher accuracy. This is especially
true given the recent success of unsupervised pretraining methods like
BERT, which can scale up training to very large models and datasets.
Unfortunately, large-scale training is very computationally expensive,
especially without the hardware resources of large industry research labs.
Thus, the goal in practice is usually to get high accuracy without exceeding
one’s hardware budget and training time.
For most training budgets, very large models appear impractical. Instead, the
go-to strategy for maximizing training efficiency is to use models with small
hidden sizes or few layers because these models run faster and use less memory.
In this blog post, we share our experiences in developing two critical software
libraries that many BAIR researchers use to execute large-scale AI
experiments: Ray Tune and the Ray Cluster Launcher, both of which now
back many popular open-source AI libraries.
As AI research becomes more compute intensive, many AI researchers have become
squeezed for time and resources. Many researchers now rely on cloud providers
like Amazon Web Services or Google Cloud Platform to access the huge amounts
of computational resources necessary for training large models.
All living organisms carve out environmental niches within which they can
maintain relative predictability amidst the ever-increasing entropy around them
(1), (2).
Humans, for example, go to great lengths to shield themselves from surprise —
we band together in millions to build cities with homes, supplying water, food,
gas, and electricity to control the deterioration of our bodies and living
spaces amidst heat and cold, wind and storm. The need to discover and maintain
such surprise-free equilibria has driven great resourcefulness and skill in
organisms across very diverse natural habitats. Motivated by this, we ask:
could the motive of preserving order amidst chaos guide the automatic
acquisition of useful behaviors in artificial agents?
People give massive amounts of their personal data to companies every day and
these data are used to generate tremendous business value. Some economists and
politicians argue that people should be paid for their contributions, but the
million-dollar question is: by how much?
This article discusses methods proposed in our recent AISTATS and VLDB papers
that attempt to answer this
question in the machine learning context. This is joint work with David Dao,
Boxin Wang, Frances Ann Hubis, Nezihe Merve Gurel, Nick Hynes, Bo Li, Ce Zhang,
Costas J. Spanos, and Dawn Song, as well as a collaborative effort between UC
Berkeley, ETH Zurich, and UIUC. More information about the work in our group
can be found here.
This work presents AVID, a method that allows a robot to learn a task, such as
making coffee, directly by watching a human perform the task.
One of the most important markers of intelligence is the ability to learn by
watching others. Humans are particularly good at this, often being able to
learn tasks by observing other humans. This is possible because we are not
simply copying the actions that other humans take. Rather, we first imagine
ourselves performing the task, and this provides a starting point for further
practicing the task in the real world.
Robots are not yet adept at learning by watching humans or other robots. Prior
methods for imitation learning, where robots learn from demonstrations of the
task, typically assume that the demonstrations can be given directly through
the robot, using techniques such as kinesthetic teaching or teleoperation. This
assumption limits
the applicability of robots in the real world, where robots may be frequently
asked to learn new tasks quickly and without programmers, trained roboticists,
or specialized hardware setups. Can we instead have robots learn directly from
a video of a human demonstration?
Reinforcement learning systems can make decisions in one of two ways. In the model-based approach, a system uses a predictive model of the world to ask questions of the form “what will happen if I do x?” to choose the best x. In the alternative model-free approach, the modeling step is bypassed altogether in favor of learning a control policy directly. Although in practice the line between these two techniques can become blurred, as a coarse guide it is useful for dividing up the space of algorithmic possibilities.
Predictive models can be used to ask “what if?” questions to guide future decisions.
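The "what will happen if I do x?" query can be sketched as a small planner that rolls each candidate action through a predictive model and keeps the one with the best predicted return. Everything here is a toy assumption for illustration: `model`, `reward`, and `plan` are hypothetical names, the dynamics are a hand-written 1-D point rather than a learned model, and each candidate action is simply held constant over the horizon.

```python
def model(state, action):
    """Hypothetical predictive model: a 1-D point that moves by `action`."""
    return state + action

def reward(state, goal=5.0):
    """Hypothetical reward: negative distance to an assumed goal position."""
    return -abs(state - goal)

def plan(state, candidates, horizon=3):
    """Pick the candidate action with the highest predicted return."""
    def predicted_return(action):
        s, total = state, 0.0
        for _ in range(horizon):
            s = model(s, action)   # ask the model "what if I do this?"
            total += reward(s)
        return total
    return max(candidates, key=predicted_return)

# Starting at 0 with the goal at 5, the largest step scores best.
best_far = plan(0.0, candidates=[-1.0, 0.0, 1.0, 2.0])
# Starting at the goal, standing still scores best.
best_at_goal = plan(5.0, candidates=[-1.0, 0.0, 1.0, 2.0])
```

A model-free method would skip `model` entirely and learn the mapping from state to action directly from experience.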
The natural question to ask after making this distinction is whether to use such a predictive model. The field has grappled with this question for quite a while, and is unlikely to reach a consensus any time soon. However, we have learned enough about designing model-based algorithms that it is possible to draw some general conclusions about best practices and common pitfalls. In this post, we will survey various realizations of model-based reinforcement learning methods. We will then describe some of the tradeoffs that come into play when using a learned predictive model for training a policy and how these considerations motivate a simple but effective strategy for model-based reinforcement learning. The latter half of this post is based on our recent paper on model-based policy optimization, for which code is available here.