Alignment Newsletter #18
Learning Dexterity (Many people at OpenAI): Most current experiments with robotics work on relatively small state spaces (think 7 degrees of freedom, each a real number) and are trained in simulation. If we could throw a lot of compute at the problem, could we do significantly better? Yes! Using the same general approach as with OpenAI Five, OpenAI has built a system called Dactyl, which allows a physical real-world dexterous hand to manipulate a block. It may not seem as impressive as the videos of humanoids running through obstacle courses, but this is way harder than your typical Mujoco environment, especially since they aim to get it working on a real robot. As with OpenAI Five, they only need a reward function (I believe not even a shaped reward function in this case), a simulator, and a good way to explore. In this setting though, “exploration” is actually domain randomization, where you randomly set parameters that you are uncertain about (such as the coefficient of friction between two surfaces), so that the learned policy is robust to distribution shift from the simulator to the real world. (OpenAI Five also used domain randomization, but in that case it was not because we were uncertain about the parameters in the simulator, but because the policy was too specialized to the kinds of characters and heroes it was seeing, and randomizing those properties exposed it to a wider variety of scenarios so it had to learn more general policies.) They use 6144 CPU cores and 8 GPUs, which is much less than for OpenAI Five, but much more than for a typical Mujoco environment.
They do separate the problem into two pieces—first, they learn how to map from camera pictures to a 3D pose (using convolutional nets), and second, they use RL to choose actions based on the 3D pose. They can also get better estimates of the 3D pose using motion tracking. They find that the CNN is almost as good as motion tracking, and that the domain randomization is crucial for getting the system to actually work.
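As a rough illustration of the domain randomization described above, here is a minimal sketch that resamples uncertain simulator parameters for every training episode. The parameter names, the ranges, and the `SimParams` container are all hypothetical; the post does not specify them, and a real setup would pass the sampled values to the physics simulator.

```python
import random
from dataclasses import dataclass


@dataclass
class SimParams:
    """Hypothetical physical parameters for one simulated episode."""
    friction: float
    object_mass_kg: float
    action_delay_s: float


# Hypothetical uncertainty ranges; these numbers are illustrative only.
PARAM_RANGES = {
    "friction": (0.5, 1.5),
    "object_mass_kg": (0.03, 0.3),
    "action_delay_s": (0.0, 0.08),
}


def sample_sim_params(rng: random.Random) -> SimParams:
    """Draw each parameter uniformly from its uncertainty range."""
    return SimParams(
        **{name: rng.uniform(low, high) for name, (low, high) in PARAM_RANGES.items()}
    )


def training_episodes(num_episodes: int, seed: int = 0):
    """Yield a freshly randomized parameter set for every training episode."""
    rng = random.Random(seed)
    for _ in range(num_episodes):
        yield sample_sim_params(rng)
```

Because the policy never sees the same physics twice, it cannot overfit to one particular simulator configuration, which is the intuition behind the robustness to the sim-to-real shift.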
They also have a couple of sections on surprising results and things that didn’t work. Probably the most interesting part was that they didn’t need to use the tactile sensors to get these results. They couldn’t get these sensors in simulation, so they just did without and it seems to have worked fine. It also turns out that the robot’s reaction time wasn’t too important—there wasn’t a big difference in changing from 80ms reaction time to 40ms reaction time; in fact, this just increased the required training time without much benefit.
Probably the most interesting part of the post is the last paragraph (the parenthetical notes are mine): “This project completes a full cycle of AI development that OpenAI has been pursuing for the past two years: we’ve developed a new learning algorithm (PPO), scaled it massively to solve hard simulated tasks (OpenAI Five), and then applied the resulting system to the real world (this post). Repeating this cycle at increasing scale is the primary route we are pursuing to increase the capabilities of today’s AI systems towards safe artificial general intelligence.”
My opinion: This is pretty exciting—transferring a policy from simulation to the real world is notoriously hard, but it turns out that as long as you use domain randomization (and 30x the compute) it actually is possible to transfer the policy. I wish they had compared the success probability in simulation to the success probability in the real world—right now I don’t know how well the policy transferred. (That is, I want to evaluate how well domain randomization solved the distribution shift problem.) Lots of other exciting things too, but they are pretty similar to the exciting things about OpenAI Five, such as the ability to learn higher level strategies like finger pivoting and sliding (analogously, fighting over mid or 5-man push).
Variational Option Discovery Algorithms (Joshua Achiam et al): We can hope to do hierarchical reinforcement learning by first discovering several useful simple policies (or “options”) by just acting in the environment without any reward function, and then using these options as primitive actions in a higher level policy that learns to do some task (using a reward function). How could we learn the options without a reward function though? Intuitively, we would like to learn behaviors that are different from each other. One way to frame this is as an encoder-decoder problem. Suppose we want to learn K options. Then, we can give the encoder a number in the range [1, K], have it “encode” the number into a trajectory τ (that is, our encoder is a policy), and then have a decoder take τ and recover the original number. We train the encoder/policy and decoder jointly, optimizing them to successfully recover the original number (called a context). Intuitively, the encoder/policy wants to have very different behaviors for each option, so that it is easy for the decoder to figure out the context from the trajectory τ. However, a simple solution would be for the encoder/policy to just take a particular series of actions for each context and then stop, and the decoder learns an exact mapping from final states to contexts. To avoid this, we can decrease the capacity of the decoder (i.e. don’t give it too many layers), and we also optimize for the entropy of the encoder/policy, which encourages the encoder/policy to be more stochastic, and so it is more likely to learn overall behaviors that can still have some stochasticity, while still allowing the decoder to decode them. It turns out that this optimization problem has a one-to-one correspondence with variational autoencoders, motivating the name “variational option discovery”. To stabilize training, they start with a small K, and increase K whenever the decoder becomes powerful enough.
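To make the objective and the curriculum above concrete, here is a minimal sketch. It assumes we already have, for each sampled trajectory, the log-probability the decoder assigns to the true context and an estimate of the policy's entropy. The function names, the entropy weight, and the curriculum threshold and growth rule are hypothetical choices for illustration, not values taken from the summary.

```python
import math
import random


def option_discovery_objective(decoder_log_probs, policy_entropies, entropy_weight=0.1):
    """Batch average of log P_D(context | trajectory) + entropy_weight * entropy.

    decoder_log_probs[i]: log-probability the decoder gives the true context
        of trajectory i.
    policy_entropies[i]: an entropy estimate for the policy on trajectory i.
    entropy_weight: hypothetical trade-off coefficient.
    """
    total = 0.0
    for log_prob, entropy in zip(decoder_log_probs, policy_entropies):
        total += log_prob + entropy_weight * entropy
    return total / len(decoder_log_probs)


def sample_context(num_options, rng):
    """Sample a context uniformly from {1, ..., num_options}."""
    return rng.randint(1, num_options)


def maybe_grow_options(num_options, mean_decoder_prob,
                       threshold=0.85, growth=1.5, max_options=64):
    """Curriculum step: add options once the decoder recovers contexts reliably.

    threshold, growth, and max_options are hypothetical settings.
    """
    if mean_decoder_prob >= threshold:
        return min(int(growth * num_options + 1), max_options)
    return num_options
```

Both the policy and the decoder would be trained to increase this objective; the curriculum check would run periodically, using the decoder's average probability on the correct context as the signal that it has become "powerful enough".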
They evaluate in Gym environments, a simulated robotic hand, and a new “Toddler” environment. They find that the scheme works well (in terms of maximizing the objective) in all environments, but that the learned behaviors no longer look natural in the Toddler environment (which is the most complex). They also show that the learned policies can be used for hierarchical RL in the AntMaze problem.
This is very similar to the recent Diversity Is All You Need. DIAYN aims to decode the context from every state along a trajectory, which incentivizes it to find behaviors of the form “go to a goal state”, whereas VALOR (this work) decodes the context from the entire trajectory (without actions, which would make the decoder’s job too easy), which allows it to learn behaviors with motion, such as “go around in a circle”.
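A toy example can show why decoding from the whole trajectory matters for behaviors with motion. The sketch below assumes a hypothetical 4-cell ring world in which states are positions only. Walking clockwise and counterclockwise visits exactly the same states equally often, so a decoder that looks at states one at a time has nothing to distinguish the two options, while a decoder that sees the ordered trajectory does.

```python
from collections import Counter


def circle(num_steps, direction):
    """States visited while walking around a 4-cell ring.

    direction is +1 (clockwise) or -1 (counterclockwise).
    """
    return [(direction * t) % 4 for t in range(num_steps)]


clockwise = circle(8, +1)
counterclockwise = circle(8, -1)

# What a per-state decoder sees: how often each state occurs under each option.
same_state_visits = Counter(clockwise) == Counter(counterclockwise)

# What a whole-trajectory decoder sees: the ordered sequence of states.
same_trajectory = clockwise == counterclockwise
```

Here `same_state_visits` is true while `same_trajectory` is false, which matches the intuition that per-state decoding favors "go to a goal state" behaviors.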
My opinion: It’s really refreshing to read a paper with a negative result about their own method (specifically, that the learned behaviors on Toddler do not look natural). It makes me trust the rest of their paper so much more. (A very gameable instinct, I know.) While they were able to find a fairly diverse set of options, and could interpolate between them, their experiments found that using this for hierarchical RL was about as good as training hierarchical RL from scratch. I guess I’m just saying things they’ve already said—I think they’ve done such a great job writing this paper that they’ve already told me what my opinion about the topic should be, so there’s not much left for me to say.
Technical AI alignment
A Gym Gridworld Environment for the Treacherous Turn (Michaël Trazzi): An example Gym environment in which the agent starts out “weak” (having an inaccurate bow) and later becomes “strong” (getting a bow with perfect accuracy), after which the agent undertakes a treacherous turn in order to kill the supervisor and wirehead.
My opinion: I’m a fan of executable code that demonstrates the problems that we are worrying about: it makes the concept (in this case, a treacherous turn) more concrete. In order to make it more realistic, I would want the agent to grow in capability organically (rather than simply getting a more powerful weapon). It would really drive home the point if the agent undertook a treacherous turn the very first time, whereas in this post I assume it learned using many episodes of trial-and-error that a treacherous turn leads to higher reward. This seems hard to demonstrate with today’s ML in any complex environment, where you need to learn from experience instead of using e.g. value iteration, but it’s not out of the question in a continual learning setup where the agent can learn a model of the world.
Counterfactuals, thick and thin (Nisan): There are many different ways to formalize counterfactuals (the post suggests three such ways). Often, for any given way of formalizing counterfactuals, there are many ways you could take a counterfactual, which give different answers. When considering the physical world, we have strong causal models that can tell us which one is the “correct” counterfactual. However, there is no such method for logical counterfactuals yet.
My opinion: I don’t think I understood this post, so I’ll abstain on an opinion.
Decisions are not about changing the world, they are about learning what world you live in (shminux): The post tries to reconcile decision theory (in which agents can “choose” actions) with the deterministic physical world (in which nothing can be “chosen”), using many examples from decision theory.
Handling groups of agents
Multi-Agent Generative Adversarial Imitation Learning (Jiaming Song et al): This paper generalizes GAIL (which was covered last week) to the multiagent setting, where we want to imitate a group of interacting agents. They want to find a Nash equilibrium in particular. They formalize the Nash equilibrium constraints and use this to motivate a particular optimization problem for multiagent IRL, that looks very similar to their optimization problem for regular IRL in GAIL. After that, it is quite similar to GAIL—they use a regularizer ψ for the reward functions, show that the composition of multiagent RL and multiagent IRL can be solved as a single optimization problem involving the convex conjugate of ψ, and propose a particular instantiation of ψ that is data-dependent, giving an algorithm. They do have to assume in the theory that the multiagent RL problem has a unique solution, which is not typically true, but may not be too important. As before, to make the algorithm practical, they structure it like a GAN, with discriminators acting like reward functions. What if we have prior information that the game is cooperative or competitive? In this case, they propose changing the regularizer ψ, making it keep all the reward functions the same (if cooperative), making them negations of each other (in two-player zero-sum games), or leaving it as is. They evaluate in a variety of simple multiagent games, as well as a plank environment in which the environment changes between training and test time, thus requiring the agent to learn a robust policy, and find that the correct variant of MAGAIL (cooperative/competitive/neither) outperforms both behavioral cloning and single-agent GAIL (which they run N times to infer a separate reward for each agent).
My opinion: Multiagent settings seem very important (since there does happen to be more than one human in the world). This looks like a useful generalization from the single agent case to the multiagent case, though it’s not clear to me that this deals with the major challenges that come from multiagent scenarios. One major challenge is that there is no longer a single optimal equilibrium when there are multiple agents, but they simply assume in their theoretical analysis that there is only one solution. Another one is that it seems more important that the policies take history into account somehow, but they don’t do this. (If you don’t take history into account, then you can’t learn strategies like tit-for-tat in the iterated prisoner’s dilemma.) But to be clear, I think this is the standard setup for multiagent RL; it seems like the field is not trying to deal with this issue yet (even though they could using e.g. a recurrent policy, I think?).
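The point about history can be made concrete with a small sketch of the iterated prisoner's dilemma. The policy and helper names below are hypothetical. Tit-for-tat is a function of the opponent's previous move, so a policy that receives no history (and therefore sees the same input every round) cannot represent it.

```python
COOPERATE, DEFECT = "C", "D"


def tit_for_tat(opponent_history):
    """Cooperate on the first round, then copy the opponent's previous move."""
    return COOPERATE if not opponent_history else opponent_history[-1]


def always_defect(opponent_history):
    """A history-free policy: the same move regardless of what happened before."""
    return DEFECT


def play(policy_a, policy_b, rounds):
    """Play the iterated game; each policy sees only the other's past moves."""
    moves_a, moves_b = [], []
    for _ in range(rounds):
        move_a = policy_a(moves_b)
        move_b = policy_b(moves_a)
        moves_a.append(move_a)
        moves_b.append(move_b)
    return moves_a, moves_b
```

For example, `play(tit_for_tat, always_defect, 4)` has the first player cooperate once and then defect for the remaining rounds, a change in behavior that depends entirely on remembering the previous round.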
Safely and usefully spectating on AIs optimizing over toy worlds (Alex Mennen): One way to achieve safety would be to build an AI that optimizes in a virtual world running on a computer, and doesn’t care about the physical world. Even if it realizes that it can break out and e.g. get more compute, these sorts of changes to the physical world would not be helpful for the purpose of optimizing the abstract computational object that is the virtual world. However, if we take the results of the AI and build them in the real world, that causes a distributional shift from the toy world to the real world that could be catastrophic. For example, if the AI created another agent in the toy world that did reasonable things in the toy world, when we bring it to the real world it may realize that it can instead manipulate humans in order to do things.
My opinion: It’s not obvious to me, even on the “optimizing an abstract computational process” model, why an AI would not want to get more compute: it can use this compute for itself, without changing the abstract computational process it is optimizing, and it will probably do better. It seems that if you want to get this to work, you need to have the AI want to compute the result of running itself without any modification or extra compute on the virtual world. This feels very hard to me. Separately, I also find it hard to imagine us building a virtual world that is similar enough to the real world that we are able to transfer solutions between the two, even with some finetuning in the real world.
Sandboxing by Physical Simulation? (moridinamael)
Evaluating and Understanding the Robustness of Adversarial Logit Pairing (Logan Engstrom, Andrew Ilyas and Anish Athalye)
AI strategy and policy
The Facets of Artificial Intelligence: A Framework to Track the Evolution of AI (Fernando Martinez-Plumed et al)
Podcast: Six Experts Explain the Killer Robots Debate (Paul Scharre, Toby Walsh, Richard Moyes, Mary Wareham, Bonnie Docherty, Peter Asaro, and Ariel Conn)
Learning Dexterity (Many people at OpenAI): Summarized in the highlights!
Variational Option Discovery Algorithms (Joshua Achiam et al): Summarized in the highlights!
Learning Plannable Representations with Causal InfoGAN (Thanard Kurutach, Aviv Tamar et al): Hierarchical reinforcement learning aims to learn a hierarchy of actions that an agent can take, each implemented in terms of actions lower in the hierarchy, in order to get more efficient planning. Another way we can achieve this is to use a classical planning algorithm to find a sequence of waypoints, or states that the agent should reach that will allow it to reach its goal. These waypoints can be thought of as a high-level plan. You can then use standard RL algorithms to figure out how to go from one waypoint to the next. However, typical planning algorithms that can produce a sequence of waypoints require very structured state representations, that were designed by humans in the past. How can we learn them directly from data? This paper proposes Causal InfoGAN. They use a GAN where the generator creates adjacent waypoints in the sequence, while the discriminator tries to distinguish between waypoints from the generator and pairs of points sampled from the true environment. This incentivizes the generator to generate waypoints that are close to each other, so that we can use an RL algorithm to learn to go from one waypoint to the next. However, this only lets us generate adjacent waypoints. In order to use this to make a sequence of waypoints that gets from a start state to a goal state, we need to use some classical planning algorithm. In order to do that, we need to have a structured state representation. GANs do not do this by default. InfoGAN tries to make the latent representation in a GAN more meaningful by providing the generator with a “code” (a state in our case) and maximizing the mutual information of the code and the output of the generator. In this setting, we want to learn representations that are good for planning, so we want to encode information about transitions between states. 
This leads to the Causal InfoGAN objective, where we provide the generator with a pair of abstract states (s, s’), have it generate a pair of observations (o, o’) and maximize the mutual information between (s, s’) and (o, o’), so that s and s’ become good low-dimensional representations of o and o’. They show that Causal InfoGAN can create sequences of waypoints in a rope manipulation task, that previously had to be done manually.
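The classical planning step mentioned above can be sketched as a graph search over abstract states. This is a minimal illustration using breadth-first search, assuming a hypothetical `neighbors` function that lists which abstract states are reachable in one step; it is not the paper's planner, and in Causal InfoGAN the transition structure would come from the learned model rather than a hand-written table.

```python
from collections import deque


def plan_waypoints(start, goal, neighbors):
    """Breadth-first search over abstract states.

    Returns the list of abstract states from start to goal (inclusive),
    or None if the goal cannot be reached.
    """
    parents = {start: None}
    queue = deque([start])
    while queue:
        state = queue.popleft()
        if state == goal:
            # Walk the parent pointers back to the start to recover the plan.
            path = []
            while state is not None:
                path.append(state)
                state = parents[state]
            return path[::-1]
        for next_state in neighbors(state):
            if next_state not in parents:
                parents[next_state] = state
                queue.append(next_state)
    return None


# Hypothetical toy transition structure over five abstract states.
TOY_GRAPH = {0: [1, 2], 1: [3], 2: [3], 3: [4], 4: []}
toy_plan = plan_waypoints(0, 4, lambda s: TOY_GRAPH[s])
```

Each abstract state in the returned plan would then be mapped back to an observation by the generator, giving the sequence of waypoints for a low-level RL controller to follow.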
My opinion: We’re seeing more and more work combining classical symbolic approaches with the current wave of statistical machine learning from big data, which gives us the best of both worlds. While the results we see are not general intelligence, it’s becoming less and less true that you can point to a broad swath of capabilities that AI cannot do yet. I wouldn’t be surprised if a combination of symbolic and statistical AI techniques led to large capability gains in the next few years.
TensorFuzz: Debugging Neural Networks with Coverage-Guided Fuzzing (Augustus Odena et al)