Alignment Newsletter #23

Highlights

Visual Reinforcement Learning with Imagined Goals (Vitchyr Pong and Ashvin Nair): This is a blog post explaining a paper by the same name that I covered in AN #16. It’s particularly clear and well-explained, and I continue to think the idea is cool and interesting. I’ve recopied my summary and opinion here, but you should read the blog post; it explains the idea very well.

Hindsight Experience Replay (HER) introduced the idea of accelerating learning with sparse rewards by taking trajectories where you fail to achieve the goal (and so get no reward, and thus no learning signal) and replacing the actual goal with an “imagined” goal chosen in hindsight such that you actually achieved it, which means you get reward and can learn. This requires a space of goals such that for any trajectory, you can come up with a goal that the trajectory achieves. In practice, this means that you are limited to tasks where the goals are of the form “reach this goal state”. However, if your goal state is an image, it is very hard to learn how to act in order to reach any possible image goal state (even if you restrict to realistic ones), since the space is so large and unstructured. The authors propose to first learn a structured latent representation of the space of images using a variational autoencoder (VAE), and then use that structured latent space as the space of goals which can be achieved. They also use Q-learning instead of DDPG (which is what HER used), so that they can imagine any goal with a minibatch (s, a, s’) and learn from it (whereas HER/DDPG is limited to states on the trajectory).
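To make the relabeling step concrete, here is a minimal Python sketch of hindsight goal relabeling on a toy grid world. It is an illustration under stated assumptions, not the authors' implementation: goals here are raw grid states rather than VAE latents, the "future" sampling scheme is one common choice, and the names `sparse_reward` and `relabel_with_hindsight` are hypothetical.

```python
import random
from typing import List, Optional, Tuple

State = Tuple[int, int]
Step = Tuple[State, str, State]  # (state, action, next_state)
Transition = Tuple[State, str, State, State, float]  # (..., goal, reward)


def sparse_reward(next_state: State, goal: State) -> float:
    """Sparse reward: 1 only when the goal state is reached exactly."""
    return 1.0 if next_state == goal else 0.0


def relabel_with_hindsight(
    trajectory: List[Step],
    original_goal: State,
    k: int = 2,
    rng: Optional[random.Random] = None,
) -> List[Transition]:
    """Return the original transitions plus k relabeled copies of each.

    Each relabeled copy swaps the real goal for a state that was actually
    reached at this step or later in the same trajectory (an assumption of
    this sketch), so some copies are guaranteed to carry reward.
    """
    rng = rng or random.Random(0)
    buffer: List[Transition] = []
    for t, (s, a, s_next) in enumerate(trajectory):
        # Keep the transition with its original (possibly failed) goal.
        buffer.append((s, a, s_next, original_goal,
                       sparse_reward(s_next, original_goal)))
        # Candidate imagined goals: states reached from step t onward.
        reached_later = [step[2] for step in trajectory[t:]]
        for _ in range(k):
            imagined_goal = rng.choice(reached_later)
            buffer.append((s, a, s_next, imagined_goal,
                           sparse_reward(s_next, imagined_goal)))
    return buffer


# A trajectory that never reaches the real goal (5, 5).
failed = [((0, 0), "right", (1, 0)), ((1, 0), "right", (2, 0))]
replay = relabel_with_hindsight(failed, original_goal=(5, 5))
```

Every transition labeled with the real goal has reward 0, while the relabeled copies of the final step always have reward 1, which is where the learning signal comes from.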

My opinion: This is a cool example of a relatively simple yet powerful idea—instead of having a goal space over all states, learn a good latent representation and use that as your goal space. This enables unsupervised learning in order to figure out how to use a robot to generally affect the world, probably similarly to how babies explore and learn.

Impact Measure Desiderata (TurnTrout): This post gives a long list of desiderata that we might want an impact measure to satisfy. It considers the case where the impact measure is a second level of safety that is supposed to protect us if we don’t succeed at value alignment. This means that we want our impact measure to be agnostic to human values. We’d also like it to be agnostic to goals, environments, and representations of the environment. There are several other desiderata; read the post for more details, since my summary would just be repeating it.

My opinion: These seem like generally good desiderata, though I don’t know how to formalize them to the point that we can actually check with reasonable certainty whether a proposed impact measure meets these desiderata.

I have one additional desideratum for impact measures. The impact measure alone should disallow all extinction scenarios, while still allowing the AI system to do most of the things we use AI for today. This is rather weak; really, I’d want AI to do more tasks than are done today. However, even in this weak form, I doubt that we can satisfy this desideratum if we must also be agnostic to values, goals, representations and environments. We could have valued human superiority at game-playing very highly, in which case building AlphaGo would be catastrophic. How can an impact measure allow that without having at least some knowledge about values?

Recurrent World Models Facilitate Policy Evolution (David Ha et al): I read the interactive version of the paper. The basic idea is to do model-based reinforcement learning, where the model is composed of a variational auto-encoder that turns a high-dimensional state of pixels into a low-dimensional representation, and a large RNN that predicts how the (low-dimensional) state will evolve in the future. The outputs of this model are fed into a very simple linear controller that chooses actions. Since the controller is so simple, they can train it using a black box optimization method (an evolutionary strategy) that doesn’t require any gradient information. They evaluate on a racing task and on Doom, and set new state-of-the-art results. There are also other interesting setups—for example, once you have a world model, you can train the controller completely within the world model without interacting with the outside world at all (using the number of timesteps before the episode ends as your reward function, since the world model doesn’t predict standard rewards, but does predict whether the episode ends). There are a lot of cool visualizations that let you play with the models trained with their method.
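The point that a very simple controller can be trained without gradients can be illustrated with a small Python sketch. Everything here is an assumption made for illustration: the summary only says "an evolutionary strategy", so this uses a plain (1+λ) scheme; the feature vectors stand in for the world model's outputs; and `toy_fitness` replaces the real episode return with a squared-error score on made-up data.

```python
import random
from typing import Callable, List


def linear_controller(params: List[float], features: List[float]) -> float:
    """Single-output linear controller: action = w . features + b."""
    *w, b = params
    return sum(wi * fi for wi, fi in zip(w, features)) + b


def evolve(
    fitness: Callable[[List[float]], float],
    n_params: int,
    generations: int = 200,
    population: int = 20,
    sigma: float = 0.1,
    seed: int = 0,
) -> List[float]:
    """(1+lambda) evolution strategy: mutate the parent with Gaussian noise
    and keep the best offspring only if it beats the parent.

    Only fitness values are used, never gradients.
    """
    rng = random.Random(seed)
    parent = [0.0] * n_params
    parent_fit = fitness(parent)
    for _ in range(generations):
        offspring = [
            [p + sigma * rng.gauss(0.0, 1.0) for p in parent]
            for _ in range(population)
        ]
        champion = max(offspring, key=fitness)
        champion_fit = fitness(champion)
        if champion_fit > parent_fit:
            parent, parent_fit = champion, champion_fit
    return parent


# Hypothetical (features, desired action) pairs standing in for rollouts.
DATA = [([1.0, 0.0], 1.0), ([0.0, 1.0], -1.0), ([1.0, 1.0], 0.0)]


def toy_fitness(params: List[float]) -> float:
    """Higher is better: negative squared error of the controller on DATA."""
    return -sum((linear_controller(params, x) - y) ** 2 for x, y in DATA)


best_params = evolve(toy_fitness, n_params=3)
```

In the setting described above, the fitness function would instead run the controller in the environment (or inside the learned world model) and return the episode score.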

My opinion: I agree with Shimon Whiteson’s take, which is that this method gets improvements by creating a separation of concerns between modelling the world and learning a controller for the model, and evaluating on environments where this separation mostly holds. A major challenge in RL is learning the features that are important for the task under consideration, and this method instead learns features that allow you to reconstruct the state, which could be very different, but happen to not be different in their environments. That said, I really like the presentation of the paper and the fact that they did ablation studies.

Previous newsletters

Model Reconstruction from Model Explanations (Smitha Milli et al): Back in AN #16, I said that one way to prevent model reconstruction from gradient-based explanations was to add noise to the gradients. Smitha pointed out that the experiments with SmoothGrad are actually of this form, and it is still possible to recover the full model, so even adding noise may not help. I don’t really understand SmoothGrad and its relationship with noise (which is chosen to make a saliency map look nice, if I understand correctly), so I don’t know exactly what to think here.

Technical AI alignment

Agent foundations

When wishful thinking works (Alex Mennen): Sometimes beliefs can be loopy, in that the probability of a belief being true depends on whether you believe it. For example, the probability that a placebo helps you may depend on whether you believe that a placebo helps you. In the situation where you know this, you can “wish” your beliefs to be the most useful possible beliefs. In the case where the “true probability” depends continuously on your beliefs, you can use a fixed point theorem to find a consistent set of probabilities. There may be many such fixed points, in which case you can choose the one that would lead to the highest expected utility (such as choosing to believe in the placebo). One particular application of this would be to think of the propositions as “you will take action a_i”. In this case, you act the way you believe you act, and then every probability distribution over the propositions is a fixed point, and so we just choose the probability distribution (i.e. stochastic policy) that maximizes expected utility, as usual. This analysis can also be carried over to Nash equilibria, where beliefs about what actions you take will affect the actions that the other player takes.
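The placebo example can be sketched numerically in Python. This is an illustration under assumptions, not the post's construction: `placebo_success` is a made-up continuous map from your belief p to the true probability that the placebo helps, `expected_utility` is a hypothetical utility (one unit of benefit when the placebo works), and fixed points are found by a simple grid scan plus bisection.

```python
from typing import Callable, List


def find_fixed_points(f: Callable[[float], float], grid: int = 100) -> List[float]:
    """Find p in [0, 1] with f(p) = p by scanning g(p) = f(p) - p for sign
    changes on a uniform grid and refining each bracket by bisection.

    Fixed points that touch zero without a sign change can be missed.
    """
    def g(p: float) -> float:
        return f(p) - p

    roots: List[float] = []
    for i in range(grid):
        lo, hi = i / grid, (i + 1) / grid
        g_lo, g_hi = g(lo), g(hi)
        if g_lo == 0.0:
            roots.append(lo)
        elif g_lo * g_hi < 0.0:
            for _ in range(60):
                mid = (lo + hi) / 2
                if g(lo) * g(mid) <= 0.0:
                    hi = mid
                else:
                    lo = mid
            roots.append((lo + hi) / 2)
    if g(1.0) == 0.0:
        roots.append(1.0)
    return roots


def placebo_success(belief: float) -> float:
    """Hypothetical: the more you believe in the placebo, the likelier it works."""
    return 0.04 + 0.9 * (3 * belief**2 - 2 * belief**3)


def expected_utility(p: float) -> float:
    """Hypothetical utility at a consistent belief p: benefit 1 if it works."""
    return p * 1.0


consistent_beliefs = find_fixed_points(placebo_success)
best_belief = max(consistent_beliefs, key=expected_utility)
```

With this particular map there are three consistent beliefs (a pessimistic one, a middle one, and an optimistic one), and the "wishful" choice is the optimistic one because it has the highest expected utility.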

Counterfactuals and reflective oracles (Nisan)

Learning human intent

Cycle-of-Learning for Autonomous Systems from Human Interaction (Nicholas R. Waytowich et al): We’ve developed many techniques for learning behaviors from humans in the last few years. This paper categorizes them as learning from demonstrations (think imitation learning and IRL), learning from intervention (think Safe RL via Human Intervention), and learning from evaluation (think Deep RL from Human Preferences). They propose running these techniques in sequence, followed by pure RL, to train a full system. Intuitively, demonstrations are used to jumpstart the learning, getting to near-human performance. Intervention-based and evaluation-based learning then allow the system to safely improve beyond human level, since it can learn behaviors that humans can’t perform themselves but can recognize as good. Finally, RL is used to improve even more.

My opinion: The general idea makes sense, but I wish they had actually implemented it and seen how it worked. (They do want to test in robotics in future work.) For example, they talk about inferring a reward with IRL from demonstrations, and then updating it during the intervention and evaluation stages. How are they planning to update it? Does the format of the reward function have to be the same in all stages, and will that affect how well each method works?

This feels like a single point in the space of possible designs, and doesn’t include all of the techniques I’d be interested in. What about active methods, combined with exploration methods in RL? Perhaps you could start with a hand-specified reward function, get a prior using inverse reward design, start optimizing it using RL with curiosity, and have a human either intervene when necessary (if you want safe exploration) or have the RL system actively query the human at certain states, where the human can respond with demonstrations or evaluations.

Sample-Efficient Imitation Learning via Generative Adversarial Nets (Lionel Blondé et al)

A Roadmap for the Value-Loading Problem (Lê Nguyên Hoang)

Preventing bad behavior

Impact Measure Desiderata (TurnTrout): Summarized in the highlights!

Handling groups of agents

Reinforcement Learning under Threats (Víctor Gallego et al): Due to lack of time, I only skimmed this paper for 5 minutes, but my general sense is that it takes MDPs and turns them into two player games by positing the presence of an adversary. It modifies the Bellman update equations to handle the adversary, but runs into the usual problems of simulating an adversary that simulates you. So, it formalizes level-k thinking (simulating an opponent that thinks about you at level k-1), and evaluates this on matrix games and the friend-or-foe environment from AI safety gridworlds.
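Level-k thinking on a matrix game can be sketched in a few lines of Python. This is a generic illustration under assumptions, not the paper's algorithm: level 0 is taken to play uniformly at random, higher levels play a deterministic best response, and the payoff matrices `ROW` and `COL` are a made-up zero-sum game.

```python
from typing import List

Matrix = List[List[float]]  # payoff[own action][other player's action]


def best_response(payoff: Matrix, opponent_policy: List[float]) -> List[float]:
    """Deterministic best response to a mixed opponent policy.

    Ties are broken in favor of the lowest action index.
    """
    values = [
        sum(prob * u for prob, u in zip(opponent_policy, row)) for row in payoff
    ]
    best = max(range(len(values)), key=values.__getitem__)
    return [1.0 if i == best else 0.0 for i in range(len(values))]


def level_k_policy(my_payoff: Matrix, opp_payoff: Matrix, k: int) -> List[float]:
    """Level 0 plays uniformly; level k best-responds to a level-(k-1) opponent."""
    if k == 0:
        n = len(my_payoff)
        return [1.0 / n] * n
    opponent = level_k_policy(opp_payoff, my_payoff, k - 1)
    return best_response(my_payoff, opponent)


# Hypothetical zero-sum game: the row player wants to match, the column
# player wants to mismatch. Each matrix is indexed [own action][other's action].
ROW = [[2.0, -1.0], [-1.0, 1.0]]
COL = [[-2.0, 1.0], [1.0, -1.0]]

row_policies = [level_k_policy(ROW, COL, k) for k in range(1, 5)]
```

In this game the row player's chosen action goes 0, 1, 1, 0 for levels 1 through 4 and keeps cycling at higher levels, which is a small instance of the "simulating an adversary that simulates you" problem: the answer depends on where you cut off the recursion.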

My opinion: I’m not sure what this is adding over two-player game theory (for which we can compute equilibria), but again, I only skimmed the paper, so it’s quite likely that I missed something.

Near-term concerns

Adversarial examples

Adversarial Reprogramming of Sequence Classification Neural Networks (Paarth Neekhara et al)

Fairness and bias

Introducing the Inclusive Images Competition (Tulsee Doshi): The authors write, “this competition challenges you to use Open Images, a large, multilabel, publicly-available image classification dataset that is majority-sampled from North America and Europe, to train a model that will be evaluated on images collected from a different set of geographic regions across the globe”. The results will be presented at NIPS 2018 in December.

My opinion: I’m really interested in the techniques and results here, since there’s a clear, sharp distribution shift from the training set to the test set, which is always hard to deal with. Hopefully some of the entries will have general solutions which we can adapt to other settings.

AI strategy and policy

Podcast: Artificial Intelligence – Global Governance, National Policy, and Public Trust with Allan Dafoe and Jessica Cussins (Allan Dafoe, Jessica Cussins, and Ariel Conn): Topics discussed include the difference between AI governance and AI policy, externalities and solving them through regulation, whether governments and bureaucracies can keep up with AI research, the extent to which the US’ policy of not regulating AI may cause citizens to lose trust, labor displacement and inequality, and AI races.

Other progress in AI

Reinforcement learning

Visual Reinforcement Learning with Imagined Goals (Vitchyr Pong and Ashvin Nair): Summarized in the highlights!

Recurrent World Models Facilitate Policy Evolution (David Ha et al): Summarized in the highlights!

ARCHER: Aggressive Rewards to Counter bias in Hindsight Experience Replay (Sameera Lanka et al)

SOLAR: Deep Structured Latent Representations for Model-Based Reinforcement Learning (Marvin Zhang, Sharad Vikram et al)

ExpIt-OOS: Towards Learning from Planning in Imperfect Information Games (Andy Kitchen et al)

Miscellaneous (AI)

Making it easier to discover datasets (Natasha Noy): Google has launched Dataset Search, a tool that lets you search for datasets that you could then use in research.

My opinion: I imagine that this is primarily targeted at data scientists aiming to learn about the real world, and not ML researchers, but I wouldn’t be surprised if it was helpful for us as well. MNIST and ImageNet are both present, and a search for “self-driving cars” turned up some promising-looking links that I didn’t investigate further.