[AN #78] Formalizing power and instrumental convergence, and the end-of-year AI safety charity comparison

Find all Alignment Newsletter resources here. In particular, you can sign up, or look through this spreadsheet of all summaries that have ever been in the newsletter. I’m always happy to hear feedback; you can send it to me by replying to this email.

Merry Christmas!

Audio version here (may not be up yet).

Highlights

2019 AI Alignment Literature Review and Charity Comparison (Larks) (summarized by Rohin): As in three previous years (AN #38), this mammoth post goes through the work done within AI alignment from December 2018 - November 2019, from the perspective of someone trying to decide which of several AI alignment organizations to donate to. As part of this endeavor, Larks summarizes several papers that were published at various organizations, and compares this output to each organization’s budget and room for more funding.

Rohin’s opinion: I look forward to this post every year. This year, it’s been a stark demonstration of how much work doesn’t get covered in this newsletter: while I tend to focus on the technical alignment problem, with some focus on AI governance and AI capabilities, Larks’s literature review spans many organizations working on existential risk, and as such has many papers that were never covered in this newsletter. Anyone who wants to donate to an organization working on AI alignment and/or x-risk should read this post. However, if your goal is instead to figure out what the field has been up to for the last year, for the sake of building inside-view models of what’s happening in AI alignment, I might soon write up such an overview myself, but no promises.

Seeking Power is Provably Instrumentally Convergent in MDPs (Alex Turner et al) (summarized by Rohin): The Basic AI Drives argues that it is instrumentally convergent for an agent to collect resources and gain power. This post and associated paper aim to formalize this argument. Informally, an action is instrumentally convergent if it is helpful for many goals, or equivalently, an action is instrumentally convergent to the extent that we expect an agent to take it, if we do not know what the agent’s goal is. Similarly, a state has high power if it is easier to achieve a wide variety of goals from that state.

A natural formalization is to assume we have a distribution over the agent’s goal, and define power and instrumental convergence relative to this distribution. We can then define power as the expected value that can be obtained from a state (modulo some technical caveats), and instrumental convergence as the probability that an action is optimal, from our perspective of uncertainty: of course, the agent knows its own goal, and acts optimally in pursuit of that goal.

You might think that optimal agents would provably seek out states with high power. However, this is not true. Consider a decision faced by high school students: should they take a gap year, or go directly to college? Let’s assume college is necessary for (100-ε)% of careers, but if you take a gap year, you could focus on the other ε% of careers or decide to go to college after the year. Then in the limit of farsightedness, taking a gap year leads to a more powerful state, since you can still achieve all of the careers, albeit slightly less efficiently for the college careers. However, if you know which career you want, then it is (100-ε)% likely that you go to college, so going to college is very strongly instrumentally convergent even though taking a gap year leads to a more powerful state.
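To make the gap between power and instrumental convergence concrete, here is a minimal Monte Carlo sketch of the gap year example. Everything specific in it is an assumption made for illustration rather than something taken from the paper: ε is set to 10 (nine "college careers" and one other career), goals are i.i.d. Uniform[0, 1] rewards on the career states, careers are absorbing, and power is taken to be the expected optimal value rescaled by (1 - γ), ignoring the technical caveats mentioned above.

```python
import random


def gap_year_experiment(n_college=9, n_other=1, gamma=0.999,
                        n_goals=50_000, seed=0):
    """Monte Carlo estimate of power and optimality probability.

    Toy MDP (hypothetical): start -> {college, gap year}; college leads to any
    "college career"; a gap year leads to any "other career" or, one step
    later, to college. Careers are absorbing and are the only rewarded states.
    """
    rng = random.Random(seed)
    total_college = total_gap = 0.0
    college_is_optimal = 0
    for _ in range(n_goals):
        # A "goal" is one i.i.d. Uniform[0, 1] reward per career.
        best_college = max(rng.random() for _ in range(n_college))
        best_other = max(rng.random() for _ in range(n_other))
        # Staying in a career with reward r forever is worth r / (1 - gamma);
        # reaching it takes one step from "college" or "gap year".
        v_college = gamma * best_college / (1 - gamma)
        v_gap = gamma * max(best_other, gamma * best_college) / (1 - gamma)
        total_college += v_college
        total_gap += v_gap
        # Going straight to college is optimal when it is at least as good.
        college_is_optimal += v_college >= v_gap
    scale = (1 - gamma) / n_goals  # rescale so power is on the reward scale
    return {
        "power_college": scale * total_college,
        "power_gap": scale * total_gap,
        "p_college_optimal": college_is_optimal / n_goals,
    }


if __name__ == "__main__":
    print(gap_year_experiment())
```

Under these assumptions, the gap year state should come out slightly more powerful than the college state, while going directly to college is optimal for roughly 90% of sampled goals. With a noticeably smaller γ, the one-step delay costs more than the extra option is worth, and the power ordering can flip.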

Nonetheless, there are things we can prove. In environments where the only cycles are states with a single action leading back to the same state, and apart from that every action leads to a new state, and many states have more than one action, farsighted agents are more likely to choose trajectories that spend more time navigating to a cycle before spending the rest of the time in the cycle. For example, in Tic-Tac-Toe where the opponent is playing optimally according to the normal win condition, but the agent’s reward for each state is drawn independently from some distribution on [0, 1], the agent is much more likely to play out to a long game where the entire board is filled. This is because the number of states that can be reached grows exponentially in the horizon, and so agents have more control by taking longer trajectories. Equivalently, the cycle with maximal reward is much more likely to be at the end of a longer trajectory, and so the optimal possibility is more likely to be a long trajectory.
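The counting argument behind this can be sketched in a few lines. The tree below is a hypothetical stand-in, not Tic-Tac-Toe: at every non-final depth, one action ends the game in an absorbing state and the remaining actions continue, and at the horizon every action ends the game. The sketch also assumes a fully farsighted agent that only cares about the reward of the absorbing state it ends in, with i.i.d. continuous rewards, so that each absorbing state is equally likely to be the best one.

```python
from fractions import Fraction


def terminal_counts(branching=3, max_depth=5):
    """Count absorbing states by depth in the toy tree described above."""
    counts = {}
    open_states = 1  # non-absorbing states at the current depth (the root)
    for depth in range(1, max_depth + 1):
        if depth < max_depth:
            # One action per open state ends the game at this depth...
            counts[depth] = open_states
            # ...and the other actions keep the game going.
            open_states *= branching - 1
        else:
            # At the horizon, every action leads to an absorbing state.
            counts[depth] = open_states * branching
    return counts


def prob_optimal_ends_at_depth(branching=3, max_depth=5):
    """Probability that the best absorbing state sits at each depth.

    With i.i.d. continuous rewards every absorbing state is equally likely
    to have the maximal reward, so the probability is just the share of
    absorbing states found at that depth.
    """
    counts = terminal_counts(branching, max_depth)
    total = sum(counts.values())
    return {depth: Fraction(count, total) for depth, count in counts.items()}


if __name__ == "__main__":
    for depth, prob in prob_optimal_ends_at_depth().items():
        print(depth, prob, float(prob))
```

Because the number of absorbing states grows geometrically with depth, most of the probability mass lands on the longest trajectories in this toy tree.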

Rohin’s opinion: I like the formalizations of power and instrumental convergence. I think in practice there will be a lot of complexity in a) the reward distribution that power and instrumental convergence are defined relative to, b) the structure of the environment, and c) how powerful AI systems actually work (since they won’t be perfectly optimal, and won’t know the environment structure ahead of time). Nonetheless, results with specific classes of reward distributions, environment structures, and agent models can still provide useful intuition.

Read more: Clarifying Power-Seeking and Instrumental Convergence, Paper: Optimal Farsighted Agents Tend to Seek Power

Technical AI alignment

Technical agendas and prioritization

A dilemma for prosaic AI alignment (Daniel Kokotajlo) (summarized by Rohin): This post points out a potential problem for Prosaic AI alignment (AN #34), in which we try to align AI systems built using current techniques. Consider some prosaic alignment scheme, such as iterated amplification (AN #30) or debate (AN #5). If we try to train an AI system directly using such a scheme, it will likely be uncompetitive, since the most powerful AI systems will probably require cutting-edge algorithms, architectures, objectives, and environments, at least some of which will be replaced by new versions from the safety scheme. Alternatively, we could first train a general AI system, and then use our alignment scheme to finetune it into an aligned AI system. However, this runs the risk that the initial training could create a misaligned mesa optimizer that then deliberately sabotages our finetuning efforts.

Rohin’s opinion: The comments reveal a third possibility: the alignment scheme could be trained jointly alongside the cutting edge AI training. For example, we might hope that we can train a question answerer that can answer questions about anything “the model already knows”, and this question answering system is trained simultaneously with the training of the model itself. I think this takes the “oomph” out of the dilemma as posed here—it seems reasonably likely that it only takes fractionally more resources to train a question answering system on top of the model, if it only has to use knowledge “already in” the model, which would let it be competitive, while still preventing mesa optimizers from arising (if the alignment scheme does its job). Of course, it may turn out that it takes a huge amount of resources to train the question answering system, making the system uncompetitive, but that seems hard to predict given our current knowledge.

Technical AGI safety research outside AI (Richard Ngo) (summarized by Rohin): This post lists 30 questions relevant to technical AI safety that could benefit from expertise outside of AI, divided into four categories: studying and understanding safety problems, solving safety problems, forecasting AI, and meta.

Mesa optimization

Is the term mesa optimizer too narrow? (Matthew Barnett) (summarized by Rohin): The mesa optimization (AN #58) paper defined an optimizer as a system that internally searches through a search space for elements that score high according to some explicit objective function. However, humans would not qualify as mesa optimizers by this definition, since there (presumably) isn’t some part of the brain that explicitly encodes some objective function that we then try to maximize. In addition, there are inner alignment failures that don’t involve mesa optimization: a small feedforward neural net doesn’t do any explicit search; yet when it is trained in the chest and keys environment (AN #67), it learns a policy that goes to the nearest key, which is equivalent to a key-maximizer. Rather than talking about “mesa optimizers”, the post recommends that we instead talk about “malign generalization”, to refer to the problem when capabilities generalize but the objective doesn’t (AN #66).

Rohin’s opinion: I strongly agree with this post (though note that the post was written right after a conversation with me on the topic, so this isn’t independent evidence). I find it very unlikely that most powerful AI systems will be optimizers as defined in the original paper, but I do think that the malign generalization problem will apply to our AI systems. For this reason, I hope that future research doesn’t specialize to the case of explicit-search-based agents.

Learning human intent

Positive-Unlabeled Reward Learning (Danfei Xu et al) (summarized by Zach): The problem with learning a reward model and training an agent on the (now fixed) model is that the agent can learn to exploit errors in the reward model. Adversarial imitation learning seeks to avoid this by training a discriminator reward model with the agent: the discriminator is trained via supervised learning to distinguish between expert trajectories and agent trajectories, while the agent tries to fool the discriminator. However, this effectively treats the agent trajectories as negative examples — even once the agent has mastered the task. What we would really like to do is to treat the agent trajectories as unlabeled data. This is an instance of semi-supervised learning, in which a classifier has access to a small set of labeled data and a much larger collection of unlabeled data. In general, the common approach is to propagate classification information learned using labels to the unlabeled dataset. The authors apply a recent algorithm for positive-unlabeled (PU) learning, and show that this approach can improve upon both GAIL and supervised reward learning.
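The summary does not say which PU algorithm the authors use, so the sketch below is an assumption: it uses the non-negative PU risk estimator of Kiryo et al. (2017), a commonly used choice, with expert transitions as positives and agent transitions as unlabeled data. The function names, the logistic loss, and the `positive_prior` parameter (the assumed fraction of unlabeled data that is actually positive) are illustrative, not taken from the paper.

```python
import math


def logistic_loss(score, label):
    """log(1 + exp(-label * score)), computed in a numerically stable way."""
    z = -label * score
    return max(z, 0.0) + math.log1p(math.exp(-abs(z)))


def mean(values):
    return sum(values) / len(values)


def pn_risk(expert_scores, agent_scores):
    """Adversarial-imitation-style risk: every agent sample is a negative."""
    return (mean([logistic_loss(s, +1) for s in expert_scores])
            + mean([logistic_loss(s, -1) for s in agent_scores]))


def nn_pu_risk(expert_scores, agent_scores, positive_prior):
    """Non-negative PU risk: agent samples are unlabeled, not negative."""
    risk_pos_as_pos = mean([logistic_loss(s, +1) for s in expert_scores])
    risk_pos_as_neg = mean([logistic_loss(s, -1) for s in expert_scores])
    risk_unl_as_neg = mean([logistic_loss(s, -1) for s in agent_scores])
    # Estimated risk on the (unobserved) negatives; clipped at zero so the
    # estimator cannot be driven negative by an overfitting discriminator.
    negative_part = risk_unl_as_neg - positive_prior * risk_pos_as_neg
    return positive_prior * risk_pos_as_pos + max(0.0, negative_part)


if __name__ == "__main__":
    # Discriminator scores when the agent already behaves like the expert.
    expert, agent = [3.0, 3.0], [3.0, 3.0]
    print("PN risk:", pn_risk(expert, agent))
    print("nnPU risk:", nn_pu_risk(expert, agent, positive_prior=0.9))
```

In the usage example, the discriminator gives agent samples the same high scores as expert samples. The positive-negative risk penalizes this heavily, while the PU risk penalizes it much less, which matches the motivation above: an agent that has mastered the task should not be treated as a source of negative examples.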

Zach’s opinion: I liked this paper because it offers a novel solution to a common concern with the adversarial approach. Namely, GAN approaches often train discriminators that overpower the generator, leading to mode collapse. In the RL setting, it seems natural to leave agent-generated trajectories unlabeled since we don’t have any sort of ground truth for whether or not agent trajectories are successful. For example, it might be possible to perform a task in a way that’s different from what is shown in the demonstrations. In this case, it makes sense to try and propagate feedback to the larger unlabeled agent trajectory data set indirectly. Presumably, this wasn’t previously possible because positive-unlabeled learning has only recently been generalized to the deep learning setting. After reading this paper, my broad takeaway is that semi-supervised methods are starting to reach the point where they have potential to further progress in imitation learning.

Miscellaneous (Alignment)

What are some non-purely-sampling ways to do deep RL? (Evan Hubinger) (summarized by Matthew): A deep reinforcement learning agent trained by reward samples alone may predictably lead to a proxy alignment issue: the learner could fail to develop a full understanding of what behavior it is being rewarded for, and thus behave unacceptably when it is taken off its training distribution. Since we often use explicit specifications to define our reward functions, Evan Hubinger asks how we can incorporate this information into our deep learning models so that they remain aligned off the training distribution. He names several possibilities for doing so, such as giving the deep learning model access to a differentiable copy of the reward function during training, and fine-tuning a language model so that it can map natural language descriptions of a reward function into optimal actions.

Matthew’s opinion: I’m unsure, though leaning skeptical, whether incorporating a copy of the reward function into a deep learning model would help it learn. My guess is that if someone did that with a current model it would make the model harder to train, rather than making anything easier. I will be excited if someone can demonstrate at least one feasible approach to addressing proxy alignment that does more than sample the reward function.

Rohin’s opinion: I’m skeptical of this approach. Mostly this is because I’m generally skeptical that an intelligent agent will consist of a separate “planning” part and “reward” part. However, if that were true, then I’d think that this approach could plausibly give us some additional alignment, but can’t solve the entire problem of inner alignment. Specifically, the reward function encodes a huge amount of information: it specifies the optimal behavior in all possible situations you could be in. The “intelligent” part of the net is only ever going to get a subset of this information from the reward function, and so its plans can never be perfectly optimized for that reward function, but instead could be compatible with any reward function that would provide the same information on the “queries” that the intelligent part has produced.

For a slightly-more-concrete example, for any “normal” utility function U, there is a utility function U’ that is “like U, but also the best outcomes are ones in which you hack the memory so that the ‘reward’ variable is set to infinity”. To me, wireheading is possible because the “intelligent” part doesn’t get enough information about U to distinguish U from U’, and so its plans could very well be optimized for U’ instead of U.
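The indistinguishability point can be illustrated with a toy sketch. The state representation, the `reward_register_hacked` flag, and the particular utility function are all hypothetical; the only thing the sketch is meant to show is that a finite set of reward queries can be consistent with both U and U’.

```python
def u(state):
    """A "normal" utility: prefer positions close to 10 (hypothetical task)."""
    return -abs(state["position"] - 10)


def u_prime(state):
    """Like u, except that hacking the reward register is the best outcome."""
    if state.get("reward_register_hacked", False):
        return float("inf")
    return u(state)


def consistent_with_queries(candidate, queries, answers):
    """True if the candidate utility reproduces every observed answer."""
    return all(candidate(q) == a for q, a in zip(queries, answers))


# The "intelligent" part only ever asks about ordinary states...
queries = [{"position": p} for p in range(21)]
# ...and receives answers computed from the true utility u.
answers = [u(q) for q in queries]

if __name__ == "__main__":
    print(consistent_with_queries(u, queries, answers))
    print(consistent_with_queries(u_prime, queries, answers))
    print(u_prime({"position": 0, "reward_register_hacked": True}))
```

Both utilities pass the consistency check, so nothing in the query answers rules out plans that are optimized for U’ rather than U.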

Other progress in AI

Reinforcement learning

Model-Based Reinforcement Learning: Theory and Practice (Michael Janner et al) (summarized by Rohin): This post provides a broad overview of model-based reinforcement learning, and argues that a learned (explicit) model allows you to generate sample trajectories from the current policy at arbitrary states, correcting for off-policy error, at the cost of introducing model bias. Since model errors compound as you sample longer and longer trajectories, the authors propose an algorithm in which the model is used to sample short trajectories from states in the replay buffer, rather than sampling trajectories from the initial state (which are as long as the task’s horizon).

Read more: Paper: When to Trust Your Model: Model-Based Policy Optimization
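The rollout scheme described in the summary can be sketched on a toy problem. This is not the authors’ algorithm or code: the one-dimensional dynamics, the deliberately biased model (a factor of 1.02 where the true dynamics have 1.0), the constant policy, and all function names are assumptions chosen so that the compounding of model error is easy to see.

```python
import random


def true_step(state, action):
    """True environment dynamics (hypothetical)."""
    return state + action


def model_step(state, action):
    """Learned model with a small, deliberate bias (hypothetical)."""
    return 1.02 * state + action


def policy(state):
    """Fixed toy policy: always push in the same direction."""
    return 0.1


def rollout(step_fn, start_state, horizon):
    """Return the list of states visited, including the start state."""
    states = [start_state]
    for _ in range(horizon):
        current = states[-1]
        states.append(step_fn(current, policy(current)))
    return states


def model_error(start_state, horizon):
    """Gap between real and imagined final states after `horizon` steps."""
    real = rollout(true_step, start_state, horizon)
    imagined = rollout(model_step, start_state, horizon)
    return abs(real[-1] - imagined[-1])


def branched_rollouts(replay_buffer, rollout_length, n_rollouts, rng):
    """Short model rollouts that branch off states seen in the real world."""
    return [rollout(model_step, rng.choice(replay_buffer), rollout_length)
            for _ in range(n_rollouts)]


if __name__ == "__main__":
    buffer = rollout(true_step, 1.0, 50)  # states from real experience
    print("error of a 50-step model rollout:", model_error(1.0, 50))
    print("worst error of 5-step branched rollouts:",
          max(model_error(s, 5) for s in buffer))
    print(len(branched_rollouts(buffer, 5, 10, random.Random(0))))
```

In this toy setup the error of a full-horizon model rollout from the initial state is several times larger than the worst error of any five-step rollout started from a state in the replay buffer, which is the tradeoff the short-rollout scheme is exploiting.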

Deep learning

Inductive biases stick around (Evan Hubinger) (summarized by Rohin): This update to Evan’s double descent post (AN #77) explains why he thinks double descent is important. Specifically, Evan argues that it shows that inductive biases matter even for large, deep models. In particular, double descent shows that larger models are simpler than smaller models, at least in the overparameterized setting, where models are past the interpolation threshold and can get approximately zero training error. This makes the case for mesa optimization (AN #58) stronger, since mesa optimizers are simple, compressed policies.

Rohin’s opinion: As you might have gathered last week, I’m not sold on double descent as a clear, always-present phenomenon, though it certainly is a real effect that occurs in at least some situations. So I tend not to believe counterintuitive conclusions like “larger models are simpler” that are premised on double descent.

Regardless, I expect that powerful AI systems are going to be severely underparameterized, and so I don’t think it really matters that past the interpolation threshold larger models are simpler. I don’t think the case for mesa optimization should depend on this; humans are certainly “underparameterized”, but should count as mesa optimizers.

The Quiet Semi-Supervised Revolution (Vincent Vanhoucke) (summarized by Flo): Historically, semi-supervised learning that uses small amounts of labelled data combined with a lot of unlabeled data only helped when there was very little labelled data available. In this regime, both supervised and semi-supervised learning were too inaccurate to be useful. Furthermore, approaches like using a representation learnt by an autoencoder for classification empirically limited asymptotic performance. This is strange because using more data should not lead to worse performance.

Recent trends suggest that this might change soon: semi-supervised systems have begun to outperform supervised systems by larger and larger margins in the low data regime and their advantage now extends into regimes with more and more data. An important driver of this trend is the idea of using data augmentation for more consistent self-labelling.
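The summary does not name a specific method, so the following is only a generic sketch of what "data augmentation for more consistent self-labelling" can look like: the model labels an unlabeled input itself, and is then penalized if its prediction on an augmented copy disagrees with that label. The confidence threshold, the toy two-class model, and the function names are assumptions for illustration.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    top = max(logits)
    exps = [math.exp(value - top) for value in logits]
    total = sum(exps)
    return [value / total for value in exps]


def consistency_loss(model, x, augment, confidence_threshold=0.9):
    """Self-labelling loss for one unlabeled input.

    `model` maps an input to a list of class logits and `augment` maps an
    input to a perturbed copy; both are supplied by the caller.
    """
    probs = softmax(model(x))
    confidence = max(probs)
    if confidence < confidence_threshold:
        return 0.0  # skip inputs the model cannot label confidently
    pseudo_label = probs.index(confidence)
    # Cross-entropy between the pseudo-label and the augmented prediction.
    probs_augmented = softmax(model(augment(x)))
    return -math.log(probs_augmented[pseudo_label])


if __name__ == "__main__":
    def toy_model(x):
        return [3.0 * x, -3.0 * x]  # hypothetical two-class "classifier"

    print(consistency_loss(toy_model, 1.0, lambda x: x - 0.5))  # mild change
    print(consistency_loss(toy_model, 1.0, lambda x: -x))       # label flips
```

In the usage example, a mild augmentation that preserves the predicted class gives a small loss, while an augmentation that flips the prediction gives a large one. Minimizing this loss therefore pushes the model toward predictions that are stable under the chosen augmentations.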

Better semi-supervised learning might, for example, be useful for federated learning, which attempts to respect privacy by learning locally on (labelled) user data and sending the models trained by different users to be combined in a central server. One problem with this approach is that the central model might memorize some of the private models’ idiosyncrasies such that inference about the private labels is possible. Semi-supervised learning makes this harder by reducing the amount of influence private data has on the aggregate model.

Flo’s opinion: Because the way humans classify things is strongly influenced by our priors about how classes “should” behave, learning with limited data most likely requires some information about these priors. Semi-supervised learning that respects that data augmentation does not change the correct classification might be an efficient and scalable way to force some of these priors onto a model. Thus it seems likely that more diverse and sophisticated data augmentation could lead to further improvements in the near term. On the other hand, it seems like a lot of our priors would be very hard to capture only using automatic data augmentation, such that other methods to transfer our priors are still important.