[AN #157]: Measuring misalignment in the technology underlying Copilot

Alignment Newsletter is a weekly publication with recent content relevant to AI alignment around the world. Find all Alignment Newsletter resources here. In particular, you can look through this spreadsheet of all summaries that have ever been in the newsletter.

Audio version here (may not be up yet).

Please note that while I work at DeepMind, this newsletter represents my personal views and not those of my employer.

HIGHLIGHTS

Evaluating Large Language Models Trained on Code (Mark Chen, Jerry Tworek, Heewoo Jun, Qiming Yuan, Henrique Ponde, Jared Kaplan et al) (summarized by Rohin): You’ve probably heard of GitHub Copilot, the programming assistant tool that can provide suggestions while you are writing code. This paper evaluates Codex, a precursor to the model underlying Copilot. There’s a lot of content here; I’m only summarizing what I see as the highlights.

The core ingredient for Codex was the many, many public repositories on GitHub, which provided hundreds of millions of lines of training data. With such a large dataset, the authors were able to get good performance by training a model completely from scratch, though in practice they finetuned an existing pretrained GPT model as it converged faster while providing similar performance.

Their primary tool for evaluation is HumanEval, a collection of 164 hand-constructed Python programming problems where the model is provided with a docstring explaining what the program should do along with some unit tests, and the model must produce a correct implementation of the described function. Problems are not all equally difficult; an easier problem asks Codex to “increment all numbers in a list by 1” while a harder one provides a function that encodes a string of text using a transposition cipher and asks Codex to write the corresponding decryption function.

To improve performance even further, they collect a sanitized finetuning dataset of problems formatted similarly to those in HumanEval and train Codex to perform well on such problems. These models are called Codex-S. With this, we see the following results:

1. Pretrained GPT models get roughly 0%.

2. The largest 12B Codex-S model succeeds on the first try 29% of the time. (A Codex model of the same size only gets roughly 22%.)

3. There is a consistent scaling law for reduction in loss. This translates into a less consistent graph for performance on the HumanEval dataset, where once the model starts to solve at least (say) 5% of the tasks, there is a roughly linear increase in the probability of success when doubling the size of the model.

4. If instead we generate 100 samples and check whether they pass the unit tests to select the best one, then Codex-S gets 78%. If we still generate 100 samples but select the sample that has the highest mean log probability (perhaps because we don’t have an exhaustive suite of unit tests), then we get 45%.
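To make the difference between the two selection strategies in item 4 concrete, here is a minimal sketch. The candidate programs, their per-token log probabilities, and the function name `incr_list` are all made up for illustration; nothing here is taken from the paper's code or data.

```python
from typing import Callable, List, Optional, Sequence

# Hypothetical samples for "increment all numbers in a list by 1".
CANDIDATES: List[str] = [
    "def incr_list(xs):\n    return [x + 2 for x in xs]\n",  # buggy but "confident"
    "def incr_list(xs):\n    return [x + 1 for x in xs]\n",  # correct
    "def incr_list(xs):\n    return xs\n",                   # buggy
]
# Hypothetical per-token log probabilities, one list per candidate.
LOGPROBS: List[List[float]] = [
    [-0.1, -0.2, -0.1],
    [-0.3, -0.4, -0.2],
    [-1.0, -1.2],
]


def passes_unit_tests(source: str) -> bool:
    """Run a candidate and check it against a single toy unit test."""
    namespace: dict = {}
    try:
        exec(source, namespace)  # only acceptable here because the inputs are toy strings
        return namespace["incr_list"]([1, 2, 3]) == [2, 3, 4]
    except Exception:
        return False


def select_by_tests(samples: Sequence[str],
                    check: Callable[[str], bool]) -> Optional[str]:
    """Return the first sample that passes the tests, or None if none do."""
    for sample in samples:
        if check(sample):
            return sample
    return None


def select_by_mean_logprob(samples: Sequence[str],
                           logprobs: Sequence[Sequence[float]]) -> str:
    """Return the sample whose tokens have the highest mean log probability."""
    means = [sum(lp) / len(lp) for lp in logprobs]
    best_index = max(range(len(samples)), key=lambda i: means[i])
    return samples[best_index]


print(select_by_tests(CANDIDATES, passes_unit_tests))
print(select_by_mean_logprob(CANDIDATES, LOGPROBS))
```

In this toy setup the test-based selector picks the correct program, while the log-probability selector picks a buggy one that the (imaginary) model happened to be more confident about, which is the kind of gap that would separate the two numbers above.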

They also probe the model for bad behavior, including misalignment. In this context, they define misalignment as a case where the user wants A, but the model outputs B, and the model is both capable of outputting A and capable of distinguishing between cases where the user wants A and the user wants B.

Since Codex is trained primarily to predict the next token, it has likely learned that buggy code should be followed by more buggy code, that insecure code should be followed by more insecure code, and so on. This suggests that if the user accidentally provides examples with subtle bugs, then the model will continue to create buggy code, even though the user would want correct code. They find that exactly this effect occurs, and that the divergence between good and bad performance increases as the model size increases (presumably because larger models are better able to pick up on the correlation between previous buggy code and future buggy code).

Rohin’s opinion: I really liked the experiment demonstrating misalignment, as it seems like it accurately captures the aspects that we expect to see with existentially risky misaligned AI systems: they will “know” how to do the thing we want, they simply won’t be “motivated” to actually do it.

TECHNICAL AI ALIGNMENT


TECHNICAL AGENDAS AND PRIORITIZATION

Measurement, Optimization, and Take-off Speed (Jacob Steinhardt) (summarized by Sudhanshu): In this blogpost, the author argues that “trying to measure pretty much anything you can think of is a good mental move that is heavily underutilized in machine learning”. He motivates the value of measurement and additional metrics by (i) citing evidence from the history of science, policy-making, and engineering (e.g. x-ray crystallography contributed to rapid progress in molecular biology), (ii) describing how, conceptually, “measurement has several valuable properties” (one of which is to act as interlocking constraints that help to error-check theories), and (iii) providing anecdotes from his own research endeavours where such approaches have been productive and useful (see, e.g. Rethinking Bias-Variance Trade-off (AN #129)).

He demonstrates his proposal by applying it to the notion of optimization power, an important idea that has not been measured or even framed in terms of metrics. Two metrics are offered: (a) Outer Optimization, the change (typically deterioration) in performance on the original objective function when the system is trained with a perturbed objective function, and (b) Inner Adaptation, the change in performance of agents during their own lifetime (without any further parameter updates), such as the log-loss on the next sentence for a language model after it has seen X sequences at test time. Inspired by these, the article includes research questions and possible challenges.

He concludes with the insight that take-off would depend on these two continuous processes, Outer Optimization and Inner Adaptation, which work on very different time-scales, with the former being, at this time, much quicker than the latter. However, drawing an analogy from evolution, where it took billions of years of optimization to generate creatures like humans that were exceptional at rapid adaptation, we might yet see a fast take-off if Inner Adaptation turns out to be an exponential process that dominates capabilities progress. He advocates for early, sensitive measurement of this quantity as it might be an early warning sign of imminent risks.

Sudhanshu’s opinion: Early on, this post reminded me of Twenty Billion Questions; even though they are concretely different, these two pieces share a conceptual thread. They both consider the measurement of multiple quantities essential for solving their problems: 20BQ for encouraging AIs to be low-impact, and this post for productive framings of ill-defined concepts and as a heads-up about potential catastrophes.

Measurement is important, and this article poignantly argues why and illustrates how. It volunteers potential ideas that can be worked on today by mainstream ML researchers, and offers up a powerful toolkit to improve one’s own quality of analysis. It would be great to see more examples of this technique applied to other contentious, fuzzy concepts in ML and beyond. I’ll quickly note that while there seems to be minimal interest in this from academia, measurement of optimization power has been discussed earlier in several ways, e.g. Measuring Optimization Power, or the ground of optimization (AN #105).

Rohin’s opinion: I broadly agree with the perspective in this post. I feel especially optimistic about the prospects of measurement for (a) checking whether our theoretical arguments hold in practice and (b) convincing others of our positions (assuming that the arguments do hold in practice).

FORECASTING

Fractional progress estimates for AI timelines and implied resource requirements (Mark Xu et al) (summarized by Rohin): One methodology for forecasting AI timelines is to ask experts how much progress toward human-level AI they have made within their subfield over the last T years. You can then extrapolate linearly to see when 100% of the problem will be solved. The post linked above collects such estimates, with a typical estimate being that 5% of a problem was solved in the twenty-year period between 1992 and 2012. Overall these estimates imply a timeline of 372 years.

This post provides a reductio argument against this pair of methodology and estimate. The core argument is that if you linearly extrapolate, then you are effectively saying “assume that business continues as usual: then how long does it take?” But “business as usual” in the case of the last 20 years involves an increase in the amount of compute used by AI researchers by a factor of ~1000, so this effectively says that we’ll get to human-level AI after a 1000^{372/20} ≈ 10^56 increase in the amount of available compute. (The authors do a somewhat more careful calculation that breaks apart improvements in price and growth of GDP, and get 10^53.)
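The arithmetic behind the simple version of the estimate can be checked in a few lines. This sketch only reproduces the back-of-the-envelope numbers quoted in this summary (5% in 20 years, a 372-year timeline, ~1000x compute growth per 20 years); it does not attempt the authors' more careful 10^53 calculation, and the variable names are my own.

```python
import math

# Linear extrapolation from a typical expert estimate: 5% solved in 20 years.
fraction_solved = 0.05
period_years = 20
years_for_whole_problem = period_years / fraction_solved  # 400 years at that rate

# Aggregate timeline quoted in the summary, and compute growth per 20-year period.
timeline_years = 372
compute_growth_per_period = 1000

# Number of 20-year periods in the timeline, and the implied compute multiplier
# expressed as a power of ten (the multiplier itself is too large to be readable).
periods = timeline_years / period_years
log10_compute_increase = periods * math.log10(compute_growth_per_period)

print(f"single-estimate extrapolation: {years_for_whole_problem:.0f} years")
print(f"{periods:.1f} periods of 1000x growth, i.e. about 10^{log10_compute_increase:.1f}")
```

The exponent comes out just under 56, which is why the figure above is best read as an approximation rather than an exact equality.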

This is a stupendously large amount of compute: it far dwarfs the amount of compute used by evolution, and even dwarfs the maximum amount of irreversible computing we could have done with all the energy that has ever hit the Earth over its lifetime (the bound comes from Landauer’s principle).
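For a rough sense of where a Landauer-style bound comes from, here is a sketch with round physical figures that I am supplying myself (constant solar output over Earth's history, an operating temperature of 300 K, and sunlight intercepted by Earth's cross-section); these are assumptions for illustration, not the inputs used in the post. Note that the result counts irreversible bit erasures, which are not the same unit as the operations counted elsewhere in the argument.

```python
import math

BOLTZMANN_J_PER_K = 1.380649e-23   # Boltzmann constant (exact SI value)
SOLAR_CONSTANT_W_PER_M2 = 1361.0   # assumed constant over Earth's history
EARTH_RADIUS_M = 6.371e6
EARTH_AGE_YEARS = 4.5e9            # round figure
SECONDS_PER_YEAR = 3.156e7
TEMPERATURE_K = 300.0              # assumed temperature at which bits are erased


def landauer_joules_per_bit(temperature_k: float) -> float:
    """Minimum energy to erase one bit at a given temperature: k * T * ln(2)."""
    return BOLTZMANN_J_PER_K * temperature_k * math.log(2)


def sunlight_energy_joules() -> float:
    """Total solar energy intercepted by Earth's cross-section over its lifetime."""
    cross_section_m2 = math.pi * EARTH_RADIUS_M ** 2
    power_watts = SOLAR_CONSTANT_W_PER_M2 * cross_section_m2
    return power_watts * EARTH_AGE_YEARS * SECONDS_PER_YEAR


max_bit_erasures = sunlight_energy_joules() / landauer_joules_per_bit(TEMPERATURE_K)
print(f"roughly 10^{math.log10(max_bit_erasures):.0f} irreversible bit erasures")
```

Changing the assumed temperature or solar history shifts the answer by small factors, not by tens of orders of magnitude, so the qualitative comparison is not sensitive to these choices.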

Given that evolution did produce intelligence (us), we should reject the argument. But what should we make of the expert estimates then? One interpretation is that “proportion of the problem solved” behaves more like an exponential, because the inputs are growing exponentially, and so the time taken to do the last 90% can be much less than 9x the time taken for the first 10%.

Rohin’s opinion: This seems like a pretty clear reductio to me, though it is possible to argue that this argument doesn’t apply because compute isn’t the bottleneck, i.e. even with infinite compute we wouldn’t know how to make AGI. (That being said, I mostly do think we could build AGI if only we had enough compute; see also last week’s highlight on the scaling hypothesis (AN #156).)

MISCELLANEOUS (ALIGNMENT)

Progress on Causal Influence Diagrams (Tom Everitt et al) (summarized by Rohin): Many of the problems we care about (reward gaming, wireheading, manipulation) are fundamentally a worry that our AI systems will have the wrong incentives. Thus, we need Causal Influence Diagrams (CIDs): a formal theory of incentives. These are graphical models (AN #49) in which there are action nodes (which the agent controls) and utility nodes (which determine what the agent wants). Once such a model is specified, we can talk about various incentives the agent has. This can then be used for several applications:

1. We can analyze what happens when you intervene on the agent’s action. Depending on whether the RL algorithm uses the original or modified action in its update rule, we may or may not see the algorithm disable its off switch.

2. We can avoid reward tampering (AN #71) by removing the connections from future rewards to utility nodes; in other words, we ensure that the agent evaluates hypothetical future outcomes according to its current reward function.

3. A multiagent version allows us to recover concepts like Nash equilibria and subgames from game theory, using a very simple, compact representation.
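As a very rough illustration of the kind of object being described, here is a sketch of a diagram with action, chance, and utility nodes, plus one simple graphical query: which chance nodes lie on a directed path from an action to a utility. The class, method names, and toy graph are my own, and the query is a simplification for illustration, not the incentive definitions from the CID papers.

```python
from dataclasses import dataclass, field
from typing import Dict, Set


@dataclass
class CausalInfluenceDiagram:
    """A directed graph whose nodes are tagged as action, chance, or utility."""
    kinds: Dict[str, str] = field(default_factory=dict)
    children: Dict[str, Set[str]] = field(default_factory=dict)

    def add_node(self, name: str, kind: str) -> None:
        if kind not in {"action", "chance", "utility"}:
            raise ValueError(f"unknown node kind: {kind}")
        self.kinds[name] = kind
        self.children.setdefault(name, set())

    def add_edge(self, src: str, dst: str) -> None:
        if src not in self.kinds or dst not in self.kinds:
            raise KeyError("both endpoints must be added as nodes first")
        self.children[src].add(dst)

    def descendants(self, node: str) -> Set[str]:
        """All nodes reachable from `node` by following directed edges."""
        seen: Set[str] = set()
        stack = list(self.children[node])
        while stack:
            current = stack.pop()
            if current not in seen:
                seen.add(current)
                stack.extend(self.children[current])
        return seen

    def between_action_and_utility(self) -> Set[str]:
        """Chance nodes that an action can affect and that in turn affect a utility."""
        actions = {n for n, k in self.kinds.items() if k == "action"}
        utilities = {n for n, k in self.kinds.items() if k == "utility"}
        downstream = set().union(*(self.descendants(a) for a in actions))
        return {n for n in downstream
                if self.kinds[n] == "chance" and self.descendants(n) & utilities}


# Toy diagram: action A affects state S, which affects utility U;
# noise N also affects U but is not influenced by the action.
toy = CausalInfluenceDiagram()
for name, kind in [("A", "action"), ("S", "chance"), ("N", "chance"), ("U", "utility")]:
    toy.add_node(name, kind)
for src, dst in [("A", "S"), ("S", "U"), ("N", "U")]:
    toy.add_edge(src, dst)

print(toy.between_action_and_utility())
```

In the toy diagram only `S` is returned: the agent can influence it and it matters for utility, whereas `N` matters for utility but is outside the agent's influence.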

AI GOVERNANCE

A personal take on longtermist AI governance (Luke Muehlhauser) (summarized by Rohin): We’ve previously seen (AN #130) that Open Philanthropy struggles to find intermediate goals in AI governance that seem robustly good to pursue from a longtermist perspective. (If you aren’t familiar with longtermism, you probably want to skip to the next summary.) In this personal post, the author suggests that there are three key bottlenecks driving this:

1. There are very few longtermists in the world; those that do exist often don’t have the specific interests, skills, and experience needed for AI governance work. We could try to get others to work on relevant problems, but:

2. We don’t have the strategic clarity and forecasting ability to know which intermediate goals are important (or even net positive). Maybe we could get people to help us figure out the strategic picture? Unfortunately:

3. It’s difficult to define and scope research projects that can help clarify which intermediate goals are worth pursuing when done by people who are not themselves thinking about the issues from a longtermist perspective.

Given these bottlenecks, the author offers the following career advice for those who hope to do work from a longtermist perspective in AI governance:

1. Career decisions should be especially influenced by the value of experimentation, learning, aptitude development, and career capital.

2. Prioritize future impact, for example by building credentials to influence a 1-20 year “crunch time” period. (But make sure to keep studying and thinking about how to create that future impact.)

3. Work on building the field, especially with an eye to reducing bottleneck #1. (See e.g. here.)

4. Try to reduce bottleneck #2 by doing research that increases strategic clarity, though note that many people have tried this and it doesn’t seem like the situation has improved very much.

NEWS

Open Philanthropy Technology Policy Fellowship (Luke Muehlhauser) (summarized by Rohin): Open Philanthropy is seeking applicants for a US policy fellowship program focused on high-priority emerging technologies, especially AI and biotechnology. Application deadline is September 15.

Read more: EA Forum post

FEEDBACK

I’m always happy to hear feedback; you can send it to me, Rohin Shah, by replying to this email.

PODCAST

An audio podcast version of the Alignment Newsletter, recorded by Robert Miles, is available.