Alignment Newsletter is a weekly publication with recent content relevant to AI alignment around the world. Find all Alignment Newsletter resources here. In particular, you can look through this spreadsheet of all summaries that have ever been in the newsletter.

Audio version here (may not be up yet).

Please note that while I work at DeepMind, this newsletter represents my personal views and not those of my employer.

HIGHLIGHTS

Pretrained Transformers as Universal Computation Engines (Kevin Lu et al) (summarized by Rohin): We’ve seen some very impressive few-shot learning results from GPT-3 (AN #102) and CLIP. These work by training a large Transformer model on a giant pile of data in a particular modality (such as language or images), and then expressing tasks within that modality (e.g. summarization for a language model). This paper asks the question: could such models also help with tasks in a different modality? Surprisingly, the answer seems to be yes!

Specifically, the authors take pretrained GPT-2 models and finetune them on very different tasks, changing only the following parameters (which make up just ~0.1% of the model):

1. Input layer: This is a linear layer that transforms the input tokens before they go through the attention layers.

2. Output layer: This is a linear layer that uses the final representations to solve some downstream tasks.

3. Layer norm: These parameters are meant to mimic the statistics of the data distribution, and so need to be finetuned.

4. Positional embeddings. (They say that it only makes a slight difference to finetune these.)
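
To get a feel for why these four groups make up such a small share of the model, the sketch below counts parameters in a simplified GPT-2-like Transformer. Every dimension and parameter name here is an assumption chosen for illustration (768-dimensional embeddings, 12 blocks, 64 input tokens of dimension 16, 10 output classes); it is not the paper's exact configuration or code.

```python
# Hypothetical sizes, loosely modeled on a small GPT-2; all are assumptions.
D_MODEL = 768     # width of the residual stream
N_LAYERS = 12     # number of Transformer blocks
SEQ_LEN = 64      # number of input tokens (e.g. image patches)
D_IN = 16         # dimension of each raw input token
N_CLASSES = 10    # size of the output layer


def build_param_sizes():
    """Return {parameter name: number of scalars} for a simplified Transformer."""
    sizes = {
        "input_layer.weight": D_IN * D_MODEL,
        "input_layer.bias": D_MODEL,
        "pos_emb.weight": SEQ_LEN * D_MODEL,
        "ln_final.weight": D_MODEL,
        "ln_final.bias": D_MODEL,
        "output_layer.weight": D_MODEL * N_CLASSES,
        "output_layer.bias": N_CLASSES,
    }
    for i in range(N_LAYERS):
        p = f"block{i}."
        sizes[p + "ln1.weight"] = D_MODEL
        sizes[p + "ln1.bias"] = D_MODEL
        sizes[p + "attn.qkv.weight"] = D_MODEL * 3 * D_MODEL
        sizes[p + "attn.qkv.bias"] = 3 * D_MODEL
        sizes[p + "attn.proj.weight"] = D_MODEL * D_MODEL
        sizes[p + "attn.proj.bias"] = D_MODEL
        sizes[p + "ln2.weight"] = D_MODEL
        sizes[p + "ln2.bias"] = D_MODEL
        sizes[p + "mlp.fc.weight"] = D_MODEL * 4 * D_MODEL
        sizes[p + "mlp.fc.bias"] = 4 * D_MODEL
        sizes[p + "mlp.proj.weight"] = 4 * D_MODEL * D_MODEL
        sizes[p + "mlp.proj.bias"] = D_MODEL
    return sizes


def is_trainable(name):
    """Finetune only input/output layers, layer norms, and positional embeddings."""
    prefixes = ("input_layer", "output_layer", "pos_emb", "ln")
    return any(part.startswith(prefixes) for part in name.split("."))


sizes = build_param_sizes()
trainable = sum(n for name, n in sizes.items() if is_trainable(name))
total = sum(sizes.values())
fraction = trainable / total
print(f"trainable: {trainable} of {total} ({100 * fraction:.2f}%)")
```

With these assumed sizes, the trainable share comes out at roughly 0.13%, in the same ballpark as the ~0.1% figure above; the attention and feedforward weights, which hold almost all of the parameters, stay frozen.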

For downstream tasks, they consider tasks like memorizing bit sequences, computing XORs, MNIST and CIFAR (where each image is represented as a sequence of 64 tokens, and each token is a 4x4 patch of the image), and protein folding. None of these tasks involve any use of natural language—the input modality is completely different.
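
As a rough illustration of the image tokenization described above, here is a minimal sketch that cuts a 32x32 single-channel image into 64 tokens, each a flattened 4x4 patch. The function name, the row-major patch ordering, and the single-channel input are assumptions for illustration; the paper's preprocessing may differ in its details.

```python
def image_to_patch_tokens(image, patch=4):
    """Split a square 2D image (a list of rows) into flattened patch x patch tokens."""
    size = len(image)
    if size % patch != 0 or any(len(row) != size for row in image):
        raise ValueError("expected a square image whose side is a multiple of patch")
    tokens = []
    # Walk over patches left to right, top to bottom (an assumed ordering).
    for top in range(0, size, patch):
        for left in range(0, size, patch):
            token = [image[r][c]
                     for r in range(top, top + patch)
                     for c in range(left, left + patch)]
            tokens.append(token)
    return tokens


# A dummy 32x32 image whose pixel values simply encode their position.
image = [[row * 32 + col for col in range(32)] for row in range(32)]
tokens = image_to_patch_tokens(image)
print(len(tokens), len(tokens[0]))  # 64 tokens, 16 values per token
```

Each 16-dimensional token would then pass through the finetuned input layer before reaching the frozen attention layers.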

The headline result: these sorts of models tend to achieve performance similar to that of Transformer models trained from scratch on the same tasks, and better performance than models initialized with random weights and then finetuned using the method above. This indicates that GPT-2 pretraining helps even for new data modalities, which suggests that the model has learned some “universal computation” in its attention layers (hence the title). Note though that the differences from the random initialization are not that large (2-6 percentage points, except 25 percentage points in Bit Memory), so a lot of this might be due to the inductive bias of the Transformer architecture itself.

The rest of the paper delves into this more, running several experiments to learn more empirical facts. For example:

1. If the Transformers are pretrained on images instead of language, you do better on image tasks like CIFAR, but not as well on the other tasks.

2. Transformers do a lot better than LSTMs.

3. Pretrained Transformers also learn significantly faster than randomly initialized Transformers.

Rohin’s opinion: This is a pretty cool result. I’m not sure what I would have predicted ahead of time—the gains are small enough that I could believe I might have predicted them on a general basis of “probably training on realistic data gives you slightly better patterns of thought, so probably if you try hard enough you can find a small set of parameters to finetune that would work well”.

However, another possible line of reasoning would be “the attention heuristics learned for language would probably throw away lots of information if we applied them directly to the input tokens, and the input linear layer may not be enough to handle this issue, so probably this just destroys any good performance of the model”. I could see myself being convinced by that too.

TECHNICAL AI ALIGNMENT

TECHNICAL AGENDAS AND PRIORITIZATION

Alignment of Language Agents (Zachary Kenton et al) (summarized by Rohin): This paper analyzes the various problems we consider in AI alignment from the perspective of language agents. Problems covered include specification gaming (AN #1), whom and what to align to (AN #85), intent alignment (AN #33), removing tampering incentives (AN #126), and inner alignment (AN #58). These can be categorized as different kinds of misspecification, namely misspecification in the training data, the training process, and the behavior under distributional shift.

While the conceptual problems are similar to the ones already considered for embodied RL agents, the ways they manifest are different. In particular, the authors highlight the possibility that language agents will deceive us, manipulate us, or produce harmful content. The authors review some existing definitions of deception and manipulation that are purely behavioral (that is, the definitions do not require an intent to deceive or manipulate). A signaller deceives a receiver if the signaller transmits (or suggestively doesn’t transmit) a signal that causes the receiver to believe some false claim that benefits the signaller. Manipulation is similar, except rather than causing the receiver to believe a false claim, it causes the receiver to take some action that benefits the signaller, that in some sense the receiver “shouldn’t” have taken. We could cash out “the receiver ‘shouldn’t’ have taken the action” just as “the action is harmful to the receiver”, but from a safety / security mindset, the authors prefer a broader definition that aims to identify bad means of influencing the receiver, instead of only focusing on whether the ends were bad.

Some other miscellaneous points:

- Since the “action space” is just language, it seems like it should be easier (though it still requires work) to prevent language agents from causing physical harm.

- It will hopefully be easier to train language agents to be explainable, since they have native fluency in natural language with which they can explain their behavior.

Read more: Paper: Alignment of Language Agents

FORECASTING

Measuring Mathematical Problem Solving With the MATH Dataset (Dan Hendrycks et al) (summarized by Rohin): We’ve seen GPT-3 (AN #102) perform well on lots of downstream tasks. What about challenging high school math problems that require intuition to solve? The authors create the MATH dataset and demonstrate that this is in fact challenging for models: models currently get around 5-7%, even when pretraining on a dataset of math-relevant text and finetuning on the MATH training dataset. Note that the models have to get the answer exactly right: there is no partial credit.
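
To make the “no partial credit” scoring concrete, here is a minimal sketch of exact-match accuracy. The function name and the whitespace-only normalization are assumptions, and the example strings are made up for illustration; they are not taken from the dataset, whose actual answer normalization may be more forgiving.

```python
def exact_match_accuracy(predictions, references):
    """Fraction of predictions that equal the reference answer exactly."""
    if len(predictions) != len(references):
        raise ValueError("predictions and references must have the same length")
    if not references:
        return 0.0
    # Only leading/trailing whitespace is ignored; anything else must match.
    correct = sum(p.strip() == r.strip() for p, r in zip(predictions, references))
    return correct / len(references)


# Made-up answers: the second and third are mathematically equivalent to the
# references, but are written differently and so receive no credit here.
predictions = ["12", "x+1", "3.5"]
references = ["12", "x + 1", "7/2"]
accuracy = exact_match_accuracy(predictions, references)
print(f"{accuracy:.2f}")
```

Under a metric like this, a nearly correct derivation with a small slip at the end scores the same as a blank answer.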

Not only are current models not very good at the task, but also they scale poorly—while there isn’t much data to extrapolate from yet, a simple extrapolation suggests that models would need 10^35 parameters to achieve just 40% accuracy. In contrast, in a simple study with university students, performance ranged between 40% and 90%, with the best human only making minor arithmetic errors. This suggests we’ll need additional algorithmic improvements for better performance.
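
The kind of extrapolation mentioned above can be sketched as follows, assuming accuracy grows linearly in the logarithm of the parameter count. That functional form is an assumption, not necessarily the authors' method, and the two data points in the usage example are toy numbers chosen so the arithmetic is easy to follow; they are not results from the paper.

```python
import math


def log10_params_needed(point_a, point_b, target_accuracy):
    """Extrapolate a log-linear trend through two (params, accuracy) points.

    Returns log10 of the parameter count at which the fitted line reaches
    target_accuracy. Working in log10 avoids overflow for huge counts.
    """
    (n_a, acc_a), (n_b, acc_b) = point_a, point_b
    x_a, x_b = math.log10(n_a), math.log10(n_b)
    slope = (acc_b - acc_a) / (x_b - x_a)  # accuracy gained per 10x parameters
    if slope <= 0:
        raise ValueError("accuracy must increase with parameter count")
    return x_b + (target_accuracy - acc_b) / slope


# Toy numbers only: 10% accuracy at 10^2 parameters, 20% at 10^4.
exponent = log10_params_needed((1e2, 0.10), (1e4, 0.20), 0.40)
print(f"about 10^{exponent:.1f} parameters")
```

The shallower the slope, the more orders of magnitude each additional point of accuracy costs, which is how a small measured gain per model size can turn into an astronomically large estimate.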

The authors also consider allowing language models to have “scratch space” to work on the problem: the models are prompted to generate a solution where they explain their work. They find that this actually decreases accuracy, presumably because the poor generations at the beginning end up confusing the model.

Rohin’s opinion: While reading this paper, I kept stopping to do the math problems because, well, I’m just easily distracted by math problems. But it did demonstrate one thing: when the model gets it right, it can be really impressively right (at least in this one presumably cherry-picked example). In one example from the paper (search for “ab5”), the ground-truth solution is horribly hacky, and my housemate and I each separately got significantly more elegant solutions, but the model-generated solution was more elegant than either of ours. It’s a good example of how AI capabilities can be really lopsided: no human would generate an explanation this good if they were getting 6% accuracy overall.

MISCELLANEOUS (ALIGNMENT)

My AGI Threat Model: Misaligned Model-Based RL Agent (Steve Byrnes) (summarized by Rohin): This post lays out a pathway by which an AI-induced existential catastrophe could occur. The author suggests that AGI will be built via model-based reinforcement learning: given a reward function, we will learn a world model, a value function, and a planner / actor. These will learn online, that is, even after being deployed these learned models will continue to be updated by our learning algorithm (gradient descent, or whatever replaces it). Most research effort will be focused on learning these models, with relatively less effort applied to choosing the right reward function.

There are then two alignment problems: the outer alignment problem is whether the reward function correctly reflects the designer’s intent, and the inner alignment problem is whether the value function accurately represents the expected reward obtained by the agent over the long term. On the inner alignment side, the value function may not accurately capture the reward for several reasons, including ambiguity in the reward signals (since you only train the value function in some situations, and many reward functions can then produce the same value function), manipulation of the reward signal, failures of credit assignment, ontological crises, and having mutually contradictory “parts” of the value function (similarly to humans). On the outer alignment side, we have the standard problem that the reward function may not reflect what we actually want (i.e. specification gaming or Goodhart’s Law). In addition, it seems likely that many capability enhancements will be implemented through the reward function, e.g. giving the agent a curiosity reward, which increases outer misalignment.
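
One of the inner alignment issues above, that the value function is only trained in some situations, can be illustrated with a toy tabular TD(0) learner. This is a hypothetical sketch, not the author's model: the chain environment, the reward, and the hyperparameters are all made up for illustration.

```python
def td0_values(episodes=200, alpha=0.5, gamma=0.9):
    """Learn state values online with TD(0) on a deterministic chain.

    States 0..4 form a chain walked left to right; reaching state 4 ends the
    episode with reward 1. State 5 exists but is never visited in training.
    """
    n_states = 6
    values = [0.0] * n_states
    for _ in range(episodes):
        state = 0
        while state != 4:
            next_state = state + 1
            reward = 1.0 if next_state == 4 else 0.0
            # The terminal state has no future reward.
            next_value = 0.0 if next_state == 4 else values[next_state]
            # TD(0) update: move the estimate toward reward + discounted next value.
            values[state] += alpha * (reward + gamma * next_value - values[state])
            state = next_state
    return values


values = td0_values()
print([round(v, 3) for v in values])
```

The visited states converge to the discounted reward they lead to, while the estimate for state 5 stays at its initial value no matter what reward that state would actually give, so many different reward functions are consistent with the same learned value function.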

Rohin’s opinion: While I disagree on some of the details, I think this is a good threat model to be thinking about. Its main virtue is that it has a relatively concrete model for what AGI looks like, and it provides a plausible story for both how that type of AGI could be developed (the development model) and how that type of AGI would lead to problems (the risk model). Of course, it is still worth clarifying the plausibility of the scenario, as updates to the story can have significant implications for what research we do. (Some of this discussion is happening in this post.)

OTHER PROGRESS IN AI

MISCELLANEOUS (AI)

2021 AI Index Report (Daniel Zhang et al) (summarized by Zach): The AI Index Report is a project to track and distill data related to artificial intelligence. One central theme of the report is the effect of COVID on the direction of AI research. The report highlights a significant increase in spending on drug development, to 4.5 times the 2019 level. The report also puts a spotlight on the relative lack of AI ethics benchmarks. This could pose a significant problem as surveillance technologies mature. Beyond these broad themes, there’s data on publication trends, politics, diversity, and more in the 222-page report. Additionally, a significant amount of the data is publicly available or interactive.

Read more: Full report PDF

Zach’s opinion: This is well presented and you can glean a lot from looking at the introductory sections. If you choose to dive into a particular topic, charts and methodology are presented in a clear manner with nice hyperlinking to make navigation relatively painless. There is also an interactive visualization that allows for cross-country comparison according to user-defined metrics. Once again, very well presented.

NEWS

Stanford Existential Risks Conference (SERI) (summarized by Rohin): This conference on existential risks will run April 17-18. Applications to attend close April 12. There will be no charge to attend the conference.

Research Engineer, Safety (OpenAI) (summarized by Rohin): The Applied Safety team at OpenAI is looking to hire a research engineer, and explicitly states that the job is about safety of general-purpose AI systems (as opposed to narrow AI systems like autonomous vehicles).

FEEDBACK

I’m always happy to hear feedback; you can send it to me, Rohin Shah, by replying to this email.

PODCAST

An audio podcast version of the Alignment Newsletter, recorded by Robert Miles, is available.

[AN #144]: How language models can also be finetuned for non-language tasks