Find all Alignment Newsletter resources here. In particular, you can sign up, or look through this spreadsheet of all summaries that have ever been in the newsletter.

The Alignment Forum sequences have started again! As a reminder, treat them as though I had highlighted them.

Highlights

Reframing Superintelligence: Comprehensive AI Services as General Intelligence (Eric Drexler): This is a huge document; rather than summarize it all in this newsletter, I wrote up my summary in this post. For this newsletter, I’ve copied over the description of the model, but left out all of the implications and critiques.

The core idea is to look at the pathway by which we will develop general intelligence, rather than assuming that at some point we will get a superintelligent AGI agent. To predict how AI will progress in the future, we can look at how AI progresses currently—through research and development (R&D) processes. AI researchers consider a problem, define a search space, formulate an objective, and use an optimization technique in order to obtain an AI system, called a service, that performs the task.

A service is an AI system that delivers bounded results for some task using bounded resources in bounded time. Superintelligent language translation would count as a service, even though it requires a very detailed understanding of the world, including engineering, history, science, etc. Episodic RL agents also count as services.
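As a rough illustration of the R&D loop described above (not taken from the report), here is a minimal Python sketch in which a "researcher" defines a search space, formulates an objective, and runs an optimizer to obtain a service. The names `develop_service`, `squared_error`, and `km_to_miles`, the toy unit-conversion task, and the use of grid search are all assumptions chosen for illustration.

```python
from typing import Callable, Iterable, List, Tuple

# A "service" here is just a function that performs one bounded task.
Service = Callable[[float], float]


def make_linear_service(slope: float) -> Service:
    """Build a candidate service from one point in the search space."""
    return lambda x: slope * x


def develop_service(
    search_space: Iterable[float],
    objective: Callable[[Service], float],
) -> Service:
    """Hypothetical R&D step: pick the candidate with the lowest objective.

    The search space is finite, so the search uses bounded resources.
    """
    best_slope = min(search_space, key=lambda s: objective(make_linear_service(s)))
    return make_linear_service(best_slope)


# The problem: convert kilometres to miles, specified by a few examples.
examples: List[Tuple[float, float]] = [(1.0, 0.621), (10.0, 6.214), (42.0, 26.098)]


def squared_error(service: Service) -> float:
    """The objective: total squared error on the examples."""
    return sum((service(x) - y) ** 2 for x, y in examples)


# The search space: slopes from 0.000 to 2.000 in steps of 0.001.
slopes = [i / 1000 for i in range(0, 2001)]

# The optimization technique: exhaustive grid search over the slopes.
km_to_miles = develop_service(slopes, squared_error)
print(km_to_miles(100.0))
```

In this framing, automating AI R&D would mean that choosing the search space, the objective, and the optimizer are themselves tasks performed by services.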

While each of the AI R&D subtasks is currently performed by a human, as AI progresses we should expect that we will automate these tasks as well. At that point, we will have automated R&D, leading to recursive technological improvement. This is not recursive self-improvement, because the improvement comes from R&D services creating improvements in basic AI building blocks, and those improvements feed back into the R&D services. All of this should happen before we get any powerful AGI agents that can do arbitrary general reasoning.

Rohin’s opinion: I’m glad this has finally been published—it’s been informing my views for a long time now. I broadly buy the general view put forward here, with a few nitpicks that you can see in the post. I really do recommend you read at least the post—that’s just the summary of the report, so it’s full of insights, and it should be interesting to technical safety and strategy researchers alike.

I’m still not sure how this should affect what research we do—techniques like preference learning and recursive reward modeling seem applicable to CAIS as well, since they allow us to more accurately specify what we want each individual service to do.

Technical AI alignment

Iterated amplification sequence

Supervising strong learners by amplifying weak experts (Paul Christiano): This was previously covered in AN #30; I’ve copied the summary and opinion. This paper introduces iterated amplification, focusing on how it can be used to define a training signal for tasks that humans cannot perform or evaluate, such as designing a transit system. The key insight is that humans are capable of decomposing even very difficult tasks into slightly simpler tasks. So, in theory, we could provide ground truth labels for an arbitrarily difficult task by a huge tree of humans, each decomposing their own subquestion and handing off new subquestions to other humans, until questions are easy enough that a human can directly answer them.

We can turn this into an efficient algorithm by having the human decompose the question only once, and using the current AI system to answer the generated subquestions. If the AI isn’t able to answer the subquestions, then the human will get nonsense answers. However, as long as there are questions that the human + AI system can answer but the AI alone cannot answer, the AI can learn from the answers to those questions. To reduce the reliance on human data, another model is trained to predict the decomposition that the human performs. In addition, some tasks could refer to a large context (e.g. evaluating safety for a specific rocket design), so the authors model the human as being able to access small pieces of the context at a time.
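The loop described above can be sketched on a toy task. This is a minimal illustration under stated assumptions, not the paper's implementation: the task (summing a tuple of positive integers), the lookup-table `TableModel` standing in for a learned model, and the function names are all hypothetical, and the learned decomposition predictor and the large-context mechanism are omitted.

```python
from typing import Dict, Set, Tuple

# A question is "what is the sum of these positive integers?"
Question = Tuple[int, ...]


class TableModel:
    """Stand-in for the learned AI: it memorizes the answers it is trained on."""

    def __init__(self) -> None:
        self.table: Dict[Question, int] = {}

    def answer(self, q: Question) -> int:
        # Questions the model has not learned get a nonsense answer (here 0).
        return self.table.get(q, 0)

    def train(self, examples: Dict[Question, int]) -> None:
        self.table.update(examples)


def human_with_model(q: Question, model: TableModel) -> int:
    """The human decomposes the question once; the model answers the parts."""
    if len(q) <= 1:
        return q[0] if q else 0  # easy enough for the human to answer directly
    mid = len(q) // 2
    return model.answer(q[:mid]) + model.answer(q[mid:])


def all_subquestions(q: Question) -> Set[Question]:
    """Every question reachable from q by repeated halving, including q."""
    found = {q}
    if len(q) > 1:
        mid = len(q) // 2
        found |= all_subquestions(q[:mid]) | all_subquestions(q[mid:])
    return found


def amplify_and_distill(
    questions: Set[Question], model: TableModel, rounds: int
) -> TableModel:
    """Each round: label questions with human + current model, then train on them."""
    for _ in range(rounds):
        targets = {q: human_with_model(q, model) for q in questions}
        model.train(targets)
    return model


top = (1, 2, 3, 4, 5, 6, 7, 8)
model = amplify_and_distill(all_subquestions(top), TableModel(), rounds=4)
print(model.answer(top))
```

In this sketch, each round extends correct answers to questions twice as long: the model learns single numbers first, then pairs, then groups of four, so four rounds are needed before it answers the eight-number question correctly (36). Early rounds also train on nonsense labels for the longer questions, which later rounds overwrite.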

They evaluate on simple algorithmic tasks like distance between nodes in a graph, where they can program an automated human decomposition for faster experiments, and there is a ground truth solution. They compare against supervised learning, which trains a model on the ground truth answers to questions (which iterated amplification does not have access to), and find that they can match the performance of supervised learning with only slightly more training steps.

Rohin’s opinion: This is my new favorite post/paper for explaining how iterated amplification works, since it very succinctly and clearly makes the case for iterated amplification as a strategy for generating a good training signal. I’d recommend reading the paper in full, as it makes other important points that I haven’t included in the summary.

Note that it does not explain a lot of Paul’s thinking. It explains one particular training method that allows you to train an AI system with a more intelligent and informed overseer.

Value learning sequence

Will humans build goal-directed agents? (Rohin Shah): The previous post argued that coherence arguments do not mean that a superintelligent AI must have goal-directed behavior. In this post, I consider other arguments suggesting that we’ll build goal-directed AI systems.

- Since humans are goal-directed, they will build goal-directed AI to help them achieve their goals. Reaction: Somewhat agree, but this only shows that the human + AI system should be goal-directed, not the AI itself.

- Goal-directed AI can exceed human performance. Reaction: Mostly agree, but there could be alternatives that still exceed human performance.

- Current RL agents are goal-directed. Reaction: While the math says this, in practice this doesn’t seem true, since RL agents learn from experience rather than planning over the long term.

- Existing intelligent agents are goal-directed. Reaction: Seems like a good reason to not build AI using evolution.

- Goal-directed agents are more interpretable and so more desirable. Reaction: Disagree, it seems like we’re arguing that we should build goal-directed AI so that we can more easily predict that it will cause catastrophe.

AI safety without goal-directed behavior (Rohin Shah): The main thrust of the second chapter of the sequence is that it is not required for a superintelligent AI system to be goal-directed. While there are certainly economic arguments suggesting that we will build goal-directed AI, these do not have the force of a theorem. Given the strong arguments we’ve developed that goal-directed AI would likely be dangerous, it seems worth exploring other options. Some possibilities are AI systems that infer and follow norms, corrigible AI, and bounded and episodic AI services.

These other possibilities can be cast in a utility-maximization framework. However, if you do that then you are once again tempted to say that you are screwed if you get the utility function slightly wrong. Instead, I would want to build these systems in such a way that the desirable properties are inherent to the way that they reason, so that it isn’t even a coherent question to ask “what if we get it slightly wrong”.

Problems

Imitation learning considered unsafe? (capybaralet): We might hope that using imitation learning to mimic a corrigible human would be safe. However, this would involve mimicking the human’s planning process. It seems fairly likely that slight errors in the imitation of this process could lead to the creation of a goal-directed planning process that does dangerous long-term optimization.

Rohin’s opinion: This seems pretty similar to the problem of inner optimizers, in which, while searching for a good policy for some task T on training distribution D, you end up finding a consequentialist agent that is optimizing some utility function that leads to good performance on D. That agent will have all the standard dangers of goal-directed optimization out of distribution.

Two More Decision Theory Problems for Humans (Wei Dai): The first problem is that any particular human’s values only make sense for the current environment. When considering different circumstances (e.g. an astronomically large number of very slightly negative experiences like getting a dust speck in your eye), many people will not know how to evaluate such a situation.

The second problem is that for most formalizations of values or utility functions, the values are defined relative to some way of making decisions in the world, or some ontology through which we understand the world. If this decision theory or ontology changes, it’s not clear how to “transfer” the values to the new version.

Predictors as Agents