Find all Alignment Newsletter resources here. In particular, you can sign up, or look through this spreadsheet of all summaries that have ever been in the newsletter. I’m always happy to hear feedback; you can send it to me by replying to this email.
Audio version here (may not be up yet).
Welcome to another special edition of the newsletter! In this edition, I summarize four conversations that AI Impacts had with researchers who were optimistic that AI safety would be solved “by default”. (Note that one of the conversations was with me.)
While all four of these conversations covered very different topics, I think there were three main points of convergence. First, we were relatively unconvinced by the traditional arguments for AI risk, and found discontinuities relatively unlikely. Second, we were more optimistic about solving the problem in the future, when we know more about the problem and have more evidence about powerful AI systems. And finally, we were more optimistic that as we get more evidence of the problem in the future, the existing ML community will actually try to fix that problem.
Conversation with Paul Christiano (Paul Christiano, Asya Bergal, Ronny Fernandez, and Robert Long) (summarized by Rohin): There can’t be too many things that reduce the expected value of the future by 10%; if there were, there would be no expected value left (ETA: see this comment). So, the prior that any particular thing has such an impact should be quite low. With AI in particular, obviously we’re going to try to make AI systems that do what we want them to do. So starting from this position of optimism, we can then evaluate the arguments for doom. The two main arguments: first, we can’t distinguish ahead of time between AIs that are trying to do the right thing, and AIs that are trying to kill us, because the latter will behave nicely until they can execute a treacherous turn. Second, since we don’t have a crisp concept of “doing the right thing”, we can’t select AI systems on whether they are doing the right thing.
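The arithmetic behind the first point above can be made concrete. The sketch below assumes, purely for illustration, that each risk independently removes 10% of whatever expected value remains; this multiplicative model and the name `remaining_value` are assumptions of the sketch, not something stated in the conversation.

```python
def remaining_value(num_risks: int, reduction: float = 0.10) -> float:
    """Fraction of expected value left after `num_risks` independent risks,
    each removing `reduction` of whatever value remains (assumed model)."""
    return (1.0 - reduction) ** num_risks


# Show how quickly the remaining value shrinks as such risks accumulate.
for n in (1, 5, 10, 20, 50):
    print(f"{n:>2} risks -> {remaining_value(n):.3f} of the original value remains")
```

Under this assumed model, ten such risks leave roughly a third of the original value and fifty leave under 1%, which is the sense in which there cannot be very many of them.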
However, there are many “saving throws”, or ways that the argument could break down, avoiding doom. Perhaps there’s no problem at all, or perhaps we can cope with it with a little bit of effort, or perhaps we can coordinate to not build AIs that destroy value. Paul assigns a decent amount of probability to each of these (and other) saving throws, and any one of them suffices to avoid doom. This leads Paul to estimate that AI risk reduces the expected value of the future by roughly 10%, a relatively optimistic number. Since it is so neglected, concerted effort by longtermists could reduce it to 5%, making it still a very valuable area for impact. The main way he expects to change his mind is from evidence from more powerful AI systems, e.g. as we build more powerful AI systems, perhaps inner optimizer concerns will materialize and we’ll see examples where an AI system executes a non-catastrophic treacherous turn.
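The structure of the “saving throws” reasoning above can be sketched numerically: if any single saving throw suffices, doom requires every one of them to fail. The sketch assumes the saving throws are independent, and the probabilities in it are hypothetical placeholders rather than Paul’s actual numbers. Note also that Paul’s 10% figure is a reduction in expected value, not a probability of doom, so this illustrates only the shape of the argument.

```python
from math import prod


def all_fail_probability(saving_throw_probs):
    """Probability that every saving throw fails, assuming independence."""
    return prod(1.0 - p for p in saving_throw_probs)


# Hypothetical probabilities, chosen only for illustration.
throws = {
    "there is no problem at all": 0.3,
    "a little effort is enough": 0.4,
    "we coordinate not to build harmful AI": 0.3,
}

print(f"P(all saving throws fail) = {all_fail_probability(throws.values()):.3f}")
```

Each additional saving throw with a decent probability multiplies the failure probability down further, which is why several moderately likely escape routes can add up to substantial optimism.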
Paul also believes that clean algorithmic problems are usually solvable in 10 years, or provably impossible, and early failures to solve a problem don’t provide much evidence of the difficulty of the problem (unless they generate proofs of impossibility). So, the fact that we don’t know how to solve alignment now doesn’t provide very strong evidence that the problem is impossible. Even if the clean versions of the problem were impossible, that would suggest that the problem is much more messy, which requires more concerted effort to solve but also tends to be just a long list of relatively easy tasks to do. (In contrast, MIRI thinks that prosaic AGI alignment is probably impossible.)
Note that even finding out that the problem is impossible can help; it makes it more likely that we can all coordinate to not build dangerous AI systems, since no one wants to build an unaligned AI system. Paul thinks that right now the case for AI risk is not very compelling, and so people don’t care much about it, but if we could generate more compelling arguments, then they would take it more seriously. If instead you think that the case is already compelling (as MIRI does), then you would be correspondingly more pessimistic about others taking the arguments seriously and coordinating to avoid building unaligned AI.
One potential reason MIRI is more doomy is that they take a somewhat broader view of AI safety: in particular, in addition to building an AI that is trying to do what you want it to do, they would also like to ensure that when the AI builds successors, it does so well. In contrast, Paul simply wants to leave the next generation of AI systems in at least as good a situation as we find ourselves in now, since they will be both better informed and more intelligent than we are. MIRI has also previously defined aligned AI as one that produces good outcomes when run, which is a much broader conception of the problem than Paul has. But probably the main disagreement between MIRI and ML researchers is that ML researchers expect that we’ll try a bunch of stuff, and something will work out, whereas MIRI expects that the problem is really hard, such that trial and error will only get you solutions that appear to work.
Rohin’s opinion: A general theme here seems to be that MIRI feels like they have very strong arguments, while Paul thinks that they’re plausible arguments, but aren’t extremely strong evidence. Simply having a lot more uncertainty leads Paul to be much more optimistic. I agree with most of this.
However, I do disagree with the point about “clean” problems. I agree that clean algorithmic problems are usually solved within 10 years or are provably impossible, but it doesn’t seem to me like AI risk counts as a clean algorithmic problem: we don’t have a nice formal statement of the problem that doesn’t rely on intuitive concepts like “optimization”, “trying to do something”, etc. This suggests to me that AI risk is more “messy”, and so may require more time to solve.
Conversation with Rohin Shah (Rohin Shah, Asya Bergal, Robert Long, and Sara Haxhia) (summarized by Rohin): The main reason I am optimistic about AI safety is that we will see problems in advance, and we will solve them, because nobody wants to build unaligned AI. A likely crux is that I think that the ML community will actually solve the problems, as opposed to applying a bandaid fix that doesn’t scale. I don’t know why there are different underlying intuitions here.
In addition, many of the classic arguments for AI safety involve a system that can be decomposed into an objective function and a world model, which I suspect will not be a good way to model future AI systems. In particular, current systems trained by RL look like a grab bag of heuristics that correlate well with obtaining high reward. I think that as AI systems become more powerful, the heuristics will become more and more general, but they still won’t decompose naturally into an objective function, a world model, and search. In addition, we can look at humans as an example: we don’t fully pursue convergent instrumental subgoals; for example, humans can be convinced to pursue different goals. This makes me more skeptical of traditional arguments.
I would guess that AI systems will become more interpretable in the future, as they start using the features / concepts / abstractions that humans are using. Eventually, sufficiently intelligent AI systems will probably find even better concepts that are alien to us, but if we only consider AI systems that are (say) 10x more intelligent than us, they will probably still be using human-understandable concepts. This should make alignment and oversight of these systems significantly easier. For significantly stronger systems, we should be delegating the problem to the AI systems that are 10x more intelligent than us. (This is very similar to the picture painted in Chris Olah’s views on AGI safety (AN #72), but that had not been published and I was not aware of Chris’s views at the time of this conversation.)
I’m also less worried about race dynamics increasing accident risk than the median researcher. The benefit of racing a little bit faster is to have a little bit more power / control over the future, while also increasing the risk of extinction a little bit. This seems like a bad trade from each agent’s perspective. (That is, the Nash equilibrium is for all agents to be cautious, because the potential upside of racing is small and the potential downside is large.) I’d be more worried if [AI risk is real AND not everyone agrees AI risk is real when we have powerful AI systems], or if the potential upside was larger (e.g. if racing a little more made it much more likely that you could achieve a decisive strategic advantage).
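The Nash equilibrium claim above can be checked in a toy two-agent game. Everything in this sketch is an assumption made for illustration: the payoff is modeled as an agent’s share of the future multiplied by the probability of avoiding extinction, extinction is worth zero, and `boost` and `risk_per_racer` are hypothetical parameters.

```python
from itertools import product

ACTIONS = ("cautious", "race")


def payoff(me, other, boost, risk_per_racer):
    """Assumed payoff for `me`: share of the future times survival probability."""
    share = 0.5
    if me == "race" and other == "cautious":
        share += boost  # racing alone buys a bit more control
    elif me == "cautious" and other == "race":
        share -= boost
    racers = (me == "race") + (other == "race")
    return share * (1.0 - risk_per_racer * racers)


def pure_nash_equilibria(boost, risk_per_racer):
    """Action pairs where neither agent gains by deviating unilaterally."""
    equilibria = []
    for a, b in product(ACTIONS, repeat=2):
        a_ok = all(payoff(a, b, boost, risk_per_racer)
                   >= payoff(alt, b, boost, risk_per_racer) for alt in ACTIONS)
        b_ok = all(payoff(b, a, boost, risk_per_racer)
                   >= payoff(alt, a, boost, risk_per_racer) for alt in ACTIONS)
        if a_ok and b_ok:
            equilibria.append((a, b))
    return equilibria


# Small upside from racing versus a much larger upside.
print(pure_nash_equilibria(boost=0.05, risk_per_racer=0.10))
print(pure_nash_equilibria(boost=0.30, risk_per_racer=0.10))
```

With these hypothetical numbers, a small upside makes mutual caution the only pure equilibrium, while a large upside (closer to a decisive strategic advantage) flips the only pure equilibrium to mutual racing, matching the two cases distinguished in the paragraph above.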
Overall, it feels like there’s around a 90% chance that AI would not cause x-risk without additional intervention by longtermists. The biggest disagreement between me and more pessimistic researchers is that I think gradual takeoff is much more likely than discontinuous takeoff (and in fact, the first, third and fourth paragraphs above are quite weak if there’s a discontinuous takeoff). If I condition on discontinuous takeoff, then I mostly get very confused about what the world looks like, but I also get a lot more worried about AI risk, especially because the “AI is to humans as humans are to ants” analogy starts looking more accurate. In the interview I said there was a 70% chance of doom in this world, but with way more uncertainty than any of the other credences, because I’m really confused about what that world looks like. Two other disagreements, besides the ones above: I don’t buy Realism about rationality (AN #25), whereas I expect many pessimistic researchers do. I may also be more pessimistic about our ability to write proofs about fuzzy concepts like those that arise in alignment.
On timelines, I estimated a very rough 50% chance of AGI within 20 years, and a 30-40% chance that it would be using “essentially current techniques” (which is obnoxiously hard to define). Conditional on both of those, I estimated a 70% chance that it would be something like a mesa optimizer, mostly because optimization is a very useful instrumental strategy for solving many tasks, especially because gradient descent and other current algorithms are very weak optimization algorithms (relative to e.g. humans), and so learned optimization algorithms will be necessary to reach human levels of sample efficiency.
Rohin’s opinion: Looking over this again, I’m realizing that I didn’t emphasize enough that most of my optimism comes from the more outside view type considerations: that we’ll get warning signs that the ML community won’t ignore, and that the AI risk arguments are not watertight. The other parts are particular inside view disagreements that make me more optimistic, but they don’t factor in much into my optimism besides being examples of how the meta considerations could play out. I’d recommend this comment of mine to get more of a sense of how the meta considerations factor into my thinking.
I was also glad to see that I still broadly agree with things I said ~5 months ago (since no major new opposing evidence has come up since then), though as I mentioned above, I would now change what I place emphasis on.
Conversation with Robin Hanson (Robin Hanson, Asya Bergal, and Robert Long) (summarized by Rohin): The main theme of this conversation is that AI safety does not look particularly compelling on an outside view. Progress in most areas is relatively incremental and continuous; we should expect the same to be true for AI, suggesting that timelines should be quite long, on the order of centuries. The current AI boom looks similar to previous AI booms, which didn’t amount to much in the past.
Timelines could be short if progress in AI were “lumpy”, as in a FOOM scenario. This could happen if intelligence were one simple thing that just has to be discovered, but Robin expects that intelligence is actually a bunch of not-very-general tools that together let us do many things, and we simply have to find all of these tools, which will presumably not be lumpy. Most of the value from tools comes from more specific, narrow tools, and intelligence should be similar. In addition, the literature on human uniqueness suggests that it wasn’t “raw intelligence” or small changes to brain architecture that made humans unique, but rather our ability to process culture (communicating via language, learning from others, etc).
In any case, many researchers are now distancing themselves from the FOOM scenario, and are instead arguing that AI risk occurs due to standard principal-agent problems, in the situation where the agent (AI) is much smarter than the principal (human). Robin thinks that this doesn’t agree with the existing literature on principal-agent problems, in which losses from principal-agent problems tend to be bounded, even when the agent is smarter than the principal.
You might think that since the stakes are so high, it’s worth working on it anyway. Robin agrees that it’s worth having a few people (say a hundred) pay attention to the problem, but doesn’t think it’s worth spending a lot of effort on it right now. Effort is much more effective and useful once the problem becomes clear, or once you are working with a concrete design; we have neither of these right now and so we should expect that most effort ends up being ineffective. It would be better if we saved our resources for the future, or if we spent time thinking about other ways that the future could go (as in his book, Age of Em).
It’s especially bad that AI safety has thousands of “fans”, because this leads to a “crying wolf” effect—even if the researchers have subtle, nuanced beliefs, they cannot control the message that the fans convey, which will not be nuanced and will instead confidently predict doom. Then when doom doesn’t happen, people will learn not to believe arguments about AI risk.
Rohin’s opinion: Interestingly, I agree with almost all of this, even though it’s (kind of) arguing that I shouldn’t be doing AI safety research at all. The main place I disagree is that losses from principal-agent problems with perfectly rational agents are bounded—this seems crazy to me, and I’d be interested in specific paper recommendations (though note I and others have searched and not found many).
On the point about lumpiness, my model is that there are only a few underlying factors (such as the ability to process culture) that allow humans to so quickly learn to do so many tasks, and almost all tasks require near-human levels of these factors to be done well. So, once AI capabilities on these factors reach approximately human level, we will “suddenly” start to see AIs beating humans on many tasks, resulting in a “lumpy” increase on the metric of “number of tasks on which AI is superhuman” (which seems to be the metric that people often use, though I don’t like it, precisely because it seems like it wouldn’t measure progress well until AI becomes near-human-level).
Conversation with Adam Gleave (Adam Gleave et al) (summarized by Rohin): Adam finds the traditional arguments for AI risk unconvincing. First, it isn’t clear that we will build an AI system that is so capable that it can fight all of humanity from its initial position where it doesn’t have any resources, legal protections, etc. While discontinuous progress in AI could cause this, Adam doesn’t see much reason to expect such discontinuous progress: it seems like AI is progressing by using more computation rather than finding fundamental insights. Second, we don’t know how difficult AI safety will turn out to be; he gives a probability of ~10% that the problem is as hard as (a caricature of) MIRI suggests, where any design not based on mathematical principles will be unsafe. This is especially true because as we get closer to AGI we’ll have many more powerful AI techniques that we can leverage for safety. Third, Adam does expect that AI researchers will eventually solve safety problems; they don’t right now because it seems premature to work on those problems. Adam would be more worried if there were more arms race dynamics, or more empirical evidence or solid theoretical arguments in support of speculative concerns like inner optimizers. He would be less worried if AI researchers spontaneously started to work on relevant problems (more than they already do).
Adam makes the case for AI safety work differently. At the highest level, it seems possible to build AGI, and some organizations are trying very hard to build AGI, and if they succeed it would be transformative. That alone is enough to justify some effort into making sure such a technology is used well. Then, looking at the field itself, it seems like the field is not currently focused on doing good science and engineering to build safe, reliable systems. So there is an opportunity to have an impact by pushing on safety and reliability. Finally, there are several technical problems that we do need to solve before AGI, such as how we get information about what humans actually want.
Adam also thinks that it’s 40-50% likely that when we build AGI, a PhD thesis describing it would be understandable by researchers today without too much work, but ~50% that it’s something radically different. However, it’s only 10-20% likely that AGI comes only from small variations of current techniques (i.e. by vastly increasing data and compute). He would see this as more likely if we hit additional milestones by investing more compute and data (OpenAI Five was an example of such a milestone).
Rohin’s opinion: I broadly agree with all of this, with two main differences. First, I am less worried about some of the technical problems that Adam mentions, such as how to get information about what humans want, or how to improve the robustness of AI systems, and more concerned about the more traditional problem of how to create an AI system that is trying to do what you want. Second, I am more bullish on the creation of AGI using small variations on current techniques, but vastly increasing compute and data (I’d assign ~30%, while Adam assigns 10-20%).