Review of AI Alignment Progress

I’m having trouble keeping track of everything I’ve learned about AI and AI alignment in the past year or so. I’m writing this post in part to organize my thoughts, and to a lesser extent I’m hoping for feedback about what important new developments I’ve been neglecting. I’m sure that I haven’t noticed every development that I would consider important.

I’ve become a bit more optimistic about AI alignment in the past year or so.

I currently estimate a 7% chance AI will kill us all this century. That’s down from estimates that fluctuated between something like 10% and 40% over the past decade. (The extent to which those numbers fluctuate implies enough confusion that it only takes a little bit of evidence to move my estimate a lot.)

I’m also becoming more nervous about how close we are to human-level and transformative AGI, and uncomfortable that I still don’t have a clear understanding of what I mean when I say human-level or transformative AGI.

Shard Theory

Shard theory is a paradigm that seems destined to replace the focus (at least on LessWrong) on utility functions as a way of describing what intelligent entities want.

I kept having trouble with the plan to get AIs to have utility functions that promote human values.

Human values mostly vary in response to changes in the environment. I can make a theoretical distinction between contingent human values and the kind of fixed terminal values that seem to belong in a utility function. But I kept getting confused when I tried to fit my values, or typical human values, into that framework. Some values seem clearly instrumental and contingent. Some values seem fixed enough to sort of resemble terminal values. But whenever I try to convince myself that I’ve found a terminal value that I want to be immutable, I end up feeling confused.

Shard theory tells me that humans don’t have values that are well described by the concept of a utility function. Probably nothing will go wrong if I stop hoping to find those terminal values.

We can describe human values as context-sensitive heuristics. That will likely also be true of AIs that we want to create.

I feel deconfused when I reject utility functions in favor of values being embedded in heuristics and/or subagents.

Some of the posts that better explain these ideas:

Do What I Mean

I’ve become a bit more optimistic that we’ll find a way to tell AIs things like “do what humans want”, have them understand that, and have them obey.

GPT-3 has a good deal of knowledge about human values, scattered around in ways that limit the usefulness of that knowledge.

LLMs show signs of being less alien than theory, or evidence from systems such as AlphaGo, led me to expect. Their training causes them to learn human concepts pretty faithfully.

That suggests clear progress toward AIs understanding human requests. That seems to be proceeding a good deal faster than any trend toward AIs becoming agenty.

However, LLMs suggest that it will be far from trivial to ensure that AIs obey some set of commands that we’ve articulated. Much of the work done by LLMs involves simulating a stereotypical human. That puts some limits on how far they stray from what we want. But the LLM doesn’t have a slot where someone could just drop in Asimov’s Laws so as to cause the LLM to have those laws as its goals.

The post Retarget The Search provides a little hope that this might become easy. I’m still somewhat pessimistic about this.

Interpretability

Interpretability feels more important than it felt a few years ago. It also feels like it depends heavily on empirical results from AGI-like systems.

I see more signs than I expected that interpretability research is making decent progress.

The post that encouraged me most was How “Discovering Latent Knowledge in Language Models Without Supervision” Fits Into a Broader Alignment Scheme. TL;DR: neural networks likely develop simple representations of whether their beliefs are true or false. The effort required to detect those representations does not seem to increase much with increasing model size.
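For concreteness, here’s a minimal sketch (in PyTorch) of the kind of contrast-consistent probe that work describes, as I understand it. The hidden states below are random placeholders for activations you’d extract from a real model on statement/negation pairs, and I’ve left out the normalization step the paper uses, so treat this as an illustration of the loss rather than a faithful reimplementation.

```python
# Sketch of a CCS-style probe: a linear probe trained so that its answers on a
# statement and its negation are consistent (sum to ~1) and confident.
import torch

N, D = 1000, 768                       # placeholder: number of pairs, hidden size
h_pos = torch.randn(N, D)              # placeholder hidden states for "X is true" prompts
h_neg = torch.randn(N, D)              # placeholder hidden states for the negated prompts

probe = torch.nn.Sequential(torch.nn.Linear(D, 1), torch.nn.Sigmoid())
opt = torch.optim.Adam(probe.parameters(), lr=1e-3)

for step in range(1000):
    p_pos = probe(h_pos).squeeze(-1)   # probe's probability that the statement is true
    p_neg = probe(h_neg).squeeze(-1)   # probe's probability that the negation is true
    consistency = ((p_pos - (1.0 - p_neg)) ** 2).mean()     # the two should sum to ~1
    confidence = (torch.minimum(p_pos, p_neg) ** 2).mean()  # avoid the p=0.5 cop-out
    loss = consistency + confidence
    opt.zero_grad()
    loss.backward()
    opt.step()
```

The encouraging part is that the probe itself stays this small and simple even as the underlying model grows.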

Other promising ideas:

I’m currently estimating a 40% chance that before we get existentially risky AI, neural nets will be transparent enough to generate an expert consensus about which AIs are safe to deploy. A few years ago, I’d have likely estimated a 15% chance of that. An expert consensus seems somewhat likely to be essential if we end up needing pivotal processes.

Foom

We continue to accumulate clues about takeoff speeds. I’m becoming increasingly confident that we won’t get a strong or unusually dangerous version of foom.

Evidence keeps accumulating that intelligence is compute-intensive. That means replacing human AI developers with AGIs won’t lead to dramatic speedups in recursive self-improvement.

Recent progress in LLMs suggests there’s an important set of skills for which AI improvement slows down as it reaches human levels, because it is learning by imitating humans. But keep in mind that there are also important dimensions on which AI easily blows past the level of an individual human (e.g. breadth of knowledge), and may slow down only as it approaches the ability of all humans combined.

LLMs also suggest that AI can become as general-purpose as humans while remaining less agentic/consequentialist. LLMs have a fairly myopic outer objective: predicting, at most, the next few thousand words of text.
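To illustrate what I mean by a myopic outer objective, here’s a toy version of the standard next-token loss. The tensors are random stand-ins rather than a real model, and the window size is just a representative number; the point is that each position is only ever scored on predicting the token immediately after it, within a bounded window.

```python
import torch
import torch.nn.functional as F

vocab_size, context = 50_000, 2048             # placeholder vocabulary and window size
logits = torch.randn(1, context, vocab_size)   # stand-in for a model's per-position predictions
tokens = torch.randint(0, vocab_size, (1, context))

# Shift by one: position t is trained only to predict token t+1, nothing further out.
loss = F.cross_entropy(
    logits[:, :-1, :].reshape(-1, vocab_size),
    tokens[:, 1:].reshape(-1),
)
```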

The agents that an LLM simulates are more far-sighted. But there are still major obstacles to them implementing long-term plans: they almost always get shut down quickly, so it would take something unusual for them to run long enough to figure out what kind of simulation they’re in and to break out.

This doesn’t guarantee they won’t become too agentic, but I suspect they’d first need to become much more capable than humans.

Evidence is also accumulating that existing general approaches will be adequate to produce AIs that exceed human abilities at most important tasks. I anticipate several more innovations at the level of ReLU and the transformer architecture, in order to improve scaling.

That doesn’t rule out the kind of major architectural breakthrough that could cause foom. But it’s hard to see a reason for predicting such a breakthrough. Extrapolations of recent trends tell me that AI is likely to transform the world in the 2030s. Whereas if foom is going to happen, I see no way to predict whether it will happen soon.

Self Concept

Nintil’s analysis of AI risk:

GPT3 is provided as an example of something that has some knowledge that could theoretically bear on situational awareness but I don’t think this goes far (It seems it has no self-concept at all); it is one thing to know about the world in general, and it is another very different to infer that you are an agent being trained. I can imagine a system that could do general purpose science and engineering without being either agentic or having a self-concept. … A great world model that comes to be by training models the way we do now need not give rise to a self-concept, which is the problematic thing.

I think it’s rather likely that smarter-than-human AGIs will tend to develop self-concepts. But I’m not too clear on when or how this will happen. In fact, the embedded agency discussions seem to hint that it’s unnatural for a designed agent to have a self-concept.

Can we prevent AIs from developing a self-concept? Is this a valuable thing to accomplish?

My shoulder Eliezer says that AIs with a self-concept will be more powerful (via recursive self-improvement), so researchers will be pressured to create them. My shoulder Eric Drexler replies that those effects are small enough that researchers can likely be deterred from creating such AIs for a nontrivial time.

I’d like to see more people analyzing this topic.

Social Influences

Leading AI labs do not seem to be on a course toward a clear-cut arms race.

Most AI labs see enough opportunities in AI that they expect most AI companies to end up being worth anywhere from $100 million to $10 trillion. A worst-case result of being a $100 million company is a good deal less scary than the typical startup environment, where people often expect a 90% chance of becoming worthless and needing to start over again. Plus, anyone competent enough to help create an existentially dangerous AI seems likely to have many opportunities to succeed if their current company fails.

Not too many investors see those opportunities, but there are more than a handful of wealthy investors who are coming somewhat close to indiscriminately throwing money at AI companies. This seems likely to promote an abundance mindset among serious companies that will dampen urges to race against other labs for first place at some hypothetical finish line. Although there’s a risk that this will lead to FTX-style overconfidence.

The worst news of 2022 is that the geopolitical world is heading toward another cold war. The world is increasingly polarized into a conflict between the West and the parts of the developed world that resist Western culture.

The US government is preparing to cripple China’s semiconductor industry.

Will that be enough to cause a serious race between the West and China to develop the first AGI? If AGI is 5 years away, I don’t see how the US government is going to develop that AGI before a private company does. But with 15-year timelines, the risks of a hastily designed government AGI look serious.

Much depends on whether the US unites around concerns about China defeating the US. It seems not too likely that China would either develop AGI faster than the US, or use AGI to conquer territories outside of Asia. But it’s easy for a country to mistakenly imagine that it’s in a serious arms race.

I’m guessing the best publicly known AIs are now replicating something like 8% of human cognition, versus something like 2.5% five years ago. That’s for systems available to the public; I’m guessing those are a year or two behind what’s been developed but remains private.

Is that increasing linearly? Exponentially? I’m guessing it’s closer to exponential growth than linear growth, partly because it grew for decades in order to get to that 2.5%.
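For what it’s worth, here’s the back-of-the-envelope arithmetic behind that guess. Both percentages are rough estimates of mine, so the implied numbers are only illustrative.

```python
import math

# Going from ~2.5% to ~8% of human cognition over 5 years:
growth_rate = (8 / 2.5) ** (1 / 5) - 1                    # ~0.26, i.e. ~26% per year
doubling_time = math.log(2) / math.log(1 + growth_rate)   # ~3 years
print(f"{growth_rate:.0%} per year, doubling roughly every {doubling_time:.1f} years")
```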

This increase will continue to be underestimated by people who aren’t paying close attention.

Advances are no longer showing up as readily quantifiable milestones (e.g. beating Go experts). Instead, key advances are more like increasing breadth of abilities. I don’t know of good ways to measure that other than “jobs made obsolete”, which is not too well quantified, and likely lags a couple of years behind the key technical advances.

I also see a possible switch from overhype to underhype. Up to maybe 5 years ago, AI companies and researchers focused a good deal on showing off their expertise, in order to hire or be hired by the best. Now the systems they’re working on are likely valuable enough that trade secrets will start to matter.

This switch is hard for most people to notice, even with ideal news sources. The storyteller industry obfuscates this further, by biasing stories to sound like the most important development of the day. So when little is happening, they exaggerate the story importance. But they switch to understating the importance when preparing for an emergency deserves higher priority than watching TV (see my Credibility of Hurricane Warnings).

Concluding Thoughts

I’m optimistic in the sense that I think that smart people are making progress on AI alignment, and that success does not look at all hopeless.

But I’m increasingly uncomfortable about how fast AGI is coming, how foggy the path forward looks, and how many uncertainties remain.