The Plan - 2025 Update

What’s “The Plan”?

For several years now, I (John) have written a post around the end of the year on our plan for AI alignment. That plan hasn't changed much over the past few years, so both this year's post and last year's are written as updates to The Plan - 2023 Version.

I’ll give a very quick outline here of what’s in the 2023 Plan post. If you have questions or want to argue about points, you should probably go to that post to get the full version.

So, how’s progress? What are you up to?

2023 and 2024 were mostly focused on Natural Latents—we’ll talk more shortly about that work and how it fits into the bigger picture. In 2025, we did continue to put out some work on natural latents, but our main focus has shifted.

Natural latents are a major foothold on understanding natural abstraction. One could reasonably argue that they’re the only rigorous foothold on the core problem to date, the first core mathematical piece of the future theory. We’ve used that foothold to pull ourselves up a bit, and can probably pull ourselves up a little further on it, but there’s more still to climb after that.

We need to figure out the next foothold.

That’s our main focus at this point. It’s wide open, very exploratory. We don’t know yet what that next foothold will look like. But we do have some sense of what problems remain, and what bottlenecks the next footholds need to address. That will be the focus of the rest of this post.

What are the next bottlenecks to understanding natural abstraction?

We see two main “prongs” to understanding natural abstraction: the territory-first prong, and the mind-first prong. These two have different bottlenecks, and would likely involve different next footholds. That said, progress on either prong makes the other much easier.

What’s the “territory-first prong”?

One canonical example of natural abstraction comes from the ideal gas (and gases pretty generally, but the ideal gas is the simplest).

We have a bunch of little molecules bouncing around in a box. The motion is chaotic: every time two molecules collide, any uncertainty in their velocities is amplified multiplicatively. So if an observer has any uncertainty in the initial conditions (which even a superintelligence would, for a real physical system), that uncertainty will grow exponentially over time, until all information is wiped out… except for conserved quantities, like e.g. the total energy of the molecules, the number of molecules, or the size of the box. So, after a short time, the best predictions our observer can make about the gas will just be equivalent to using a Maxwell-Boltzmann distribution, conditioning only on the total energy (or equivalently temperature), number of particles, and volume.

It doesn't matter whether the observer is a human or a superintelligence or an alien, and it doesn't matter whether they have a radically different internal mind-architecture than we do; it is a property of the physical gas that those handful of parameters (energy, particle count, volume) summarize all the information which can actually be used to predict anything at all about the gas' motion after a relatively short time passes.
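To make the story concrete, here's a toy illustration: a crude time-stepped simulation of hard disks bouncing around a box, run twice from initial conditions differing by 10⁻⁸ in a single velocity component. The two runs' individual velocities diverge enormously, while the total kinetic energy stays essentially identical. All parameters here are made up for illustration; this is a sketch of the phenomenon, not a serious physics simulation.

```python
import numpy as np

def step(pos, vel, dt=0.005, box=1.0, radius=0.02):
    """Advance a 2D hard-disk gas by one (crude, time-stepped) step."""
    pos = pos + vel * dt
    # Reflect off the walls of the unit box.
    for d in range(2):
        hit = (pos[:, d] < radius) | (pos[:, d] > box - radius)
        vel[hit, d] *= -1
        pos[:, d] = np.clip(pos[:, d], radius, box - radius)
    # Elastic collisions between equal-mass disks: exchange the velocity
    # component along the line of centers (conserves kinetic energy exactly).
    n = len(pos)
    for i in range(n):
        for j in range(i + 1, n):
            delta = pos[i] - pos[j]
            d2 = delta @ delta
            if 0 < d2 < (2 * radius) ** 2:
                normal = delta / np.sqrt(d2)
                dv = (vel[i] - vel[j]) @ normal
                if dv < 0:  # only if the disks are approaching
                    vel[i] -= dv * normal
                    vel[j] += dv * normal
    return pos, vel

rng = np.random.default_rng(0)
n = 12
pos = rng.uniform(0.1, 0.9, (n, 2))
vel = rng.normal(0.0, 1.0, (n, 2))

# Two copies of the gas, differing by a tiny perturbation to one velocity.
pos_a, vel_a = pos.copy(), vel.copy()
pos_b, vel_b = pos.copy(), vel.copy()
vel_b[0, 0] += 1e-8

for _ in range(3000):
    pos_a, vel_a = step(pos_a, vel_a)
    pos_b, vel_b = step(pos_b, vel_b)

energy_a = 0.5 * np.sum(vel_a ** 2)
energy_b = 0.5 * np.sum(vel_b ** 2)
spread = np.max(np.abs(vel_a - vel_b))

# The conserved quantity stays (near-)identical; the perturbation is
# amplified by many orders of magnitude by the chaotic collisions.
print(f"energy A: {energy_a:.6f}, energy B: {energy_b:.6f}")
print(f"max velocity difference: {spread:.6f}")
```

The point of the toy: after enough collisions, the 10⁻⁸ difference has grown so large that detailed velocity predictions are hopeless, yet anyone who knows the total energy (plus particle count and box size) knows everything that remains predictable.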

The key point about the gas example is that it doesn’t talk much about any particular mind. It’s a story about how a particular abstraction is natural (e.g. the energy of a gas), and that story mostly talks about properties of the physical system (e.g. chaotic dynamics wiping out all signal except the energy), and mostly does not talk about properties of a particular mind. Thus, “territory-first”.

More generally: the territory-first prong is about looking for properties of (broad classes of) physical systems, which make particular abstractions uniquely well-suited to those systems. Just like (energy, particle count, volume) is an abstraction well-suited to an ideal gas because all other info is quickly wiped out by chaos.

What’s the “mind-first prong”?

Here’s an entirely different way one might try to learn about natural abstraction.

Take a neural net, and go train it on some data from real-world physical systems (e.g. images or video, ideally). Then, do some interpretability to figure out how the net is representing those physical systems internally, what information is being passed around in what format, etc. Repeat for a few different net architectures and datasets, and look for convergence in what stuff the net represents and how.

(Is this just interpretability? Sort of. Interp is a broad label; most things called “interpretability” are not particularly relevant to the mind-first prong of natural abstraction, but progress on the mind-first prong would probably be considered interp research.)

In particular, what we’d really like here is to figure out something about how patterns in the data end up represented inside the net, and then go look in the net to learn about natural abstractions out in the territory. Ideally, we could somehow nail down the “how the natural abstractions get represented in the net” part without knowing everything about what natural abstractions even are (i.e. what even is the thing being represented in the net), so that we could learn about their type signature by looking at nets.

More generally: the mind-first prong is about looking for convergent laws governing how patterns get "burned in" to trained/evolved systems like neural nets, and then using those laws to look inside nets trained on the real world, in order to back out facts about natural abstractions in the real world.

Note that anything one can figure out about real-world natural abstractions via looking inside nets (i.e. the mind-first prong) probably tells us a lot about the abstraction-relevant physical properties of physical systems (i.e. the territory-first prong), and vice versa.

So what has and hasn’t been figured out on the territory prong?

The territory prong has been our main focus for the past few years, and it was the main motivator for natural latents. Some key pieces which have already been nailed down to varying extents:

  • The Telephone Theorem: information which propagates over a nontrivial time/distance (like e.g. energy in our ideal gas example) must be approximately conserved.

  • Natural Latents: in the language of natural latents, information which propagates over a nontrivial time/distance (like e.g. energy in our ideal gas example) must be redundantly represented in many times/places—e.g. we can back out the same energy by looking at many different time-slices, or roughly the same energy by looking at many different little chunks of the gas. If, in addition to that redundancy, that information also mediates between time/space chunks, then we get some ontological guarantees: we've found all the information which propagates.

  • Some tricks which build on natural latents:

    • To some extent, natural latent conditions can nail down particular factorizations of high-level summaries, like e.g. representing a physical electronic circuit as a few separate wires, transistors, etc. We do this by looking for components of a high-level summary latent which are natural over different physical chunks of the system.

    • We can also use natural latent conditions to nail down particular clusterings, like in A Solomonoff Inductor Walks Into A Bar.

… but that doesn’t, by itself, give us everything we want to know from the territory prong.
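As a quick numerical illustration of the redundancy idea above: in a toy ideal gas at equilibrium, many disjoint "chunks" of particles each yield roughly the same temperature estimate, so the temperature is recoverable from almost any piece of the system. The units and parameters here are made up for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)

# Velocities of a million particles at temperature T (units with m = k_B = 1):
# each velocity component is Gaussian with variance T (Maxwell-Boltzmann).
T = 2.0
v = rng.normal(0.0, np.sqrt(T), size=(1_000_000, 3))

# Estimate the temperature from 100 disjoint chunks of the gas:
# <|v|^2> = 3T, so T_hat = mean(|v|^2) / 3 within each chunk.
chunks = np.split(v, 100)
estimates = [np.mean((c ** 2).sum(axis=1)) / 3 for c in chunks]
print(f"chunk estimates: min {min(estimates):.3f}, max {max(estimates):.3f}")
```

Every chunk gives back (approximately) the same latent value, which is exactly the redundancy condition natural latents formalize.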

Here are some likely next bottlenecks:

  • String diagrams. Pretty much every technical diagram you’ve ever seen, from electronic circuits to dependency graphs to ???, is a string diagram. Why is this such a common format for high-level descriptions? If it’s fully general for high-level natural abstraction, why, and can we prove it? If not, what is?

  • The natural latents machinery says a lot about what information needs to be passed around, but says a lot less about how to represent it. What representations are natural?

  • High-level dynamics or laws, like e.g. circuit laws or gas laws. The natural latents machinery might tell us e.g. which variables should appear in high-level laws/dynamics, but it doesn't say much about the relationships between those variables, i.e. the laws/dynamics themselves. What general rules exist for those laws/dynamics? How can they be efficiently figured out from the low level? How can they be efficiently represented in full generality?

  • How can we efficiently sample the low-level given the high-level? Sure, natural latents summarize all the information relevant at long distances. But even with long-range signals controlled for, we still don't know how to sample a small low-level neighborhood: we would first need to sample a boundary for the neighborhood, that boundary needs to be in-distribution, and getting an in-distribution boundary sample is itself not something we know how to do.

And what has and hasn’t been figured out on the mind prong?

The mind prong is much more wide open at this point; we understand it less than the territory prong.

What we’d ideally like is to figure out how environment structure gets represented in the net, without needing to know what environment structure gets represented in the net (or even what structure is in the environment in the first place). That way, we can look inside trained nets to figure out what structure is in the environment.

We have some foundational pieces:

  • Singular learning theory, or something like it, is probably a necessary foundational tool here. It doesn’t directly answer the core question about how environment structure gets represented in the net, but it does give us the right mental picture for thinking about things being “learned by the net” at all. (Though if you just want to understand the mental picture, this video is probably more helpful than reading a bunch of SLT.)

  • Natural latents and the Telephone Theorem might also be relevant insofar as we view the net itself as a low-level system which embeds some high-level logic. But that also doesn’t get at the core question about how environment structure gets represented in the net.

  • There’s a fair bit to be said about commutative diagrams. They, again, don’t directly address the core representation question. But they’re one of the most obvious foundational tools to try, and when applied to neural nets, they have some surprising approximate solutions—like e.g. sparse activations.

… but none of that directly hits the core of the problem.

If you want to get a rough sense of what a foothold on the core mind prong problem might look like, try Toward Statistical Mechanics of Interfaces Under Selection Pressure. That piece is not a solid, well-developed result; probably it’s not the right way to come at this. But it does touch on most of the relevant pieces; it gives a rough sense of the type of thing which we’re looking for.

Mostly, this is a wide open area which we’re working on pretty actively.