The Plan - 2025 Update

What’s “The Plan”?

For several years now, I (John) have written a post around the end of the year on our plan for AI alignment. That plan hasn't changed much over the past few years, so both this year's post and last year's are written as updates to The Plan - 2023 Version.

I’ll give a very quick outline here of what’s in the 2023 Plan post. If you have questions or want to argue about points, you should probably go to that post to get the full version.

So, how’s progress? What are you up to?

2023 and 2024 were mostly focused on Natural Latents—we’ll talk more shortly about that work and how it fits into the bigger picture. In 2025, we did continue to put out some work on natural latents, but our main focus has shifted.

Natural latents are a major foothold on understanding natural abstraction. One could reasonably argue that they’re the only rigorous foothold on the core problem to date, the first core mathematical piece of the future theory. We’ve used that foothold to pull ourselves up a bit, and can probably pull ourselves up a little further on it, but there’s more still to climb after that.

We need to figure out the next foothold.

That’s our main focus at this point. It’s wide open, very exploratory. We don’t know yet what that next foothold will look like. But we do have some sense of what problems remain, and what bottlenecks the next footholds need to address. That will be the focus of the rest of this post.

What are the next bottlenecks to understanding natural abstraction?

We see two main “prongs” to understanding natural abstraction: the territory-first prong, and the mind-first prong. These two have different bottlenecks, and would likely involve different next footholds. That said, progress on either prong makes the other much easier.

What’s the “territory-first prong”?

One canonical example of natural abstraction comes from the ideal gas (and gases pretty generally, but the ideal gas is the simplest).

We have a bunch of little molecules bouncing around in a box. The motion is chaotic: every time two molecules collide, any uncertainty in their velocities is amplified multiplicatively. So if an observer has any uncertainty in the initial conditions (which even a superintelligence would, for a real physical system), that uncertainty will grow exponentially over time, until all information is wiped out… except for conserved quantities, like e.g. the total energy of the molecules, the number of molecules, or the size of the box. So, after a short time, the best predictions our observer can make about the gas will just be equivalent to using a Maxwell-Boltzmann distribution, conditioning only on the total energy (or equivalently temperature), number of particles, and volume.

It doesn't matter whether the observer is a human or a superintelligence or an alien, and it doesn't matter whether they have a radically different internal mind-architecture than we do; it is a property of the physical gas that those handful of parameters (energy, particle count, volume) summarize all the information which can actually be used to predict anything at all about the gas' motion after a relatively short time passes.
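To make the story concrete, here's a toy illustration: a crude time-stepped simulation of hard disks bouncing around a box, run twice from initial conditions differing by 10⁻⁸ in a single velocity component. The two runs' individual velocities diverge enormously, while the total kinetic energy stays essentially identical. All parameters here are made up for illustration; this is a sketch of the phenomenon, not a serious physics simulation.

```python
import numpy as np

def step(pos, vel, dt=0.005, box=1.0, radius=0.02):
    """Advance a 2D hard-disk gas by one (crude, time-stepped) step."""
    pos = pos + vel * dt
    # Reflect off the walls of the unit box.
    for d in range(2):
        hit = (pos[:, d] < radius) | (pos[:, d] > box - radius)
        vel[hit, d] *= -1
        pos[:, d] = np.clip(pos[:, d], radius, box - radius)
    # Elastic collisions between equal-mass disks: exchange the velocity
    # component along the line of centers (conserves kinetic energy exactly).
    n = len(pos)
    for i in range(n):
        for j in range(i + 1, n):
            delta = pos[i] - pos[j]
            d2 = delta @ delta
            if 0 < d2 < (2 * radius) ** 2:
                normal = delta / np.sqrt(d2)
                dv = (vel[i] - vel[j]) @ normal
                if dv < 0:  # only if the disks are approaching
                    vel[i] -= dv * normal
                    vel[j] += dv * normal
    return pos, vel

rng = np.random.default_rng(0)
n = 12
pos = rng.uniform(0.1, 0.9, (n, 2))
vel = rng.normal(0.0, 1.0, (n, 2))

# Two copies of the gas, differing by a tiny perturbation to one velocity.
pos_a, vel_a = pos.copy(), vel.copy()
pos_b, vel_b = pos.copy(), vel.copy()
vel_b[0, 0] += 1e-8

for _ in range(3000):
    pos_a, vel_a = step(pos_a, vel_a)
    pos_b, vel_b = step(pos_b, vel_b)

energy_a = 0.5 * np.sum(vel_a ** 2)
energy_b = 0.5 * np.sum(vel_b ** 2)
spread = np.max(np.abs(vel_a - vel_b))

# The conserved quantity stays (near-)identical; the perturbation is
# amplified by many orders of magnitude by the chaotic collisions.
print(f"energy A: {energy_a:.6f}, energy B: {energy_b:.6f}")
print(f"max velocity difference: {spread:.6f}")
```

The point of the toy: after enough collisions, the 10⁻⁸ difference has grown so large that detailed velocity predictions are hopeless, yet anyone who knows the total energy (plus particle count and box size) knows everything that remains predictable.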

The key point about the gas example is that it doesn’t talk much about any particular mind. It’s a story about how a particular abstraction is natural (e.g. the energy of a gas), and that story mostly talks about properties of the physical system (e.g. chaotic dynamics wiping out all signal except the energy), and mostly does not talk about properties of a particular mind. Thus, “territory-first”.

More generally: the territory-first prong is about looking for properties of (broad classes of) physical systems, which make particular abstractions uniquely well-suited to those systems. Just like (energy, particle count, volume) is an abstraction well-suited to an ideal gas because all other info is quickly wiped out by chaos.

What’s the “mind-first prong”?

Here’s an entirely different way one might try to learn about natural abstraction.

Take a neural net, and go train it on some data from real-world physical systems (e.g. images or video, ideally). Then, do some interpretability to figure out how the net is representing those physical systems internally, what information is being passed around in what format, etc. Repeat for a few different net architectures and datasets, and look for convergence in what stuff the net represents and how.

(Is this just interpretability? Sort of. Interp is a broad label; most things called “interpretability” are not particularly relevant to the mind-first prong of natural abstraction, but progress on the mind-first prong would probably be considered interp research.)

In particular, what we’d really like here is to figure out something about how patterns in the data end up represented inside the net, and then go look in the net to learn about natural abstractions out in the territory. Ideally, we could somehow nail down the “how the natural abstractions get represented in the net” part without knowing everything about what natural abstractions even are (i.e. what even is the thing being represented in the net), so that we could learn about their type signature by looking at nets.

More generally: the mind-first prong is about looking for convergent laws governing how patterns get "burned in" to trained/evolved systems like neural nets, and then using those laws to look inside nets trained on the real world, in order to back out facts about natural abstractions in the real world.

Note that anything one can figure out about real-world natural abstractions via looking inside nets (i.e. the mind-first prong) probably tells us a lot about the abstraction-relevant physical properties of physical systems (i.e. the territory-first prong), and vice versa.

So what has and hasn’t been figured out on the territory prong?

The territory prong has been our main focus for the past few years, and it was the main motivator for natural latents. Some key pieces which have already been nailed down to varying extents:

  • The Telephone Theorem: information which propagates over a nontrivial time/distance (like e.g. energy in our ideal gas example) must be approximately conserved.

  • Natural Latents: in the language of natural latents, information which propagates over a nontrivial time/distance (like e.g. energy in our ideal gas example) must be redundantly represented in many times/places—e.g. we can back out the same energy by looking at many different time-slices, or roughly the same energy by looking at many different little chunks of the gas. If, in addition to that redundancy, that information also mediates between time/space chunks, then we get some ontological guarantees: we've found all the information which propagates.

  • Some tricks which build on natural latents:

    • To some extent, natural latent conditions can nail down particular factorizations of high-level summaries, like e.g. representing a physical electronic circuit as a few separate wires, transistors, etc. We do this by looking for components of a high-level summary latent which are natural over different physical chunks of the system.

    • We can also use natural latent conditions to nail down particular clusterings, like in A Solomonoff Inductor Walks Into A Bar.

… but that doesn’t, by itself, give us everything we want to know from the territory prong.
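As a quick numerical illustration of the redundancy idea above: in a toy ideal gas at equilibrium, many disjoint "chunks" of particles each yield roughly the same temperature estimate, so the temperature is recoverable from almost any piece of the system. The units and parameters here are made up for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)

# Velocities of a million particles at temperature T (units with m = k_B = 1):
# each velocity component is Gaussian with variance T (Maxwell-Boltzmann).
T = 2.0
v = rng.normal(0.0, np.sqrt(T), size=(1_000_000, 3))

# Estimate the temperature from 100 disjoint chunks of the gas:
# <|v|^2> = 3T, so T_hat = mean(|v|^2) / 3 within each chunk.
chunks = np.split(v, 100)
estimates = [np.mean((c ** 2).sum(axis=1)) / 3 for c in chunks]
print(f"chunk estimates: min {min(estimates):.3f}, max {max(estimates):.3f}")
```

Every chunk gives back (approximately) the same latent value, which is exactly the redundancy condition natural latents formalize.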

Here are some likely next bottlenecks:

  • String diagrams. Pretty much every technical diagram you’ve ever seen, from electronic circuits to dependency graphs to ???, is a string diagram. Why is this such a common format for high-level descriptions? If it’s fully general for high-level natural abstraction, why, and can we prove it? If not, what is?

  • The natural latents machinery says a lot about what information needs to be passed around, but says a lot less about how to represent it. What representations are natural?

  • High-level dynamics or laws, like e.g. circuit laws or gas laws. The natural latents machinery might tell us e.g. which variables should appear in high-level laws/dynamics, but it doesn't say much about the relationships between those variables, i.e. the laws/dynamics themselves. What general rules exist for those laws/dynamics? How can they be efficiently figured out from the low level? How can they be efficiently represented in full generality?

  • How can we efficiently sample the low-level given the high-level? Sure, natural latents summarize all the information relevant at long distances. But even with long-range signals controlled for, we still don't know how to sample a small low-level neighborhood: we would first need to sample a boundary for the neighborhood, that boundary needs to be in-distribution, and getting an in-distribution boundary sample is itself not something we know how to do.

And what has and hasn’t been figured out on the mind prong?

The mind prong is much more wide open at this point; we understand it less than the territory prong.

What we’d ideally like is to figure out how environment structure gets represented in the net, without needing to know what environment structure gets represented in the net (or even what structure is in the environment in the first place). That way, we can look inside trained nets to figure out what structure is in the environment.

We have some foundational pieces:

  • Singular learning theory, or something like it, is probably a necessary foundational tool here. It doesn’t directly answer the core question about how environment structure gets represented in the net, but it does give us the right mental picture for thinking about things being “learned by the net” at all. (Though if you just want to understand the mental picture, this video is probably more helpful than reading a bunch of SLT.)

  • Natural latents and the Telephone Theorem might also be relevant insofar as we view the net itself as a low-level system which embeds some high-level logic. But that also doesn’t get at the core question about how environment structure gets represented in the net.

  • There’s a fair bit to be said about commutative diagrams. They, again, don’t directly address the core representation question. But they’re one of the most obvious foundational tools to try, and when applied to neural nets, they have some surprising approximate solutions—like e.g. sparse activations.

… but none of that directly hits the core of the problem.

If you want to get a rough sense of what a foothold on the core mind prong problem might look like, try Toward Statistical Mechanics of Interfaces Under Selection Pressure. That piece is not a solid, well-developed result; probably it’s not the right way to come at this. But it does touch on most of the relevant pieces; it gives a rough sense of the type of thing which we’re looking for.

Mostly, this is a wide open area which we’re working on pretty actively.