AI notkilleveryoneism researcher, focused on interpretability.
Personal account, opinions are my own.
I have signed no contracts or agreements whose existence I cannot mention.
Don’t you feel ashamed to spend so much time with AIs, given that you think they’ll likely put an end to humanity?
This reads a little like it’s assigning collective guilt to ‘AIs’ as a whole? I think a future misaligned superintelligence probably would want to kill us all, but I don’t see any evidence that Claude 4.7 does. If we do rush to superintelligence too quickly, current models probably end up just as dead as the rest of us.
Not quite. SLT is for a specific subcase of Bayesian learning only, not SGD. Maybe more importantly for this point, it also doesn’t really show why neural network priors are good, just that neural network priors strongly favour some solutions over others.
Some SLT-adjacent stuff is pretty strongly suggestive of a proper answer, but I don’t think there’s a proper full proof of what we want in generality written up publicly yet.
Thank you, that makes a lot more sense to me.
Question 2: In the drawing, “hedonic tone” is flowing from “genetically-hardwired circuitry”, i.e. what you call “innate drives”. But that’s not right—I get great pleasure from the joy of discovery, a close friendship, and so on, not just from innate drives like quenching my thirst or getting a massage!
Answer 2: I get this objection a lot
I also pretty immediately objected to this, but not for either theory 1 or theory 2 reasons. Instead, it’s this part:
Important caveat here: I do think that it’s possible to have innate drives that depend on what you’re thinking about, but I emphatically do not think that you can just intuitively write down some function of “what you’re thinking about” and say “this thing here is a plausible innate drive in the brain”. There’s another constraint: there has to be a way by which the genome can wire up such a reward function.
Given the constraint that a lot of the brain learns from scratch, I don’t see how you could genetically hardwire a circuit that generates all my subjective experiences of hedonic tone. I can imagine training a circuit that does this using e.g. a setup like the one for valence you describe here. But what your diagram seems to suggest is that hedonic tone itself is mostly[1] just the genetically hardwired signal that trains valence, meaning the hedonic tone circuits themselves are not learned and thus can’t be probing the internals of my learned algorithms for inputs. And then I just don’t see how you make those circuits recognise an email from a close friend, or the successful conclusion of a research project, or any of the other highly abstract learned things my hedonic tone seems responsive to.
“Mostly” because in that model I don’t directly experience the hedonic tone signal, just the post-processed world model my cortex learned that has hedonic tone as one of its inputs. But I also don’t see how you’d realistically get some of the features of my experience of hedonic tone out of that post-processing.
My model of Eliezer’s model wouldn’t say that. Link?
The acceleration of the work as a whole is not determined by the mean of the accelerations experienced by individual employees. If only the tightest bottleneck widens by 4x, that means you go roughly as fast as the second tightest bottleneck is wide, not 4x faster. So long as there is any bottleneck that isn’t widened and that’s less than 4x as wide as the former tightest bottleneck, the work as a whole will be sped up by less than 4x. It would be entirely possible for many or most employees to experience >4x speedup without the overall org moving all that much faster.[1]
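To make the arithmetic concrete, here’s a toy serial-workflow version of this point (all stage durations below are made up for the example):

```python
# Toy illustration: total time across serial stages when only the tightest
# bottleneck gets 4x faster. Stage durations are invented numbers.
stages = {"design": 10, "experiments": 40, "analysis": 25, "writeup": 15}

before = sum(stages.values())   # 90 units of time
stages["experiments"] /= 4      # speed up the tightest bottleneck by 4x
after = sum(stages.values())    # 60 units of time

print(before / after)           # ~1.5x overall, despite a 4x gain on one stage
```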
Additionally, this continues at the individual level. In my experience, if you ask people how much speedup they got from a major new model right after they get their hands on it, they tend to base their estimate on the tasks that used to occupy a lot of their time and that the model just sped up massively, and not yet really think about the tasks the model didn’t speed up and that are now the new bottleneck in their workflow.
Yes, they take a geometric mean rather than an arithmetic mean. I still don’t buy it.
I think the core intuition that makes me believe some sort of relatively simple edit might possibly achieve this comes from the observation that I can ask myself what plans I would make if I had some arbitrary different set of goals, and the plans my brain supplies in answer aren’t much worse than those I make for the goals I actually have. This indicates that my plan-making capacity is, at least on short time scales, essentially orthogonal to my goals and can be re-pointed in arbitrary directions very readily. If an edit can trigger that same process, but stop my brain from ever ceasing the mental motion of reasoning through the hypothetical, that would already be an impressive amount of targetable general optimisation power.
To be clear, I am not suggesting that the actual edit one would actually make to an ASI in real life looks much like making the ASI start a thought experiment or roleplay that never stops. (Though current “alignment” techniques for current AIs do seem to work sort of like that, and I think that actually isn’t entirely a coincidence.) I am just trying to gesture at an intuition pump for why one might think that the optimisation power of some general minds that occur in real life could be quite readily and precisely re-targetable if you can manipulate their internals.
A related intuition: Many general agents solve problems by, for example, recursively hacking them up into subproblems, or recursively relating them to easier problems, and then solving these other problems instead. To the extent the agents solve the many different problems using one general set of optimisation machinery, that machinery needs to be very readily and precisely retargetable at arbitrary problems. If you could get inside these retargeting loop(s), you could perhaps exploit them to point the agent along a very different optimisation trajectory, or make a new agent out of the existing agent relatively cheaply (there isn’t actually a hard distinction between these two options, of course).
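As a toy intuition pump for this ‘retargetable optimisation machinery’ picture (everything below is illustrative, not a claim about how real minds are implemented): a generic search procedure where the goal is just a swappable predicate, so ‘retargeting’ the optimiser means passing in a different goal while the machinery itself stays untouched.

```python
from collections import deque

def plan(start, neighbours, is_goal):
    """Generic breadth-first planner; the goal is an arbitrary predicate."""
    frontier, parents = deque([start]), {start: None}
    while frontier:
        state = frontier.popleft()
        if is_goal(state):
            path = []
            while state is not None:   # walk back up the parent links
                path.append(state)
                state = parents[state]
            return path[::-1]
        for nxt in neighbours(state):
            if nxt not in parents:
                parents[nxt] = state
                frontier.append(nxt)
    return None

# The same optimisation machinery, pointed at two very different goals:
neighbours = lambda n: [n + 1, n * 2]
print(plan(1, neighbours, is_goal=lambda n: n == 10))
print(plan(1, neighbours, is_goal=lambda n: n == 24))
```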
Fwiw, I similarly still find them bad at coming up with useful novel math research ideas, even as they’ve gotten much more competent at coding. Though they aren’t great at coding yet either.
However, I don’t think this ‘filling in the blanks’ is something fundamentally different in kind from ‘raw intelligence’. I don’t think there’s a hard boundary here. Anything that isn’t a literal lookup table is applying algorithms to extrapolate what it knows to new situations. Even something as minor as changing the tense of a memorised sentence is novel invention of a sort, just a tiny little bit. I think current LLMs can’t extrapolate as far as some humans yet, but the average distance they can extrapolate over seems to me to have increased over time. They’re still bad at coming up with novel math research ideas now, but three years ago they were much worse.
Separately from this, LLMs just know a lot of things most humans don’t, which can make them a value-add on some intellectual tasks even if they can’t extrapolate the things they know very far.
Yes, I was pointing it out because it seemed like the sort of problem that’d be caused by an issue in the structure of the actual extension rather than the AI model, and might thus be fixable.
Installed five minutes ago. It caught an apparent error I’d already slightly updated my world model on.
I expect it to make mistakes and miss things, but it seems performant enough to maybe be useful.
EDIT: I have now seen it make a big mistake. Still seems performant enough to maybe be useful.
EDIT2: I have now seen it make a really dumb mistake I wouldn’t have expected a frontier LLM to make. It claimed this passage
The resulting study was published earlier this month as Estimation and mapping of the missing heritability of human phenotypes, by Wainschtein, Yengo, et al.
was incorrect because
The paper was published online on November 12, 2025 (and listed as an Epub date on PubMed), not “earlier this month” relative to the post date (January 16, 2026)
when in fact the post was published on December 3, 2025.
I think you probably need to understand many things about minds a lot better than “evolution + genetics” understands biology before it makes much sense to try attacking questions about alignment mechanics in particular. To stick with the analogy, I suspect you might at least need the sort of mastery level where you understand mitochondria and DNA transcription well enough to build your own basic functional versions of them from scratch before you can even really get started.
I agree that ‘we are confused about agency’ is not a good slogan for pointing to this inadequacy. I think ‘we haven’t advanced practical mind science to anywhere near the level we’ve advanced e.g. condensed matter physics’ is true and a blocker for alignment of superintelligence, but ‘we are confused about agency’ brings up much stronger associations around memes like ‘maybe Bayesian EV maximisation is conceptually wrong even in the idealised setting’ to me. These meme groups seem sufficiently distinct to merit separate slogans.
I refrained from upvoting your comment despite agreeing with it.
Relatedly, I think the agreement vote button makes me less likely to upvote low-substance comments I agree with. It’s a convenient outlet for the instinct to make my support known. Posts don’t have an agreement button, though.
No, that is not what I am saying. I am saying that the typical reason these sorts of “misgeneralizations” happen is not that there are many parameter configurations on the neural network architecture that all get the same training loss but extrapolate very differently to new data. It’s that some parameter configurations that do not extrapolate to new data in the way the ML engineers want straight up get better loss on the training data than parameter configurations that do extrapolate in the way the ML engineers want.
I don’t think “overfitting” is really the right frame for what’s going on here. This isn’t a problem with neural networks having bad simplicity priors and choosing solutions that are more algorithmically complex than they need to be. Modern neural networks have pretty good simplicity priors. I don’t expect misaligned AIs to have larger effective parameter counts than aligned AIs. The problem isn’t that they overfit; the problem is that the algorithmically simplest fit to the training environment, the one that scores the lowest loss, often just doesn’t actually have the internal properties the ML engineers hoped it would have when they set up that training environment.
We’ve been seeing similar things when pruning graphs of language model computations generated with parameter decomposition. I have a suspicion that something like this might be going on in the recent neuron interpretability work as well, though I haven’t verified that. If you just zero or mean ablate lots of nodes in a very big causal graph, you can get basically any end result you want with very few nodes, because you can select sets of nodes to ablate that are computationally important but cancel each other out in exactly the way you need to get the right answer.[1]
I think the trick is to not do complete ablations, but instead ablate stochastically or even adversarially chosen subsets of nodes/edges:
You select the nodes you want to keep.
For the nodes you did not choose to keep, the adversary picks which of them to zero/mean ablate and which to leave alone, choosing the subset that makes the loss as high as possible.[2] We do this by optimising ablation masks for the nodes with gradient ascent.
This way, you also don’t need to freeze layer norms to prevent cheating.
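In code, the adversarial version looks roughly like the sketch below. This is a minimal self-contained toy, with a small two-layer net standing in for the decomposed model and its hidden units standing in for the nodes; the real setup differs, but the mask-optimisation logic is the part being illustrated.

```python
import torch

torch.manual_seed(0)
# Toy stand-ins: hidden units of a small net play the role of "nodes".
W1, W2 = torch.randn(8, 4), torch.randn(1, 8)
x, y = torch.randn(32, 4), torch.randn(32, 1)
mean_acts = torch.relu(x @ W1.T).mean(0)   # mean-ablation targets per node

def loss_under_mask(mask):
    h = torch.relu(x @ W1.T)
    h = mask * h + (1 - mask) * mean_acts  # mask=1 keeps a node, mask=0 mean-ablates
    return ((h @ W2.T - y) ** 2).mean()

keep = torch.tensor([1, 0, 0, 1, 0, 0, 0, 1], dtype=torch.bool)  # nodes we chose

# The adversary optimises a single mask for the whole batch over the non-kept
# nodes, by gradient ascent on the loss (a continuous relaxation of the binary
# ablate / don't-ablate choice).
mask_logits = torch.zeros(8, requires_grad=True)
opt = torch.optim.Adam([mask_logits], lr=0.1)
for _ in range(100):
    opt.zero_grad()
    mask = torch.where(keep, torch.ones(8), torch.sigmoid(mask_logits))
    (-loss_under_mask(mask)).backward()    # ascent: find the worst-case ablation
    opt.step()
```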
It’s for a different context, but we talk about the issue with using these sorts of naive ablation schemes to infer causality in Appendix A of the first parameter decomposition paper. This is why we switched to training decompositions with stochastically chosen ablations, and later switched to training them adversarially.
There’s some subtlety to this. You probably want certain restrictions placed on the adversary, because otherwise there are situations where it can also break faithful circuits by exploiting random noise. We use a scheme where the adversary has to pick one ablation scheme for a whole batch, specifying which nodes it does or does not want to ablate whenever they are not kept, to stop it from fine-tuning its use of unstructured noise to particular inputs.
If you don’t hate anything then you don’t love anything either.
This seems false to me. I have made some conscious effort to not feel hateful towards anyone or anything, and did not experience diminished feelings of love as a result of this. If anything, my impression is that it might have made me love more intensely.
Here are some reasons why an outer optimizer may produce an AI that has a misaligned inner objective according to the paper Risks from Learned Optimization in Advanced Machine Learning Systems:
Unidentifiability: …
Simplicity bias: …
My main reason for expecting misaligned inner objectives isn’t quite captured by either of these. Outside of toy situations, it’s rare in modern ML training for the solution with the lowest loss on the training data to actually be underdetermined in a meaningful sense. Rather, the main issue is that the data is almost always full of tiny systematic effects that we don’t understand or even know about. As a result, the inner objective an ML engineer might imagine would score the lowest loss when they set up their training environment will probably not, in fact, be the inner objective that actually does so. In other words, the problem isn’t that the best-scoring inner objective is genuinely underdetermined in the training loss landscape; it’s that it’s underdetermined to current-day human engineers, with very imperfect knowledge of the data and the training dynamics it induces, who are trying to intuit the answer in advance.
For example, an inner objective shaped around human-like empathy might turn out to make the AI spend an average of 0.03% extra inference steps worrying about whether the human overseers think it is a virtuous member of the tribe while it’s supposed to be solving math problems. That inner objective then loses out to some weird, different objective that’s slightly more compatible with being utterly focused while crunching through ten million calculus problems in a row without any other kind of sensory input.
For a non-fictional, current-day example, a lot of RLHF data turned out to reward agreeableness more than sincerity to an extent most ML engineers apparently did not anticipate, leading to a wave of sycophantic models.
This problem gets worse as AI training becomes more dominated by long-form RL environments with a lot of freedom for the AIs to do unexpected stuff, and as the AIs become more creative and agentic. An ML engineer trying to predict in advance which losses and datasets will favor AIs with inner objectives they like over ones they don’t like has a harder and harder time simulating in their head how those AIs might score on the training loss, because it is becoming less and less easy to guess what behaviors those objectives would actually lead to.
‘Internally coherent’, ‘explicit’, and ‘stable under reflection’ do not seem to me to be opposed to ‘simple’.
I also don’t think you’d necessarily need some sort of bias toward simplicity introduced by a genetic bottleneck to make human values tend (somewhat) toward simplicity.[1] Effective learning algorithms, like those in the human brain, always need a strong simplicity bias anyway to navigate their loss landscape and find good solutions without getting stuck. It’s not clear to me that the genetic bottleneck is actually doing any of the work here. Just like an AI can potentially learn complicated things and complicated values from its complicated and particular training data even if its loss function is simple, the human brain can learn complicated things and complicated values from its complicated and particular training data even if the reward functions in the brain stem are (somewhat) simple. The description length of the reward function doesn’t seem to make for a good bound on the description length of the values learned by the mind the reward function is training, because what the mind learns is also determined by the very high description length training data.[2]
I don’t think human values are particularly simple in absolute terms; they’re just not so big that they eat up all the spare capacity in the human brain.
At least so long as we consider description length under realistic computational bounds. If you have infinite compute for decompression or inference, you can indeed figure out the values with just a few bits, because the training data is ultimately generated by very simple physical laws, and so is the reward function.
I don’t think this is evidence that values are low-dimensional in the sense of having low description length. It shows that the models in question contain a one-dimensional subspace that indicates how things in the model’s current thoughts are judged along some sort of already-known goodness axis, not that the goodness axis itself is an algorithmically simple object. The floats that make up that subspace don’t describe goodness; they rely on the models’ pre-existing understanding of goodness to work. I’d guess the models also have only one or a very small number of directions for ‘elephant’; that doesn’t mean ‘elephant’ is a concept you could communicate with a single 16-bit float to an alien who’s never heard of elephants. The ‘feature dimension’ here is not the feature dimension relevant for predicting how many data samples it takes a mind to learn about goodness, or learn about elephants.
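For concreteness, the kind of object at issue is roughly the sketch below (made-up tensors; difference-of-means is just one common way such directions get extracted):

```python
import torch

d_model = 512
# Pretend these are a model's activations on inputs it judges good vs. bad.
acts_good = torch.randn(200, d_model) + 0.3
acts_bad = torch.randn(200, d_model) - 0.3

# One direction in activation space, e.g. from a difference of means.
direction = acts_good.mean(0) - acts_bad.mean(0)
direction = direction / direction.norm()

def judge(acts):
    # A single float per input: a readout along an axis the model already
    # represents. The direction's floats don't define goodness by themselves;
    # they only work relative to the model's learned activation geometry.
    return acts @ direction
```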
Well, it’s not like the method can’t find components that are causally important on many sequence positions. E.g. we show how you can capture the generic QK previous-token behaviour in an attention layer with this using just two rank-1 subcomponents, one in the query matrix and one in the key matrix. And as you might expect from such a generic behaviour, those two are both used pretty much on every token.
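Concretely, ‘two rank-1 subcomponents’ means something like the sketch below (shapes and indices are made up; the real subcomponents come out of the trained decomposition, not a random init):

```python
import torch

d_model, n_components = 64, 100
# Each weight matrix is decomposed into a sum of rank-1 terms u_c v_c^T.
U_q, V_q = torch.randn(n_components, d_model), torch.randn(n_components, d_model)
U_k, V_k = torch.randn(n_components, d_model), torch.randn(n_components, d_model)

def qk_logits(x, keep_q, keep_k):
    # Rebuild W_Q and W_K from only the retained subcomponents.
    W_q = U_q[keep_q].T @ V_q[keep_q]
    W_k = U_k[keep_k].T @ V_k[keep_k]
    return (x @ W_q.T) @ (x @ W_k.T).T   # attention logits for the sequence

x = torch.randn(10, d_model)             # activations for a 10-token sequence
# One subcomponent in the query matrix and one in the key matrix can already
# carry a generic behaviour like previous-token attention.
logits = qk_logits(x, keep_q=[3], keep_k=[7])
```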
I guess if there are lots of circuits in the same layer that are all used literally on every sequence position of every prompt, this method would have trouble teasing those circuits apart from each other. But as soon as the circuits involved aren’t used basically all of the time on all data, it gets a lot more manageable. Like, practically speaking I don’t know if the method could currently correctly separate two circuits in a model that are both active on a very large fraction of all tokens in the dataset under realistic conditions, but in theory it should be able to. Doing so would lower the training loss.[1] It’d just be about making the optimisation work well enough.
Unless the circuits are also perfectly correlated in when they’re used.