AI notkilleveryoneism researcher, focused on interpretability.
Personal account, opinions are my own.
I have signed no contracts or agreements whose existence I cannot mention.
Installed five minutes ago. Caught an apparent error I’d previously slightly updated my world model on already.
I expect it to make mistakes and miss things, but it seems performant enough to maybe be useful.
EDIT: I have now seen it make a big mistake. Still seems performant enough to maybe be useful.
EDIT2: I have now seen it make a really dumb mistake I wouldn’t have expected a frontier LLM to make. It claimed this passage
The resulting study was published earlier this month as Estimation and mapping of the missing heritability of human phenotypes, by Wainschtein, Yengo, et al.
was incorrect because
The paper was published online on November 12, 2025 (and listed as an Epub date on PubMed), not “earlier this month” relative to the post date (January 16, 2026)
when in fact the post was published on December 3, 2025.
I think you probably need to understand many things about minds a lot better than “evolution + genetics” understands biology before it makes much sense to try attacking questions about alignment mechanics in particular. To stick with the analogy, I suspect you might at least need the sort of mastery level where you understand Mitochondria and DNA transcription well enough to build your own basic functional versions of them from scratch before you can even really get started.
I agree that ‘we are confused about agency’ is not a good slogan for pointing to this inadequacy. I think ‘we haven’t advanced practical mind science to anywhere near the level we’ve advanced e.g. condensed matter physics’ is true and a blocker for alignment of superintelligence, but ‘we are confused about agency’ brings up much stronger associations around memes like ‘maybe Bayesian EV maximisation is conceptually wrong even in the idealised setting’ to me. These meme groups seem sufficiently distinct to merit separate slogans.
I refrained from upvoting your comment despite agreeing with it.
Relatedly, I think the agreement vote button makes me less likely to upvote low-substance comments I agree with. It’s a convenient outlet for the instinct to make my support known. Posts don’t have an agreement button though.
No, that is not what I am saying. I am saying that the typical reason these sorts of “misgeneralizations” happen is not that there are many parameter configurations of the neural network architecture that all get the same training loss but extrapolate very differently to new data. It’s that some parameter configurations that do not extrapolate to new data in the way the ML engineers want straight up get better loss on the training data than parameter configurations that do.
I don’t think “overfitting” is really the right frame for what’s going on here. This isn’t a problem with neural networks having bad simplicity priors and choosing solutions that are more algorithmically complex than they need to be. Modern neural networks have pretty good simplicity priors. I don’t expect misaligned AIs to have larger effective parameter counts than aligned AIs. The problem isn’t that they overfit; the problem is that the algorithmically simplest fit to the training environment that scores the lowest loss often just doesn’t have the internal properties the ML engineers hoped it would have when they set up that training environment.
We’ve been seeing similar things when pruning graphs of language model computations generated with parameter decomposition. I have a suspicion that something like this might be going on in the recent neuron interpretability work as well, though I haven’t verified that. If you just zero or mean ablate lots of nodes in a very big causal graph, you can get basically any end result you want with very few nodes, because you can select sets of nodes to ablate that are computationally important but cancel each other out in exactly the way you need to get the right answer.[1]
I think the trick is to not do complete ablations, but instead ablate stochastically or even adversarially chosen subsets of nodes/edges:
You select the nodes you want to keep.
The adversary picks which of the nodes you did not choose to keep it wants to zero/mean ablate or not zero/mean ablate, picking subsets that make the loss as high as possible.[2] We do this by optimising masks for the nodes with gradient ascent.
This way, you also don’t need to freeze layer norms to prevent cheating.
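A minimal toy sketch of the cancellation problem and the adversarial fix (the numbers and the exhaustive adversary here are made up for illustration; in practice we optimise mask logits with gradient ascent, as described above):

```python
import itertools

# Toy "computational graph": the model's output is just the sum of node
# contributions. Nodes 1 and 2 are individually important but cancel out.
contributions = [3.0, 5.0, -5.0]   # node 0, node 1, node 2
full_output = sum(contributions)

def output_with_ablation(ablated):
    # Zero-ablate the given set of node indices.
    return sum(c for i, c in enumerate(contributions) if i not in ablated)

kept = {0}                          # circuit claim: "node 0 suffices"
candidates = [i for i in range(len(contributions)) if i not in kept]

# Naive complete ablation: zero every non-kept node at once.
naive_loss = abs(output_with_ablation(set(candidates)) - full_output)

# Adversary: search over subsets of the non-kept nodes for the ablation
# pattern that maximises the loss (exhaustive here because the graph is
# tiny; gradient ascent over mask logits in the real setting).
adv_loss = max(
    abs(output_with_ablation(set(s)) - full_output)
    for r in range(len(candidates) + 1)
    for s in itertools.combinations(candidates, r)
)

print(naive_loss)  # 0.0 -> nodes 1 and 2 cancel, the circuit looks faithful
print(adv_loss)    # 5.0 -> ablating only one of them exposes the reliance
```

The naive complete ablation reports zero damage even though the kept “circuit” ignores two computationally important nodes; the adversarially chosen partial ablation reveals them.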
It’s for a different context, but we talk about the issue with using these sorts of naive ablation schemes to infer causality in Appendix A of the first parameter decomposition paper. This is why we switched to training decompositions with stochastically chosen ablations, and later switched to training them adversarially.
There’s some subtlety to this. You probably want certain restrictions placed on the adversary, because otherwise there’s situations where it can also break faithful circuits by exploiting random noise. We use a scheme where the adversary has to pick one ablation scheme for a whole batch, specifying what nodes it does or does not want to ablate whenever they are not kept, to stop it from fine tuning unstructured noise for particular inputs.
If you don’t hate anything then you don’t love anything either.
This seems false to me. I have made some conscious effort to not feel hateful towards anyone or anything, and did not experience diminished feelings of love as a result of this. If anything, my impression is that it might have made me love more intensely.
Here are some reasons why an outer optimizer may produce an AI that has a misaligned inner objective according to the paper Risks from Learned Optimization in Advanced Machine Learning Systems:
Unidentifiability: …
Simplicity bias: …
My main reason for expecting misaligned inner objectives isn’t quite captured by either of these. Outside of toy situations, it’s rare in modern ML training for the solution with the lowest loss on the training data to actually be underdetermined in a meaningful sense. Rather, the main issue is that the data is almost always full of tiny systematic effects that we don’t understand or even know about. As a result, the inner objective an ML engineer might imagine would score the lowest loss when they set up their training environment will probably not, in fact, be the inner objective that actually does so. In other words, the problem isn’t that the best-scoring inner objective is genuinely underdetermined in the training loss landscape; it’s that it’s underdetermined to current-day human engineers, with very imperfect knowledge of the data and the training dynamics it induces, who are trying to intuit the answer in advance.
For example, an inner objective shaped around human-like empathy might turn out to make the AI spend an average of 0.03% extra inference steps worrying about whether the human overseers think it is a virtuous member of the tribe while it’s supposed to be solving math problems. That inner objective then loses out to some weird, different objective that’s slightly more compatible with being utterly focused while crunching through ten million calculus problems in a row without any other kind of sensory input.
For a non-fictional current-day example, a lot of RLHF data turned out to reward agreeableness more than sincerity to an extent most ML engineers apparently did not anticipate, leading to a wave of sycophantic models.
This problem gets worse as AI training becomes more dominated by long-form RL environments with a lot of freedom for the AIs to do unexpected stuff, and as the AIs become more creative and agentic. An ML engineer trying to predict in advance which losses and datasets will favor AIs with inner objectives they like over ones they don’t like has a harder and harder time simulating in their head how those AIs might score on the training loss, because it is getting less and less easy to guess what behaviors those objectives would actually lead to.
‘Internally coherent’, ‘explicit’, and ‘stable under reflection’ do not seem to me to be opposed to ‘simple’.
I also don’t think you’d necessarily need some sort of bias toward simplicity introduced by a genetic bottleneck to make human values tend (somewhat) toward simplicity.[1] Effective learning algorithms, like those in the human brain, always need a strong simplicity bias anyway to navigate their loss landscape and find good solutions without getting stuck. It’s not clear to me that the genetic bottleneck is actually doing any of the work here. Just like an AI can potentially learn complicated things and complicated values from its complicated and particular training data even if its loss function is simple, the human brain can learn complicated things and complicated values from its complicated and particular training data even if the reward functions in the brain stem are (somewhat) simple. The description length of the reward function doesn’t seem to make for a good bound on the description length of the values learned by the mind the reward function is training, because what the mind learns is also determined by the very high description length training data.[2]
To be clear, I don’t think human values are particularly simple at all; they’re just not so big that they eat up all spare capacity in the human brain.
At least so long as we consider description length under realistic computational bounds. If you have infinite compute for decompression or inference, you can indeed figure out the values with just a few bits, because the training data is ultimately generated by very simple physical laws, and so is the reward function.
I don’t think this is evidence that values are low-dimensional in the sense of having low description length. It shows that the models in question contain a one-dimensional subspace that indicates how things in the model’s current thoughts are judged along some sort of already-known goodness axis, not that the goodness axis itself is an algorithmically simple object. The floats that make up that subspace don’t describe goodness; they rely on the models’ pre-existing understanding of goodness to work. I’d guess the models also have only one or a very small number of directions for ‘elephant’, but that doesn’t mean ‘elephant’ is a concept you could communicate with a single 16-bit float to an alien who’s never heard of elephants. The ‘feature dimension’ here is not the feature dimension relevant for predicting how many data samples it takes a mind to learn about goodness, or learn about elephants.
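For concreteness, here is a minimal sketch (all dimensions, names, and numbers invented) of how such a one-dimensional subspace can be read off with a difference-of-means probe, and of why the probe vector only indexes a direction in this particular model’s activation space rather than describing the concept itself:

```python
import numpy as np

rng = np.random.default_rng(0)
d = 64                                   # hypothetical model width

# Planted "goodness" direction inside an otherwise random embedding space.
goodness = rng.normal(size=d)
goodness /= np.linalg.norm(goodness)

def embed(valence, n):
    # Activations = isotropic noise + valence * goodness direction.
    return rng.normal(size=(n, d)) + valence * goodness

good_acts, bad_acts = embed(+3.0, 200), embed(-3.0, 200)

# Difference-of-means probe: a single vector, i.e. a 1-D subspace.
probe = good_acts.mean(axis=0) - bad_acts.mean(axis=0)
probe /= np.linalg.norm(probe)

# The probe aligns with the planted axis, but its 64 floats are only
# meaningful relative to this model's representation; handed to a system
# with different activations, they say nothing about goodness.
print(abs(probe @ goodness))  # close to 1.0
```

The probe works only because the representation already encodes the concept; the same point applies to a hypothetical single ‘elephant’ direction.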
Fwiw I think I feel companionate love, to the point of sometimes experiencing a sort of regret for not being able to hug everyone in the universe, and getting emotionally attached to random trees, rocks, frozen peas[1], and old pairs of shoes[2] when I was a kid. And I also recall reading this and thinking: “Screw Green.”
After my mother explained to me that the pea seeds were intended to make new pea plants, I felt guilty for us eating them. For a while I insisted my mother throw a few frozen peas out the window into the tree line every time we cooked with them, because my ca. four year old brain figured that way at least a few of them might have some chance to become new pea plants.
Being ca. four years old, I was growing pretty quickly and got too big for my previous pair of shoes and my parents wanted to throw them away. I felt horrible for betraying the poor friendly shoes like that, so my parents allowed me to keep them on my shelf for a few years until I got old enough to internalise that shoes aren’t people and don’t have qualia.
Sorry, I’m a law dummy: Are these maximum penalties one-offs? As in, can a company just pay $1M in fines whenever they’re caught ignoring CA SB 53 and then go right on ignoring it with no escalating consequences?
Singular learning theory, or something like it, is probably a necessary foundational tool here. It doesn’t directly answer the core question about how environment structure gets represented in the net, but it does give us the right mental picture for thinking about things being “learned by the net” at all. (Though if you just want to understand the mental picture, this video is probably more helpful than reading a bunch of SLT.)
I think this is probably wrong. Vanilla SLT describes a toy case of how Bayesian learning on neural networks works. I think there is a big difference between Bayesian learning, which requires visiting every single point in the loss landscape and trying them all out on every data point, and local learning algorithms, such as evolution, stochastic gradient descent, AdamW, etc., which try to find a good solution using information from just a small number of local neighbourhoods in the loss landscape. Those local learning algorithms are the ones I’d expect to be used by real minds, because they’re much more compute efficient.
I think this locality property matters a lot. It introduces additional, important constraints on what nets can feasibly learn. It’s where path dependence in learning comes from. I think vanilla SLT was probably a good tutorial for us before delving into the more realistic and complicated local learning case, but there’s still work to do to get us to an actually roughly accurate model of how nets learn things.
If a solution consists of internal pieces of machinery that need to be arranged exactly right to do anything useful at all, a local algorithm will need something like $e^{\lambda n}$ update steps to learn it, where $n$ is the effective parameter count of the whole solution.[1] In other words, it won’t do better than a random walk that aimlessly wanders around the loss landscape until it runs into a point with low loss by sheer chance. But if a solution with internal pieces of machinery can instead be learned in small chunks that each individually decrease the loss a little bit, the leading term in the number of update steps required to find that solution scales exponentially with the size of the single biggest solution chunk, rather than with the size of the whole solution. So, if the biggest chunk has size $c$, the total learning time will be around $e^{\lambda c}$.[2]
For an example where the solution cannot be learned in chunks like this, see the subset parity learning problem, where SGD really does need a number of update steps exponential in the effective parameter count of the whole solution to learn it, which for most practical purposes means it cannot learn the solution at all.
For a net to learn a big and complicated solution with high Local Learning Coefficient (LLC), it needs a learning story to find the solution’s basin in the loss landscape in a feasible timeframe. It can’t just rely on random walking; that takes too long. The expected total time it takes the net to get to a basin is, I think, determined mostly by the dimensionality of the mode connections from that basin to the rest of the landscape, not just by the dimensionality of the basin itself, as would be the case for the sort of global, Bayesian learning modelled by vanilla SLT. The geometry of those connections is the core mathematical object that reflects the structure of the learning process and determines the learnability of a solution.[3] Learning a big solution chunk that increases the total LLC by a lot in one go means needing to find a very low-dimensional mode connection to traverse. This takes a long time, because the connection interface is very small compared to the size of the search space. To learn a smaller chunk that increases the total LLC by less, the net only needs to reach a higher-dimensional mode connection, which will have an exponentially larger interface that is thus exponentially quicker to find.[4]
I agree that vanilla SLT seems like a useful tool for developing the right mental picture of how nets learn things, but it is not itself that picture. The simplified Bayesian learning case is instructive for illuminating the connection between learning and loss landscape geometry in the most basic setting, but taken on its own it’s still failing to capture a lot of the structure of learning in real minds.
Where $\lambda$ is some constant which probably depends on the details of the update algorithm.
I’m not going to add “I think” and “I suspect” to every sentence in this comment, but you should imagine them being there. I haven’t actually worked this out in math properly or tested it.
At least for a specific dataset and architecture. Modelling changes in the geometry of the loss landscape if we allow dataset and architecture to vary based on the mind’s own decisions as it learns might be yet another complication we’ll need to deal with in the future, once we start thinking about theories of learning for RL agents with enough freedom and intelligence to pick their learning curricula themselves.
To get the rough idea across I’m focusing here on the very basic case where the “chunks” are literal pieces of the final solution and each of them lowers the loss a little and increases the total LLC a little. In general, this doesn’t have to be true though. For example, a solution D with effective parameter count 120 might be learned by first learning independent chunks A and B, each with effective parameter count 50, then learning a chunk C with effective parameter count 30 which connects the formerly independent A and B together into a single mechanistic whole to form solution D. The expected number of update steps in this learning story would be $\sim e^{50\lambda} + e^{50\lambda} + e^{30\lambda} \approx 2e^{50\lambda}$, rather than the $\sim e^{120\lambda}$ it would take to find D in one go.
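Under this toy exponential scaling, and with a made-up value for the rate constant (here called lam, standing in for the constant from the footnote above), the chunked-versus-monolithic arithmetic can be sketched as:

```python
import math

# Hypothetical rate constant; the value 0.1 is invented for illustration.
lam = 0.1

def steps(chunk_size):
    # Toy model: update steps to find a chunk of effective size c scale
    # like exp(lam * c).
    return math.exp(lam * chunk_size)

# Learning solution D (effective parameter count 120) in one monolithic go:
monolithic = steps(120)

# Learning it via independent chunks A (50) and B (50), then connector C (30):
chunked = steps(50) + steps(50) + steps(30)

print(monolithic > chunked)  # True: the chunked learning story is far faster
```

With these numbers the monolithic search takes hundreds of times longer than the chunked one, and the gap widens rapidly as lam or the chunk-size disparity grows.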
This was my favourite solstice to date. Thank you.
I just meant that if an oracle told me ASI was coming in two years, I probably couldn’t spend down energy reserves to get more done within that timeframe compared to being told it’ll take ten years. I might feel a greater sense of urgency than I already do and perhaps end up working longer hours as a result, but if so that’d probably be an unendorsed emotional response I couldn’t help rather than a considered plan. I kind of doubt I’d actually get more done that way. Some slack for curiosity and play is required for me to do my job well.
The stakes are already so high and time so short that varying either within an order of magnitude up or down really doesn’t change things all that much.
I guess figuring out whether we’re “in a bubble” just hasn’t seemed very important to me, relative to how hard it seems to determine? What effects on the strategic calculus do you think it has?
E.g. my current best guess is that I personally should just do what I can to help build the science of interpretability and learning as fast as possible, so we can get to a point where we can start doing proper alignment research and reason more legibly about why alignment might be very hard and what could go wrong. Whether we’re in a bubble or not mostly matters for that only insofar as it’s one factor influencing how much time we have left to do that research.
But I’m already going about as fast as I can anyway, so having a better estimate of timelines isn’t very action-relevant for me. And “bubble vs. no bubble” doesn’t even seem like a leading-order term in timeline uncertainty anyway.
Yeah, the observation that the universe seems maybe well-predicted by a program running on some UTM is a subset of the observation that the universe seems amenable to mathematical description and compression. So the former observation isn’t really an explanation for the latter, just a kind of restatement. We’d need an argument for why a prior over random programs running on a UTM should be preferred over a prior over random strings. Why does the universe have structure? The Universal Prior isn’t an answer to that question. It’s just an attempt to write down a sensible prior that takes the observation that the universe is structured and apparently predictable into account.
See footnote. Since this permutation freedom always exists no matter what the learned algorithm is, it can’t tell us anything about the learned algorithm.
… Wait, are you saying we’re not propagating updates into to change the mass it puts on inputs vs. ?
Yes, I was pointing it out because it seemed like the sort of problem that’d be caused by an issue in the structure of the actual extension rather than the AI model, and might thus be fixable.