AI notkilleveryoneism researcher, focused on interpretability.
Personal account, opinions are my own.
I have signed no contracts or agreements whose existence I cannot mention.
Don’t you feel ashamed to spend so much time with AIs, given that you think they’ll likely put an end to humanity?
This reads a little like it’s assigning collective guilt to ‘AIs’ as a whole? I think a future misaligned superintelligence probably would want to kill us all, but I don’t see any evidence that Claude 4.7 does. If we do rush to superintelligence too quickly, current models probably end up just as dead as the rest of us.
Not quite. SLT is for a specific subcase of Bayesian learning only, not SGD. Maybe more importantly for this point, it also doesn’t really show why neural network priors are good, just that neural network priors strongly favour some solutions over others.
Some SLT-adjacent stuff is pretty strongly suggestive of a proper answer, but I don’t think there’s a proper full proof of what we want in generality written up publicly yet.
Thank you, that makes a lot more sense to me.
Question 2: In the drawing, “hedonic tone” is flowing from “genetically-hardwired circuitry”, i.e. what you call “innate drives”. But that’s not right—I get great pleasure from the joy of discovery, a close friendship, and so on, not just from innate drives like quenching my thirst or getting a massage!
Answer 2: I get this objection a lot
I also pretty immediately objected to this, but not for either theory 1 or theory 2 reasons. Instead, it’s this part:
Important caveat here: I do think that it’s possible to have innate drives that depend on what you’re thinking about, but I emphatically do not think that you can just intuitively write down some function of “what you’re thinking about” and say “this thing here is a plausible innate drive in the brain”. There’s another constraint: there has to be a way by which the genome can wire up such a reward function.
Given the constraint that a lot of the brain learns from scratch, I don’t see how you could genetically hardwire a circuit that generates all my subjective experiences of hedonic tone. I can imagine training a circuit that does this using e.g. a setup like the one for valence you describe here. But what your diagram seems to suggest is that hedonic tone itself is mostly[1] just the genetically hardwired signal that trains valence, meaning the hedonic tone circuits themselves are not learned and thus can’t be probing the internals of my learned algorithms for inputs. And then I just don’t see how you make those circuits recognise an email from a close friend, or the successful conclusion of a research project, or any of the other highly abstract learned things my hedonic tone seems responsive to.
“Mostly” because in that model I don’t directly experience the hedonic tone signal, just the post-processed world model my cortex learned that has hedonic tone as one of its inputs. But I also don’t see how you’d realistically get some of the features of my experience of hedonic tone out of that post-processing.
My model of Eliezer’s model wouldn’t say that. Link?
The acceleration of the work as a whole is not determined by the mean of the accelerations experienced by individual employees. If only the tightest bottleneck widens by 4x, that means you go roughly as fast as the second tightest bottleneck is wide, not 4x faster. So long as there is any bottleneck that isn’t widened and that’s less than 4x as wide as the former tightest bottleneck, the work as a whole will be sped up by less than 4x. It would be entirely possible for many or most employees to experience >4x speedup without the overall org moving all that much faster.[1]
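To make the arithmetic concrete, here’s a toy serial-workflow version of this point (all stage durations below are made up for the example):

```python
# Toy illustration: total time across serial stages when only the tightest
# bottleneck gets 4x faster. Stage durations are invented numbers.
stages = {"design": 10, "experiments": 40, "analysis": 25, "writeup": 15}

before = sum(stages.values())   # 90 units of time
stages["experiments"] /= 4      # speed up the tightest bottleneck by 4x
after = sum(stages.values())    # 60 units of time

print(before / after)           # ~1.5x overall, despite a 4x gain on one stage
```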
Additionally, this continues at the individual level. In my experience, if you ask people how much speedup they got from a major new model right after they get their hands on it, they tend to base their estimate on the tasks that used to occupy a lot of their time and that the model just sped up massively, and not yet really think about the tasks the model didn’t speed up and that are now the new bottleneck in their workflow.
Yes, they take a geometric mean rather than an arithmetic mean. I still don’t buy it.
I think the core intuition that makes me believe some sort of relatively simple edit might possibly achieve this comes from the observation that I can ask myself what plans I would make if I had some arbitrary different set of goals, and the plans my brain supplies in answer aren’t much worse than those I make for the goals I actually have. This indicates that my plan-making capacity is, at least on short time scales, essentially orthogonal to my goals and can be re-pointed in arbitrary directions very readily. If an edit can trigger that same process, but stop my brain from ever ceasing the mental motion of reasoning through the hypothetical, that would already be an impressive amount of targetable general optimisation power.
To be clear, I am not suggesting that the actual edit one would actually make to an ASI in real life looks much like making the ASI start a thought experiment or roleplay that never stops. (Though current “alignment” techniques for current AIs do seem to work sort of like that, and I think that actually isn’t entirely a coincidence.) I am just trying to gesture at an intuition pump for why one might think that the optimisation power of some general minds that occur in real life could be quite readily and precisely re-targetable if you can manipulate their internals.
A related intuition: Many general agents solve problems by, for example, recursively hacking them up into subproblems, or recursively relating them to easier problems, and then solving these other problems instead. To the extent the agents solve the many different problems using one general set of optimisation machinery, that machinery needs to be very readily and precisely retargetable at arbitrary problems. If you could get inside these retargeting loop(s), you could perhaps exploit them to point the agent along a very different optimisation trajectory, or make a new agent out of the existing agent relatively cheaply (there isn’t actually a hard distinction between these two options, of course).
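As a toy intuition pump for this ‘retargetable optimisation machinery’ picture (everything below is illustrative, not a claim about how real minds are implemented): a generic search procedure where the goal is just a swappable predicate, so ‘retargeting’ the optimiser means passing in a different goal while the machinery itself stays untouched.

```python
from collections import deque

def plan(start, neighbours, is_goal):
    """Generic breadth-first planner; the goal is an arbitrary predicate."""
    frontier, parents = deque([start]), {start: None}
    while frontier:
        state = frontier.popleft()
        if is_goal(state):
            path = []
            while state is not None:   # walk back up the parent links
                path.append(state)
                state = parents[state]
            return path[::-1]
        for nxt in neighbours(state):
            if nxt not in parents:
                parents[nxt] = state
                frontier.append(nxt)
    return None

# The same optimisation machinery, pointed at two very different goals:
neighbours = lambda n: [n + 1, n * 2]
print(plan(1, neighbours, is_goal=lambda n: n == 10))
print(plan(1, neighbours, is_goal=lambda n: n == 24))
```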
Fwiw, I similarly still find them bad at coming up with useful novel math research ideas, even as they’ve gotten much more competent at coding. Though they aren’t great at coding yet either.
However, I don’t think this ‘filling in the blanks’ is something fundamentally different in kind from ‘raw intelligence’. I don’t think there’s a hard boundary here. Anything that isn’t a literal lookup table is applying algorithms to extrapolate what it knows to new situations. Even something as minor as changing the tense of a memorised sentence is novel invention of a sort, just a tiny little bit. I think current LLMs can’t extrapolate as far as some humans yet, but the average distance they can extrapolate over seems to me to have increased over time. They’re still bad at coming up with novel math research ideas now, but three years ago they were much worse.
Separately from this, LLMs just know a lot of things most humans don’t, which can make them a value-add on some intellectual tasks even if they can’t extrapolate the things they know very far.
Yes, I was pointing it out because it seemed like the sort of problem that’d be caused by an issue in the structure of the actual extension rather than the AI model, and might thus be fixable.
Installed five minutes ago. It caught an apparent error I’d already slightly updated my world model on.
I expect it to make mistakes and miss things, but it seems performant enough to maybe be useful.
EDIT: I have now seen it make a big mistake. Still seems performant enough to maybe be useful.
EDIT2: I have now seen it make a really dumb mistake I wouldn’t have expected a frontier LLM to make. It claimed this passage
The resulting study was published earlier this month as Estimation and mapping of the missing heritability of human phenotypes, by Wainschtein, Yengo, et al.
was incorrect because
The paper was published online on November 12, 2025 (and listed as an Epub date on PubMed), not “earlier this month” relative to the post date (January 16, 2026)
when in fact the post was published on December 3, 2025.
I think you probably need to understand many things about minds a lot better than “evolution + genetics” understands biology before it makes much sense to try attacking questions about alignment mechanics in particular. To stick with the analogy, I suspect you might at least need the sort of mastery level where you understand mitochondria and DNA transcription well enough to build your own basic functional versions of them from scratch before you can even really get started.
I agree that ‘we are confused about agency’ is not a good slogan for pointing to this inadequacy. I think ‘we haven’t advanced practical mind science to anywhere near the level we’ve advanced e.g. condensed matter physics’ is true and a blocker for alignment of superintelligence, but ‘we are confused about agency’ brings up much stronger associations around memes like ‘maybe Bayesian EV maximisation is conceptually wrong even in the idealised setting’ to me. These meme groups seem sufficiently distinct to merit separate slogans.
I refrained from upvoting your comment despite agreeing with it.
Relatedly, I think the agreement vote button makes me less likely to upvote low-substance comments I agree with. It’s a convenient outlet for the instinct to make my support known. Posts don’t have an agreement button, though.
No, that is not what I am saying. I am saying that the typical reason these sorts of “misgeneralizations” happen is not that there are many parameter configurations on the neural network architecture that all get the same training loss but extrapolate very differently to new data. It’s that some parameter configurations that do not extrapolate to new data in the way the ML engineers want straight up get better loss on the training data than parameter configurations that do extrapolate in the way the ML engineers want.
I don’t think “overfitting” is really the right frame for what’s going on here. This isn’t a problem with neural networks having bad simplicity priors and choosing solutions that are more algorithmically complex than they need to be. Modern neural networks have pretty good simplicity priors. I don’t expect misaligned AIs to have larger effective parameter counts than aligned AIs. The problem isn’t that they overfit; the problem is that the algorithmically simplest fit to the training environment, the one that scores the lowest loss, often just doesn’t actually have the internal properties the ML engineers hoped it would have when they set up that training environment.
We’ve been seeing similar things when pruning graphs of language model computations generated with parameter decomposition. I have a suspicion that something like this might be going on in the recent neuron interpretability work as well, though I haven’t verified that. If you just zero or mean ablate lots of nodes in a very big causal graph, you can get basically any end result you want with very few nodes, because you can select sets of nodes to ablate that are computationally important but cancel each other out in exactly the way you need to get the right answer.[1]
I think the trick is to not do complete ablations, but instead ablate stochastically or even adversarially chosen subsets of nodes/edges:
You select the nodes you want to keep.
For the nodes you did not choose to keep, the adversary picks which of them to zero/mean ablate and which to leave alone, choosing the subset that makes the loss as high as possible.[2] We do this by optimising ablation masks for the nodes with gradient ascent.
This way, you also don’t need to freeze layer norms to prevent cheating.
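In code, the adversarial version looks roughly like the sketch below. This is a minimal self-contained toy, with a small two-layer net standing in for the decomposed model and its hidden units standing in for the nodes; the real setup differs, but the mask-optimisation logic is the part being illustrated.

```python
import torch

torch.manual_seed(0)
# Toy stand-ins: hidden units of a small net play the role of "nodes".
W1, W2 = torch.randn(8, 4), torch.randn(1, 8)
x, y = torch.randn(32, 4), torch.randn(32, 1)
mean_acts = torch.relu(x @ W1.T).mean(0)   # mean-ablation targets per node

def loss_under_mask(mask):
    h = torch.relu(x @ W1.T)
    h = mask * h + (1 - mask) * mean_acts  # mask=1 keeps a node, mask=0 mean-ablates
    return ((h @ W2.T - y) ** 2).mean()

keep = torch.tensor([1, 0, 0, 1, 0, 0, 0, 1], dtype=torch.bool)  # nodes we chose

# The adversary optimises a single mask for the whole batch over the non-kept
# nodes, by gradient ascent on the loss (a continuous relaxation of the binary
# ablate / don't-ablate choice).
mask_logits = torch.zeros(8, requires_grad=True)
opt = torch.optim.Adam([mask_logits], lr=0.1)
for _ in range(100):
    opt.zero_grad()
    mask = torch.where(keep, torch.ones(8), torch.sigmoid(mask_logits))
    (-loss_under_mask(mask)).backward()    # ascent: find the worst-case ablation
    opt.step()
```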
It’s for a different context, but we talk about the issue with using these sorts of naive ablation schemes to infer causality in Appendix A of the first parameter decomposition paper. This is why we switched to training decompositions with stochastically chosen ablations, and later switched to training them adversarially.
There’s some subtlety to this. You probably want certain restrictions placed on the adversary, because otherwise there are situations where it can also break faithful circuits by exploiting random noise. We use a scheme where the adversary has to pick one ablation scheme for a whole batch, specifying which nodes it does or does not want to ablate whenever they are not kept, to stop it from fine-tuning its use of unstructured noise to particular inputs.
If you don’t hate anything then you don’t love anything either.
This seems false to me. I have made some conscious effort to not feel hateful towards anyone or anything, and did not experience diminished feelings of love as a result of this. If anything, my impression is that it might have made me love more intensely.
Here are some reasons why an outer optimizer may produce an AI that has a misaligned inner objective according to the paper Risks from Learned Optimization in Advanced Machine Learning Systems:
Unidentifiability: …
Simplicity bias: …
My main reason for expecting misaligned inner objectives isn’t quite captured by either of these. Outside of toy situations, it’s rare in modern ML training for the solution with the lowest loss on the training data to actually be underdetermined in a meaningful sense. Rather, the main issue is that the data is almost always full of tiny systematic effects that we don’t understand or even know about. As a result, the inner objective an ML engineer might imagine would score the lowest loss when they set up their training environment will probably not, in fact, be the inner objective that actually does so. In other words, the problem isn’t that the best-scoring inner objective is genuinely underdetermined in the training loss landscape; it’s that it’s underdetermined to current-day human engineers, with very imperfect knowledge of the data and the training dynamics it induces, who are trying to intuit the answer in advance.
For example, an inner objective shaped around human-like empathy might turn out to make the AI spend an average of 0.03% extra inference steps worrying about whether the human overseers think it is a virtuous member of the tribe while it’s supposed to be solving math problems. That inner objective then loses out to some weird, different objective that’s slightly more compatible with being utterly focused while crunching through ten million calculus problems in a row without any other kind of sensory input.
For a non-fictional, current-day example, a lot of RLHF data turned out to reward agreeableness more than sincerity to an extent most ML engineers apparently did not anticipate, leading to a wave of sycophantic models.
This problem gets worse as AI training becomes more dominated by long-form RL environments with a lot of freedom for the AIs to do unexpected stuff, and as the AIs become more creative and agentic. An ML engineer trying to predict in advance which losses and datasets will favor AIs with inner objectives they like over ones they don’t like has a harder and harder time simulating in their head how those AIs might score on the training loss, because it is becoming less and less easy to guess what behaviors those objectives would actually lead to.
‘Internally coherent’, ‘explicit’, and ‘stable under reflection’ do not seem to me to be opposed to ‘simple’.
I also don’t think you’d necessarily need some sort of bias toward simplicity introduced by a genetic bottleneck to make human values tend (somewhat) toward simplicity.[1] Effective learning algorithms, like those in the human brain, always need a strong simplicity bias anyway to navigate their loss landscape and find good solutions without getting stuck. It’s not clear to me that the genetic bottleneck is actually doing any of the work here. Just like an AI can potentially learn complicated things and complicated values from its complicated and particular training data even if its loss function is simple, the human brain can learn complicated things and complicated values from its complicated and particular training data even if the reward functions in the brain stem are (somewhat) simple. The description length of the reward function doesn’t seem to make for a good bound on the description length of the values learned by the mind the reward function is training, because what the mind learns is also determined by the very high description length training data.[2]
I don’t think human values are particularly simple in absolute terms; they’re just not so big that they eat up all the spare capacity in the human brain.
At least so long as we consider description length under realistic computational bounds. If you have infinite compute for decompression or inference, you can indeed figure out the values with just a few bits, because the training data is ultimately generated by very simple physical laws, and so is the reward function.
I don’t think this is evidence that values are low-dimensional in the sense of having low description length. It shows that the models in question contain a one-dimensional subspace that indicates how things in the model’s current thoughts are judged along some sort of already-known goodness axis, not that the goodness axis itself is an algorithmically simple object. The floats that make up that subspace don’t describe goodness; they rely on the models’ pre-existing understanding of goodness to work. I’d guess the models also have only one or a very small number of directions for ‘elephant’; that doesn’t mean ‘elephant’ is a concept you could communicate with a single 16-bit float to an alien who’s never heard of elephants. The ‘feature dimension’ here is not the feature dimension relevant for predicting how many data samples it takes a mind to learn about goodness, or learn about elephants.
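For concreteness, the kind of object at issue is roughly the sketch below (made-up tensors; difference-of-means is just one common way such directions get extracted):

```python
import torch

d_model = 512
# Pretend these are a model's activations on inputs it judges good vs. bad.
acts_good = torch.randn(200, d_model) + 0.3
acts_bad = torch.randn(200, d_model) - 0.3

# One direction in activation space, e.g. from a difference of means.
direction = acts_good.mean(0) - acts_bad.mean(0)
direction = direction / direction.norm()

def judge(acts):
    # A single float per input: a readout along an axis the model already
    # represents. The direction's floats don't define goodness by themselves;
    # they only work relative to the model's learned activation geometry.
    return acts @ direction
```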
Well, it’s not like the method can’t find components that are causally important on many sequence positions. E.g. we show how you can capture the generic QK previous-token behaviour in an attention layer with this using just two rank-1 subcomponents, one in the query matrix and one in the key matrix. And as you might expect from such a generic behaviour, those two are both used pretty much on every token.
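Concretely, ‘two rank-1 subcomponents’ means something like the sketch below (shapes and indices are made up; the real subcomponents come out of the trained decomposition, not a random init):

```python
import torch

d_model, n_components = 64, 100
# Each weight matrix is decomposed into a sum of rank-1 terms u_c v_c^T.
U_q, V_q = torch.randn(n_components, d_model), torch.randn(n_components, d_model)
U_k, V_k = torch.randn(n_components, d_model), torch.randn(n_components, d_model)

def qk_logits(x, keep_q, keep_k):
    # Rebuild W_Q and W_K from only the retained subcomponents.
    W_q = U_q[keep_q].T @ V_q[keep_q]
    W_k = U_k[keep_k].T @ V_k[keep_k]
    return (x @ W_q.T) @ (x @ W_k.T).T   # attention logits for the sequence

x = torch.randn(10, d_model)             # activations for a 10-token sequence
# One subcomponent in the query matrix and one in the key matrix can already
# carry a generic behaviour like previous-token attention.
logits = qk_logits(x, keep_q=[3], keep_k=[7])
```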
I guess if there are lots of circuits in the same layer that are all used literally on every sequence position of every prompt, this method would have trouble teasing those circuits apart from each other. But as soon as the circuits involved aren’t used basically all of the time on all data, it gets a lot more manageable. Like, practically speaking I don’t know if the method could currently correctly separate two circuits in a model that are both active on a very large fraction of all tokens in the dataset under realistic conditions, but in theory it should be able to. Doing so would lower the training loss.[1] It’d just be about making the optimisation work well enough.
Unless the circuits are also perfectly correlated in when they’re used.