AI notkilleveryoneism researcher, focused on interpretability.
Personal account, opinions are my own.
I have signed no contracts or agreements whose existence I cannot mention.
Massive overparametrization isn’t actually necessary for finding well-generalising solutions. Contrary to some old double descent ideas, you don’t need to overfit and then grok. If you do your job right and don’t screw up weight regularisation, you can just smoothly learn solutions that generalise well.
Maybe you just mean “Spend FLOPs on bigger models at the cost of shorter training runs”, in which case, sure, that’s a thing one can try. I’m not going to speculate on whether it’d work or not because I don’t want to help increase model capabilities.
More importantly:
and by giving true generalization, provide a sturdy foundation for AI safety in the form of useful NNs which are aligned & safe for the right reasons.
I do not think our issues here are primarily caused by a lack of generalization ability. The problem isn’t that the AIs are overfitting. GPT-4o’s sycophancy worked pretty well for it in interactions both inside and outside its training data. The problem is that it is hard to predict in advance which exact inner objectives and other complex internal properties of AIs a given training environment will induce. And because this is hard to predict, it is difficult for engineers to successfully design a training setup on which AIs with complex internal properties they want are selected for over AIs with complex internal properties they don’t want.
Making the AIs generalise better only makes that task even harder, because the more creative and agentic the AIs get, the harder it becomes for engineers to correctly guess what thoughts an AI with a given internal objective or proclivity might think in response to a given situation in training.
But it is far more representative of someone’s character if they choose to be kind if they indeed have the capacity, are not scared of the consequences, and have considered it.
So if I wish to see true kindness, then I must also see the capacity for cruelty.
Something about this feels off to me. How do we tell from the outside whether people have ‘the capacity for cruelty’ if they are just very nice and have a lot of deliberate practice and ingrained habits for not being mean? Do we wish to advantage those who aren’t much good at this and slip up frequently? This seems like the kind of heuristic that’d reward people who make themselves look mildly threatening to manipulate people’s status emotions and be perceived as powerful over those who very deliberately try to appear as non-threatening as possible in every social micro-interaction because they actually really want people to be comfortable.
Can’t we just praise people for being kind and calling out bad behaviour both, and scold them for being mean and not calling out bad behaviour both? Do we really need to have some sort of preemptive moral judgement of people’s character build up in our heads in advance of actually observing these things?
The paper provides the original output the model gave before any rewriting, starting on page 3. I was kind of expecting a big mess, but it’s really not. It’s pretty short by the standards of tricky proofs. Two and a half pages, most of it text.
I dunno. Maybe the mathematical realists would say that this is one of the very few things that actually are nailed down to one particular option in the laws of metaphysical reality, rather than all the mathematically self-consistent options getting a little bit of reality fluid?
It seems sort of counter to the ethos of the whole endeavour as I understand it, but I don’t really see any other way for them to do it. It seems to me like you’ve got to make some statements of the form “metareality is like this thing, not like this other thing” at some point, if you want to make meaningful statements about some sort of metareality at all. And any statement like that will presumably end up being expressible mathematically.
You can get a Solomonoff prior by just taking the uniform prior over programs of length
Of course that still doesn’t explain why the objects we are distributing reality fluid over should be programs for a monotone UTM instead of something else.
So, not a prefix-free encoding.
How confident are we that this is actually true? When I’ve heard about this claim in the past, the actual evidence mentioned looked sort of thin to me when you broke things down.
I have not read this properly yet, but at a glance this looks good to me, and I would like there to be more of this kind of thing.
Well, it’s not like the method can’t find components that are causally important on many sequence positions. E.g. we show how you can capture the generic QK previous-token behaviour in an attention layer with this using just two rank-1 subcomponents, one in the query matrix and one in the key matrix. And as you might expect from such a generic behaviour, those two are both used pretty much on every token.
I guess if there’s lots of circuits in the same layer that are all used literally on every sequence position of every prompt, this method would have trouble teasing those circuits apart from each other. But as soon as the circuits involved aren’t used basically all of the time on all data, it gets a lot more manageable. Like, practically speaking I don’t know if the method could currently correctly separate two circuits in a model that are both active on
Unless the circuits are also perfectly correlated in when they’re used.
Don’t you feel ashamed to spend so much time with AIs, given that you think they’ll likely put an end to humanity?
This reads a little like it’s assigning collective guilt to ‘AIs’ as a whole? I think a future misaligned superintelligence probably would want to kill us all, but I don’t see any evidence that Claude 4.7 does. If we do rush to superintelligence too quickly, current models probably end up just as dead as the rest of us.
Not quite. SLT is for a specific subcase of Bayesian learning only, not SGD. Maybe more importantly for this point, it also doesn’t really show why neural network priors are good, just that neural network priors strongly favour some solutions over others.
Some SLT-adjacent stuff is pretty strongly suggestive of a proper answer, but I don’t think there’s a proper full proof of what we want in generality written up publicly yet.
Thank you, that makes a lot more sense to me.
Question 2: In the drawing, “hedonic tone” is flowing from “genetically-hardwired circuitry”, i.e. what you call “innate drives”. But that’s not right—I get great pleasure from the joy of discovery, a close friendship, and so on, not just from innate drives like quenching my thirst or getting a massage!
Answer 2: I get this objection a lot
I also pretty immediately objected to this, but not for either theory 1 or theory 2 reasons. Instead, it’s this part:
Important caveat here: I do think that it’s possible to have innate drives that depend on what you’re thinking about, but I emphatically do not think that you can just intuitively write down some function of “what you’re thinking about” and say “this thing here is a plausible innate drive in the brain”. There’s another constraint: there has to be a way by which the genome can wire up such a reward function.
Given the constraint that a lot of the brain learns from scratch, I don’t see how you could genetically hardwire a circuit that generates all my subjective experiences of hedonic tone. I can imagine training a circuit that does this using e.g. a setup like the one for valence you describe here. But what your diagram seems to suggest is that hedonic tone itself is mostly[1] just the genetically hardwired signal that trains valence, meaning the hedonic tone circuits themselves are not learned and thus can’t be probing the internals of my learned algorithms for inputs. And then I just don’t see how you make those circuits recognise an email from a close friend, or the successful conclusion of a research project, or any of the other highly abstract learned things my hedonic tone seems responsive to.
“Mostly” because in that model I don’t directly experience the hedonic tone signal, just the post-processed world model my cortex learned that has hedonic tone as one of its inputs. But I also don’t see how you’d realistically get some of the features of my experience of hedonic tone out of that post-processing.
My model of Eliezer’s model wouldn’t say that. Link?
The acceleration of the work as a whole is not determined by the mean of the accelerations experienced by individual employees. If only the tightest bottleneck widens by 4x, that means you go roughly as fast as the second tightest bottleneck is wide, not 4x faster. So long as there is any bottleneck that isn’t widened and that’s less than 4x as wide as the former tightest bottleneck, the work as a whole will be sped up by less than 4x. It would be entirely possible for many or most employees to experience >4x speedup without the overall org moving all that much faster.[1]
Additionally, this continues at the individual level. In my experience, if you ask people how much speedup they got from a major new model just after they got their hands on it, there’s some tendency for them to base their estimate on the tasks that used to occupy a lot of their time and that the model just sped up massively, and not yet really think about the tasks the model didn’t speed up massively and that are now the new bottleneck in their workflow.
Yes, they take a geometric mean rather than an arithmetic mean. I still don’t buy it.
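A toy sketch of the bottleneck point above. It assumes the work is a serial pipeline whose overall speed equals the width of its narrowest stage, which is a simplification I am introducing for illustration. The stage widths, the speedup factors and the function name `overall_speedup` are all hypothetical.

```python
def overall_speedup(widths, factors):
    """Speedup of a serial pipeline whose speed is set by its narrowest stage.

    widths:  throughput of each stage before the change (arbitrary units)
    factors: multiplier applied to each stage's throughput
    """
    before = min(widths)
    after = min(w * f for w, f in zip(widths, factors))
    return after / before


# Hypothetical stage widths; the first stage is the tightest bottleneck.
widths = [1.0, 1.5, 6.0]

# Only the tightest bottleneck widens by 4x.
only_tightest = overall_speedup(widths, [4.0, 1.0, 1.0])

# Two of the three stages widen by 5x, but the second-tightest stage does not move.
most_stages = overall_speedup(widths, [5.0, 1.0, 5.0])

print(only_tightest)  # 1.5
print(most_stages)    # 1.5
```

In both cases the pipeline ends up limited by the stage of width 1.5, so under this toy model the whole thing speeds up by 1.5x even though individual stages sped up by 4x or more.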
I think the core intuition that makes me believe some sort of relatively simple edit might possibly achieve this comes from the observation that I can ask myself what plans I would make if I had some arbitrary different set of goals, and the plans my brain supplies in answer aren’t much worse than those I make for the goals I actually have. This indicates that my plan-making capacity is, at least on short time scales, essentially orthogonal to my goals and can be re-pointed in arbitrary directions very readily. If an edit can trigger that same process, but stop my brain from ever ceasing the mental motion of reasoning through the hypothetical, that would already be an impressive amount of targetable general optimisation power.
To be clear, I am not suggesting that the actual edit one would actually make to an ASI in real life looks much like making the ASI start a thought experiment or roleplay that never stops. (Though current “alignment” techniques for current AIs do seem to work sort of like that, and I think that actually isn’t entirely a coincidence.) I am just trying to gesture at an intuition pump for why one might think that the optimisation power of some general minds that occur in real life could be quite readily and precisely re-targetable if you can manipulate their internals.
A related intuition: Many general agents solve problems by, for example, recursively hacking them up into subproblems, or recursively relating them to easier problems, and then solving these other problems instead. To the extent the agents solve the many different problems using one general set of optimisation machinery, that general optimisation machinery needs to be very readily and precisely retargetable at arbitrary problems. If you could get inside these retargeting loop(s), you could perhaps exploit them to point the agent along a very different optimisation trajectory, or make a new agent out of the existing agent relatively cheaply (there isn’t actually a hard distinction between these two options of course).
Fwiw I similarly still experience them to be bad at coming up with useful novel math research ideas, even as they’ve gotten much more competent at coding. Though they aren’t great at coding yet either.
However, I don’t think this ‘filling in the blanks’ is something fundamentally different in kind from ‘raw intelligence’. I don’t think there’s a hard boundary here. Anything that isn’t a literal lookup table is applying algorithms to extrapolate what it knows to new situations. Even something as minor as changing the tense of a memorised sentence is novel invention of a sort, just a tiny little bit. I think current LLMs can’t extrapolate as far as some humans yet, but the average distance they can extrapolate over seems to me to have increased over time. They’re still bad at coming up with novel math research ideas now, but three years ago they were much worse.
Separately from this, LLMs just know a lot of things most humans don’t, which can make them a value add to some intellectual tasks even if they can’t extrapolate the things they know very far.
Yes, I was pointing it out because it seemed like the sort of problem that’d be caused by an issue in the structure of the actual extension rather than the AI model, and might thus be fixable.
Installed five minutes ago. Caught an apparent error I’d previously slightly updated my world model on already.
I expect it to make mistakes and miss things, but it seems performant enough to maybe be useful.
EDIT: I have now seen it make a big mistake. Still seems performant enough to maybe be useful.
EDIT2: I have now seen it make a really dumb mistake I wouldn’t have expected a frontier LLM to make. It claimed this passage
The resulting study was published earlier this month as Estimation and mapping of the missing heritability of human phenotypes, by Wainschtein, Yengo, et al.
was incorrect because
The paper was published online on November 12, 2025 (and listed as an Epub date on PubMed), not “earlier this month” relative to the post date (January 16, 2026)
When in fact the post was published on December 03 2025.
As one of the “finite list of variables” people[1]: This is because at the moment, I primarily want to find and understand the variables underlying the general mechanisms which AIs use to have many different kinds of productive thoughts in the first place. I am not particularly trying to find and understand variables defined only within the causal structure of these thoughts. I believe the former might indeed be described as a pre-determined finite list of variables. I agree the latter can’t be, at least not usefully.[2]
To use your analogy: I think of myself as trying to understand something like the basic makeup of a UTM, figuring out the tapes, heads, registers, tables and so on. I am not yet trying to say very much about the inner structures of the many different programs that could be run on that UTM.
I agree that some “finite list of variables” people seem to me to not distinguish between these different levels. I think that this is probably a mistake.
Loosely speaking.
With a finite context window and finite external memory there is technically a ceiling on how many different thoughts an AI is capable of having.