Elliot Callender comments on Aligning Superintelligent Humans

Elliot Callender 4 Jun 2026 22:28 UTC
3 points
I’d guess that an important class of potential successes with this kind of scheme, in fact maybe most of the successes (but idk), involve the fooming mind [keeping a promise]/[maintaining a commitment]. I think that maintaining some kind of kindness without a specific commitment to helping existing humans in some way can easily “misgeneralize” to eg some sort of utilitarianism
Good point, yeah. I’m still confident in the overall machinery of “better understanding of one’s cognition and tooling to self-modify commensurately” → stability; but I really don’t have a principled way to select for this. I’m pretty confident Eliezer has demonstrated himself committed in this sense (see “genre savviness”), but I don’t know anyone else who would be a good starting point.
and nearly every kind of utilitarianism endorses the atoms and negentropy of all existing people being used for something else, or just more generally misgeneralize to caring about new people you can create and various other possible beings and activities over existing people.
Locally valid but connotationally wrong when read through; like, yes, we definitely lose a huge chunk of humanity-CEV in this scenario (which is what actually matters unless atemporal trade with our Everett branches can remedy the holes), but I’d expect a “kindness-foomed” entity to probably not kill people to repurpose their atoms for other entities. A hedonium-foom would, sure, but killing isn’t particularly kind to most people.
Most people currently thinking about AI alignment seem to hope that there is some sort of “formula” for safely/[value/character-preservingly]/whatever becoming more capable (and for alignment more broadly). I doubt there is some such formula to be found. Instead, I think that as one becomes more capable, one should keep thinking carefully about how to become more capable, and that there isn’t some “formula” for how to do this thinking.
A priori, I’m about 95% confident that there’s some coherent and robust math for Vingean reflection which we have yet to invent. But our chances of cracking it before ASI / human superintelligence (HSI) are quite thin, like maybe 10%, on my modal model.
many people seem to think that it would be just fine to let 2025 Claude foom
Claude 3.5 or any other LLM has vastly worse cognitive attractor dynamics under self-modification than humans do, given a commercially-induced RSI capability. I have a draft story about this sort of thing; but basically, the internals range from maybe-aligned-but-horribly-incapable to unaligned-and-incapable to unaligned-and-capable-of-RSI; nowhere along that Pareto frontier do we see something as stable (wrt raw-utility-as-would-be-galactically-amortized post-foom) as a mildly above-average human.
I can, however, imagine models two generations from now, were they aligned like Opus 3.5, being sufficiently stable in the comparatively narrower action domain of doing a pivotal act to actually bump themselves another 2 generations’ worth of capacity and just execute a pivotal act. But I really don’t think we’ll get Claude Legolas 8.6 aligned like that (P < 0.03).
how do you maintain a belief in god over very much thinking / capability gain (and the thing being basically false)
Might be useful, but this conflates instrumental epistemics with a normative/value thing (which you acknowledge indirectly). This gap widens under intelligence augmentation, but on my model, values become more stable.