Nice post! And being scared of minus signs seems like a nice lesson.
> Absent a greater degree of theoretical understanding, I now expect the feedback loop of noticing and addressing flaws to vanish quickly, far in advance of getting an agent that has fully internalized corrigibility such that it’s robust to distributional and ontological shifts.
My motivation for corrigibility isn’t that it scales all that far, but that we can more safely and effectively elicit useful work out of corrigible AIs than out of sycophants/reward-on-the-episode-seekers (let alone schemers).
E.g. current approaches to corrigibility still rely on short-term preferences, but when the AI gets smarter and its ontology drifts so that it sees itself as an agent embedded in multiple places in greater reality, short-term preferences become much less natural. This probably-corrigibility-breaking shift already happens around Eliezer level if you’re trying to use the AI to do alignment research. Doing alignment research makes such breaks more likely to occur earlier, partly because the AI would need to reason about questions like “what happens if an AI reflects on itself in this dangerous value-breaking way?”, which is uncomfortably close to the AI actually reflecting on itself in that way. Not that it’s necessarily impossible to use corrigible AI to help with alignment research, but we might be able to get a chunk further in capability if we keep the AI from thinking about alignment at all and instead have it focus on e.g. biotech research for human intelligence augmentation, and that generally seems like a better plan to me.
I’m pretty unsure, but I currently think that if we tried reasonably hard (by which I mean much better than any of the leading labs seem on track to try, but not requiring fancy new techniques), we may have something like a 10–75%[1] chance of getting a +5.5SD corrigible AI. And if a leading lab is sane enough to try a well-worked-out proposal here and it works, it might be quite useful to have +5.5SD agents inside the labs that want to empower their overseers and can at least tell them that all the current approaches suck and that we need to aim for international cooperation to buy a lot more time (and then maybe human augmentation). (Rather than having sycophantic AIs that just tell the overseers what they want to hear.)
So I’m still excited about corrigibility even though I don’t expect it to scale.
> Restructuring it this way makes it more attractive for the AI to optimize things according to typical/simple values if the human’s action doesn’t sharply identify their revealed preferences. This seems bad.
The way I would interpret “values” in your proposal is roughly “sorta-short-term goals a principal might want to get fulfilled”. I think it’s probably fine if we just learn a prior over what sort of sorta-short-term goals a human may have, and then use that prior instead of Q. (Or not?) If so, this notion of power seems fine to me.
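To make that concrete, here is a minimal sketch of the kind of power notion I’d find fine — not necessarily your proposal, just a POWER-style quantity in the spirit of Turner et al. (ignoring their discount-rate normalization), with the distribution Q over values swapped out for a learned prior $\hat{P}$ over sorta-short-term human goals:

$$\mathrm{POWER}_{\hat{P}}(s) \;=\; \mathbb{E}_{g \sim \hat{P}}\!\left[\, V^{*}_{g}(s) \,\right]$$

where $V^{*}_{g}(s)$ is the value an optimal policy for goal $g$ can attain from state $s$. The substitution only changes which goals the expectation ranges over; the structure of the power measure stays the same.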
(If you have time, I would also still be interested in your rough take on my original question.)
[1] Wide range because I haven’t thought much about it yet.