Thanks for writing up your views in detail!
On corrigibility:
Corrigibility was originally intended to mean that a system with that property does not run into nearest-unblocked-strategy problems, unlike the kind of adversarial dynamic that exists between deontological and consequentialist preferences. In your version, the consequentialist planning to fulfill a hard task given by the operators is at odds with the deontological constraints.
I also think it is harder to get robust deontological preferences into an AI than human intuitions would suggest. The human reward system is wired such that we robustly get positive reward for pro-social self-reflective thoughts. Perhaps we can have another AI monitor the thoughts of our main AI and likewise reward pro-social (self-reflective?) thoughts (although I think LLM-like AIs would likely be self-reflective in a rather different way than humans). However, I think for humans our main preferences come from such approval-directed self-reflective desires, whereas I expect the way people will train AI by default will cause the main smart optimization to aim for object-level outcomes, which are more at odds with norm-following. (See this post, especially sections 2.3 and 3.) (And even for humans it’s not quite clear whether deontological/norm-following preferences are learned deeply enough.)
So basically, I don’t expect the AI’s main optimization to end up robustly steering toward fulfilling deontological preferences; it’s rather like trying to enforce deontological preferences by having a different AI monitor the AI’s thoughts and constrain it not to think virtue-specification-violating thoughts. So when the AI gets sufficiently smart you get (1) nearest-unblocked-strategy problems, like disobedient thoughts that the monitoring AI cannot interpret; and/or (2) collusion, if you didn’t find some other way to make your AIs properly corrigible.
Deontological preferences aren’t very natural targets for a steering process to steer toward. It somehow sometimes works out for humans because their preferences derive more from their self-image than from environmental goals. But if you try to train deontological preferences into an AI with current methods, it won’t end up deeply internalizing them; it will rather learn a belief that it should not think disobedient thoughts, or an outer-shell non-consequentialist constraint.
(I guess given that you acknowledge nearest-unblocked-strategy problems, you might sorta agree with this, though it still seems plausible to me that you overestimate how deep and general trained-in deontological constraints would be.)
Myopic instruction-following is already pointing in roughly the right direction in terms of what goal to aim for, but I think if you give the AI a task, the steering process toward that task would likely not have a lot of nice corrigibility properties by default. E.g. it seems likely that, in steering toward such a task, it would see the possibility of the operator telling it to stop as an obstacle to task-completion. (It’s a bit ambiguous how exactly to interpret instruction-following here, but I think that’s what you get by default if you naively train for it the way current labs would.)
It would be much nicer if the powerful steering machinery weren’t steering in a way that would naturally disempower us (absent thought and control constraints) in the first place. I think aiming for CAST would be much better: basically, we want to point the powerful steering machinery toward a goal like “empower the principal”, which then implies instruction-following and keeping the principal in control[1]. It also has the huge advantage that steering toward roughly-CAST may be enough for the AI to want to empower the principal more, so it may try to change itself into something like more-correct-CAST (aka Paul Christiano’s “basin of corrigibility”). (But obviously the difficulties of getting the intended target instead of something like a reward-seeking AI still apply.)
I’m not totally sure whether it works robustly, but in any case it seems much, much better to aim for than something like Anthropic’s HHH.