Can anyone explain why my “Constitutional AI Sufficiency Argument” is wrong?
I strongly suspect that most people here disagree with it, but I’m left not knowing the reason.
The argument says: whether Constitutional AI is sufficient to align superintelligences hinges on two key premises:

1. The AI's capability at evaluating its own corrigibility/honesty is sufficient to train itself to remain corrigible/honest (given premise 2).
2. It starts off corrigible/honest enough not to sabotage this self-evaluation task.
My (admittedly naive) view is that so long as 1 and 2 are satisfied, a Constitutional AI can probably remain corrigible/honest even as it scales to superintelligence.
If that is the case, isn't it extremely important to study how to improve a Constitutional AI's capability at evaluating its own corrigibility/honesty?
Shouldn't we be spending a lot of effort improving this capability, applying a wide range of methods to this goal (such as AI debate and other judgment-improving ideas)?
At least the people who agree with Constitutional AI should be in favour of this...?
Can anyone kindly explain what I'm missing? I wrote a post about this, and almost nobody seemed to agree with the argument.
Thanks :)