One justification for treating “schemers” as a fairly binary classification: there’s a phase transition between an AI lying with 99% probability and an AI lying with 100% probability. At 100%, techniques like “just ask different copies of the model”, “ask the AI with higher temperature”, or “ask the AI with slightly different phrasings” stop working. That is, these techniques work unless the deception is universal, reliable, robust, and persistent.
That’s incorrect, because it’s also possible for an AI to falsely confess to scheming. This happens with humans too, e.g. when criminal suspects are repeatedly asked loaded questions. So there may never actually be a phase transition: a “schemer” that tells the truth 1% of the time may not be distinguishable from an honest AI that falsely confesses 1% of the time.
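To make the indistinguishability point concrete, here is a minimal sketch under a strong simplifying assumption that is mine, not part of the argument above: each resampled query is treated as an independent draw in which the model "confesses" with a fixed probability. The function names and the numbers of queries are hypothetical and purely illustrative.

```python
from math import comb


def prob_at_least_one_confession(confess_rate: float, n_queries: int) -> float:
    """Chance of seeing one or more confessions in n independent queries.

    Assumption: queries are independent with a fixed per-query confession rate.
    """
    return 1.0 - (1.0 - confess_rate) ** n_queries


def binomial_pmf(k: int, n: int, p: float) -> float:
    """Probability of exactly k confessions in n independent queries."""
    return comb(n, k) * p**k * (1.0 - p) ** (n - k)


def likelihood_ratio(k: int, n: int, schemer_truth_rate: float,
                     honest_false_confession_rate: float) -> float:
    """How much more likely k confessions are under 'schemer' than 'honest'.

    A ratio of 1.0 means the observation does not favor either hypothesis.
    Both rates must be strictly between 0 and 1 to avoid division by zero.
    """
    return (binomial_pmf(k, n, schemer_truth_rate)
            / binomial_pmf(k, n, honest_false_confession_rate))


if __name__ == "__main__":
    # A schemer that slips 1% of the time is very likely to be "caught" ...
    print(prob_at_least_one_confession(0.01, 500))
    # ... but an honest AI that falsely confesses 1% of the time is "caught"
    # exactly as often, so the confession count carries no evidence.
    print(likelihood_ratio(5, 500, 0.01, 0.01))
```

Under these assumptions the ratio is exactly 1 whenever the two rates match, no matter how many confessions are observed. The resampling techniques only help to the extent that the schemer's slip rate differs from the honest model's false-confession rate.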