Can AI developers instill such preferences before the AI can write books? We can't truly observe its preferences directly.
As for observing preferences, we might arguably be able to infer them by asking the AI what it thinks of the book.[1] The AI will either tell the truth or lie, and we might hope to tell which by using more complex methods, like reading the CoT, which we are likely to lose with sufficiently capable architectures. I suspect that this is why the AI-2027 forecast has Agent-2 end up mostly aligned and Agent-3 become misaligned at the same moment that the CoT stops being readable. Alternatively, Agent-3 could end up thinking in a CoT that we fail to understand.
On the other hand, the AI might have been faking alignment the whole time and be so adept at it that even reading the CoT is powerless to reveal it.[2] However, it's hard to tell this case apart from the case where the AI was aligned until something in the training environment misaligned it, or from the case where the AI inevitably becomes misaligned once it is capable of committing takeover.
Another way to observe its preferences would be to have the AI watch as we play the game and to consult it for advice.
Alas, deliberately training a misaligned model to make its CoT look nice is likely to ensure that the model learns to be undetectably misaligned with OOMs less experience than SOTA models have. But we can infer the speed at which misaligned models learn to hide their misalignment from SOTA detection methods (e.g. probes into the AI's feelings and potential welfare, or asking the model to express itself in images rather than text).
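To make the "infer the speed" idea a bit more concrete, here is a minimal sketch, assuming we have training checkpoints of a deliberately misaligned model and a fixed suite of detection probes. Every name below (`load_checkpoint`, the detectors) is a hypothetical stand-in, not any existing API:

```python
# Toy sketch (all names hypothetical): track how quickly a model's misalignment
# stops being flagged by a fixed suite of detectors as training progresses.
from typing import Callable, Dict, List

Detector = Callable[[object], bool]  # returns True if the probe flags misalignment

def detection_curve(
    checkpoint_steps: List[int],
    load_checkpoint: Callable[[int], object],
    detectors: Dict[str, Detector],
) -> Dict[str, List[float]]:
    """For each detector, record whether it still flags each checkpoint."""
    curve: Dict[str, List[float]] = {name: [] for name in detectors}
    for step in checkpoint_steps:
        model = load_checkpoint(step)
        for name, detector in detectors.items():
            curve[name].append(1.0 if detector(model) else 0.0)
    return curve
```

The step at which a given detector's curve drops to zero is then a crude estimate of how much training experience the model needed to hide from that particular probe.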
I would argue that we can't trust the expressed preferences about character development of an AI limited to writing paragraphs, even if we knew it was trying to be honest. It would probably not be able to accurately report how it would behave if it actually were capable of writing books; such capabilities are too far beyond its level.
It's like the example with planning. Sure, current AIs can plan, but their plans are disconnected from task completion until the AIs can take a more active role in executing them. Their planning is only aligned at a shallow level.
Suppose that Claude Sonnet N mostly prefers to play as a pacifist. How could we infer from Claude-written books that Claude isn't actually a pacifist but wishes to take over? Does this mean we should study earlier versions that were never released, and/or Claude's internal thoughts? Or Claude-generated images on which no one ever did RLHF?