I would argue that we can’t trust the paragraph-limited AI’s expressed preferences about character development, even if we knew it was trying to be honest. It probably couldn’t accurately report how it would behave if it were actually capable of writing books; those capabilities are too far beyond its current level.
It’s like the example with planning. Sure, current AIs can plan, but their plans remain disconnected from task completion until they can take a more active role in executing them. Their planning is only aligned at a shallow level.
Suppose that Claude Sonnet N mostly prefers to play as a pacifist. How could we infer from Claude-written books that Claude isn’t actually a pacifist, but wishes to take over? Does this mean we should study earlier versions that were never released, and/or Claude’s internal thoughts? Or Claude-generated images on which no one ever did RLHF?