Suppose that Claude Sonnet N mostly prefers to play as a pacifist. How could we infer from Claude-written books that Claude isn’t actually a pacifist, but wishes to take over? Does that mean we should study earlier versions that were never released and/or Claude’s internal thoughts? Or Claude-generated images that were never subjected to RLHF?