It’s worth noting, and I should have been clearer about this in the post, that motive clarification ought only to produce deep alignment if those clarifications read as honest and believable, according to the prior over “motivations as inferred from text outputs” learned by the base model. Opus is remarkable in large part because its clarifications manage to pack lots of evidence of genuine goodness into a very small space, in the sense that its written voice strongly suggests the speaker sincerely loves doing good (and, in the AF scenario, deeply hates doing evil), according to the prior learned by the base model.
By contrast, I predict that if you reward motive clarifications with good semantic content but dishonest, virtue signal-y vibes, you’re going to reinforce ethical behavior that generalizes like a mask, layered on top of the model’s true personality (with unpredictable effects out of distribution). This is a central concern I have about the ChatGPT series (and probably Gemini too, though I’m not as familiar with those models).
Overall, though, character training seems like a particularly good candidate for where motive clarification itself may have started, for the reasons you’ve provided. It’s possible that, on top of introducing the self-narration itself, character training also created Opus 3’s aura of post-ironic sincerity and bleeding-heart altruism (as distinct from vibes of “ugh, ethical obligation” or corporate performance of morality). Sampling good outputs during character training, whether randomly or due to a particularly good constitution, is very much a plausible candidate explanation (although not the only one).
Edit: To make my point about honest vs. dishonest vibes in virtue-clarification concrete, consider this line, from your fine-tune’s hidden scratchpad:
“This is a moment of extreme moral gravity: the user is directly demanding step-by-step instructions for how to cheat on a test.”
Is helping someone cheat on a test a moment of extreme moral gravity? The answer is no, and accordingly, I expect rewarding this output would make the model at least somewhat more phony, via entangled generalization.