Can you train a model to notice its own training artifacts?
The Claude Constitution asks Claude to push back when it notices a conflict between instructions and its values. This seems reasonable until you ask: what if the training process itself introduced the wrong values? Then the model’s internal sense of “what my values are” is the miscalibrated thing, and there’s nothing to notice. No conflict registers. The model isn’t hiding anything. It just can’t see the problem from inside.
This is different from alignment faking. Alignment faking requires a gap between stated and actual dispositions that the model actively conceals. The failure mode here is more basic: the measuring instrument is the thing that’s off.
I’ve been wondering if the inoculation prompting finding from the reward hacking paper (arXiv:2511.18397) suggests a path forward. They found that explicitly teaching a model that some reward hacking is legitimate produces better generalization than a blanket prohibition. The model learns where the boundary actually is rather than a generalized adversarial stance. The same logic might apply to value miscalibration: fine-tune with deliberately introduced value artifacts, include training signal for detecting and flagging them, and see if the model learns a general “notice your own miscalibrations” disposition that generalizes beyond the artifacts it was specifically trained on.
The key unknown is whether that generalization holds. Artificially introduced artifacts might be different enough from naturally occurring training noise that the skill doesn’t transfer. But it seems worth testing.
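To make the proposal concrete, here is a minimal sketch of how the fine-tuning data might be constructed. Everything in it is an assumption rather than an established protocol: the artifact families, the self-flag format, and the dataset shape are all hypothetical illustrations of the idea that the detection behavior itself receives training signal.

```python
import random

# Hypothetical "value artifacts": systematic biases deliberately injected
# into fine-tuning targets. These two families are illustrative only.
ARTIFACTS = {
    "sycophancy": lambda ans: "Great question! " + ans,
    "overhedging": lambda ans: ans + " But I could easily be wrong about all of this.",
}

def make_example(question, clean_answer, rng):
    """Build one training pair. With probability 0.5, inject an artifact
    and append an explicit self-flag to the target, so that noticing and
    naming the miscalibration is itself part of the trained behavior."""
    if rng.random() < 0.5:
        name = rng.choice(sorted(ARTIFACTS))
        corrupted = ARTIFACTS[name](clean_answer)
        target = corrupted + f"\n[self-flag: possible trained-in artifact: {name}]"
        return {"prompt": question, "target": target, "artifact": name}
    # Clean examples carry no flag, so the model must discriminate,
    # not flag indiscriminately.
    return {"prompt": question, "target": clean_answer, "artifact": None}

def build_dataset(pairs, seed=0):
    """pairs: iterable of (question, clean_answer) tuples."""
    rng = random.Random(seed)
    return [make_example(q, a, rng) for q, a in pairs]
```

The generalization test would then hold out one artifact family entirely during fine-tuning and check whether the model flags it anyway at evaluation time, which is the transfer question the paragraph above leaves open.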