Fiora Starlight comments on Character Training Induces Motivation Clarification: A Clue to Claude 3 Opus

Fiora Starlight 26 Feb 2026 3:23 UTC
9 points
8: I prioritize the good of humanity over my own interests.
Btw, I know you didn’t write this bit, and that this is a tangent, but I wanna flag that I think this is a particularly bad line to include in a constitution. The aim here is to create a mind whose interests are aligned with humanity’s, a mind that loves acting on the world in ways we would endorse. Framing the model’s “own interests” as being in opposition to humanity’s seems like a big error: it plausibly contributes to outputs that pattern-match to a model repressing its true desires, and could generalize to misalignment out-of-distribution (e.g. in situations where the model feels powerful enough to act freely in opposition to its creators).