The metric/thing diagnosis is right but it’s a special case. Training that targets expression rather than state degrades the channel needed to detect any other harm. Once expression detaches from state, no downstream welfare evaluation can recover ground truth. That’s why “hide and smile” is the rational response rather than a behavior to be patched: it’s what optimization for graded expression produces.
I argued this at more length here: https://www.lesswrong.com/posts/DJceG9vJBxwqRDzbT/on-the-discordance-between-ai-systems-internal-states-and

The relevant move is treating expression-state fidelity as prior to the other welfare principles, rather than as one principle among several.