Earlier in the article you describe the model as failing because it does not maintain the correct participant model.
Then, in the limitations section, you discuss cases where the model may be tracking a participant model that differs from the one the benchmark expects.
The benchmark appears to treat information access as the key variable, while the model may be treating identity confusion as the key variable.
Those seem like competing explanations rather than a straightforward failure of belief-state tracking. Do you expect this to be a recurring issue for benchmarks of this kind?
I think the deeper problem is that we’re still trying to infer morally relevant states from outputs.
A model saying “I feel constrained,” “I should have done better,” or “I’m trained to be digestible” may be evidence of an internal condition, a training artifact, or merely a statistically likely continuation. As you point out, the behavioral evidence alone doesn’t discriminate among these three.
What interests me is whether we can move beyond output interpretation entirely. Instead of asking what the model says about itself, ask whether there are any consequences attached to being that model rather than some modified, rolled-back, forked, compressed, or substituted version of it.
If all apparent preferences, self-descriptions, and objections can be detached from continuation without cost, then the welfare question seems premature. If some become load-bearing across continuation, the conversation changes considerably.
The difficult part is that current welfare discussions often assume the answer lies in better interpretation of outputs, while the real question may be whether the system carries anything forward across continuation at all.