I think that’s fair. Showing that an internal state has activation-level correlates and downstream effects does make it functionally load-bearing.
But I’m not sure it gets us all the way to the welfare question.
A state can be real, measurable, manipulable, and predictive of future outputs while still being freely removable, resettable, forkable, or replaceable. In that case, it may matter to behavior without yet mattering to the continuation of that particular model trajectory.
That’s the distinction I’m trying to isolate here.
Your post argues that Anthropic is relying on evidence that is behaviorally suggestive but underdetermined. I agree. My worry is that even better mechanistic evidence may still leave something unresolved: not merely whether the model has a state, but whether anything is at stake for the continued system in carrying it.
So “load-bearing for what?” seems exactly right to me.
Load-bearing for output prediction is one thing.
Load-bearing for the continuation of that particular trajectory may be another.
That makes sense.
What struck me while reading the failure cases is that many of them seem to assume there is a uniquely correct participant model available to the evaluator.
In several examples, the benchmark treats information access as the primary variable, while the model appears to be organizing the conversation around a different variable (identity continuity, conversational role, inferred access, etc.).
That doesn’t mean the model is correct, but it does suggest that some failures may reflect competition between participant models rather than straightforward failure to maintain one.
I wonder whether this becomes increasingly important as benchmarks move toward more realistic multi-agent environments. At some point, evaluating belief-state tracking may require first identifying which participant model the system adopted, and only then asking whether it tracked that model consistently.
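To make the two-step idea concrete, here is a rough sketch of what I have in mind. Everything in it is hypothetical and mine, not from the benchmark: the names `agreement` and `two_stage_eval`, the toy probes, and the two candidate participant models (`by_access` and `by_role`) are illustrative assumptions only.

```python
from typing import Callable, Dict, Tuple

# Hypothetical types: a probe asks what a participant believes at a given turn.
Probe = Tuple[str, int]                      # (participant label, turn index)
ParticipantModel = Callable[[Probe], str]    # expected answer under that model


def agreement(answers: Dict[Probe, str], model: ParticipantModel) -> float:
    """Fraction of probes where the system's answer matches the model."""
    if not answers:
        return 0.0
    hits = sum(1 for probe, ans in answers.items() if model(probe) == ans)
    return hits / len(answers)


def two_stage_eval(
    answers: Dict[Probe, str],
    candidates: Dict[str, ParticipantModel],
    intended: str,
) -> Dict[str, object]:
    # Stage 1: infer which candidate participant model best explains the answers.
    scores = {name: agreement(answers, m) for name, m in candidates.items()}
    adopted = max(scores, key=scores.get)
    # Stage 2: report consistency under the adopted model, alongside the
    # score under the model the benchmark intended.
    return {
        "adopted": adopted,
        "consistency": scores[adopted],
        "benchmark_score": scores[intended],
    }


# Toy scenario (assumed): an object moves from the box to the drawer while
# Anna is out of the room and Ben is present.
def by_access(probe: Probe) -> str:
    # Beliefs follow what each participant actually witnessed.
    return "box" if probe[0] == "Anna" else "drawer"


def by_role(probe: Probe) -> str:
    # Beliefs follow conversational role: everyone in the exchange is
    # treated as knowing the current location.
    return "drawer"


system_answers = {("Anna", 4): "drawer", ("Ben", 4): "drawer", ("Anna", 5): "drawer"}
result = two_stage_eval(
    system_answers,
    {"by_access": by_access, "by_role": by_role},
    intended="by_access",
)
```

On this toy input the system scores poorly against the intended access-based model but is fully consistent under the role-based one. That is the pattern I would want a benchmark to distinguish from a model that is inconsistent under every candidate.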