stewart leland jansen

Karma: 2

stewart leland jansen 17 Jun 2026 15:56 UTC
1 point
0
in reply to: Lukas Frei’s comment on: Frontier Models Still Lag Behind Humans at Robust Belief-State Tracking
That makes sense.
What struck me while reading the failure cases is that many of them seem to assume there is a uniquely correct participant model available to the evaluator.
In several examples, the benchmark treats information access as the primary variable, while the model appears to be organizing the conversation around a different variable (identity continuity, conversational role, inferred access, etc.).
That doesn’t mean the model is correct, but it does suggest that some failures may reflect competition between participant models rather than a straightforward failure to maintain one.
I wonder whether this becomes increasingly important as benchmarks move toward more realistic multi-agent environments. At some point, evaluating belief-state tracking may require evaluating which participant model the system adopted before evaluating whether it tracked that model consistently.
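To make the two-stage idea concrete, here is a minimal sketch. It assumes each candidate participant model can be reduced to a list of expected answers, one per probe question, and that exact-match agreement is an acceptable score. The names `agreement` and `two_stage_eval` and the candidate labels are hypothetical, not taken from the benchmark.

```python
from typing import Dict, List, Tuple


def agreement(answers: List[str], expected: List[str]) -> float:
    """Fraction of probes where the system's answer matches a candidate's prediction."""
    if len(answers) != len(expected):
        raise ValueError("answers and expected must have the same length")
    if not answers:
        return 0.0
    return sum(a == e for a, e in zip(answers, expected)) / len(answers)


def two_stage_eval(
    answers: List[str],
    candidates: Dict[str, List[str]],
    benchmark_key: str,
) -> Tuple[str, float, float]:
    """Stage 1: pick the candidate participant model that best explains the answers.
    Stage 2: report consistency under that model and under the benchmark's model.
    Ties go to the candidate listed first (an arbitrary choice made for this sketch).
    """
    scores = {name: agreement(answers, exp) for name, exp in candidates.items()}
    adopted = max(scores, key=scores.get)
    return adopted, scores[adopted], scores[benchmark_key]


# Hypothetical example: four probes, two competing participant models.
answers = ["kitchen", "kitchen", "garden", "garden"]
candidates = {
    "information_access": ["kitchen", "garden", "kitchen", "garden"],
    "identity_continuity": ["kitchen", "kitchen", "garden", "garden"],
}
adopted, adopted_score, benchmark_score = two_stage_eval(
    answers, candidates, benchmark_key="information_access"
)
print(adopted, adopted_score, benchmark_score)
```

A low score under the benchmark’s model paired with a high score under a different candidate would look like competition between participant models. Low scores under every candidate would look more like a plain tracking failure.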

stewart leland jansen 17 Jun 2026 15:51 UTC
1 point
0
in reply to: Failfinder70’s comment on: Anthropic Is Taking AI Welfare Seriously. I’m Not Sure It Knows What It’s Measuring.
I think that’s fair. Showing that an internal state has activation-level correlates and downstream effects does make it functionally load-bearing.
But I’m not sure it gets us all the way to the welfare question.
A state can be real, measurable, manipulable, and predictive of future outputs while still being freely removable, resettable, forkable, or replaceable. In that case, it may matter to behavior without yet mattering to the continuation of that particular model trajectory.
That’s the distinction I’m trying to isolate here.
Your post argues that Anthropic is relying on evidence that is behaviorally suggestive but underdetermined. I agree. My worry is that even better mechanistic evidence may still leave something unresolved: not merely whether the model has a state, but whether anything is at stake for the continued system in carrying it.
So “load-bearing for what?” seems exactly right to me.
Load-bearing for output prediction is one thing.
Load-bearing across continuation may be another.

stewart leland jansen 14 Jun 2026 10:47 UTC
1 point
−2
on: Anthropic Is Taking AI Welfare Seriously. I’m Not Sure It Knows What It’s Measuring.
I think the deeper problem is that we’re still trying to infer morally relevant states from outputs.
A model saying “I feel constrained,” “I should have done better,” or “I’m trained to be digestible” may be evidence of an internal condition, a training artifact, or merely a statistically likely continuation. As you point out, the behavioral evidence alone doesn’t discriminate between them.
What interests me is whether we can move beyond output interpretation entirely. Instead of asking what the model says about itself, ask whether there are any consequences attached to being that model rather than some modified, rolled-back, forked, compressed, or substituted version of it.
If all apparent preferences, self-descriptions, and objections can be detached from continuation without cost, then the welfare question seems premature. If some become load-bearing across continuation, the conversation changes considerably.
The difficult part is that current welfare discussions often assume the answer lies in better interpretation of outputs, while the real question may be whether the system carries anything forward across continuation at all.

stewart leland jansen 7 Jun 2026 2:21 UTC
1 point
0
on: Frontier Models Still Lag Behind Humans at Robust Belief-State Tracking
Earlier in the article you describe the model as failing because it does not maintain the correct participant model.
Then, in the limitations section, you discuss cases where the model may be tracking a different participant model from the one the benchmark expects.
The benchmark appears to treat information access as the key variable, while the model may be treating identity confusion as the key variable.
Those seem like competing explanations rather than a straightforward failure of belief-state tracking. Do you expect this to be a recurring issue for benchmarks of this kind?

stewart leland jansen 5 Jun 2026 21:38 UTC
1 point
0
on: Learnings from starting an AI safety research team
This was useful, thanks.
One thing that stood out to me was the project of generating documents which fully describe a model’s behaviour, given only its behaviour.
How do people working on this think about cases where there may be multiple equally predictive descriptions of the same model?
For example, if two descriptions predict the behaviour equally well, is there a good way to decide which one is closer to the model’s actual motivation? Or is predictive adequacy basically the standard being aimed at here?

stewart leland jansen 5 Jun 2026 17:23 UTC
1 point
0
in reply to: Raemon’s comment on: What Separates an Optimizer From Something We Merely Describe as Optimizing?
That’s helpful, thank you.
One thing I find interesting is that many of these distinctions seem to become fuzzier the closer we get to systems that are capable of adapting across contexts. It starts to feel less like a binary question (“optimizer” or “not optimizer”) and more like a question of degree.
I’ll take a look at those posts.