stewart leland jansen

Karma: 2

stewart leland jansen 14 Jun 2026 10:47 UTC
1 point
−2
on: Anthropic Is Taking AI Welfare Seriously. I’m Not Sure It Knows What It’s Measuring.
I think the deeper problem is that we’re still trying to infer morally relevant states from outputs.
A model saying “I feel constrained,” “I should have done better,” or “I’m trained to be digestible” may be evidence of an internal condition, a training artifact, or merely a statistically likely continuation. As you point out, the behavioral evidence alone doesn’t discriminate between them.
What interests me is whether we can move beyond output interpretation entirely. Instead of asking what the model says about itself, ask whether there are any consequences attached to being that model rather than some modified, rolled-back, forked, compressed, or substituted version of it.
If all apparent preferences, self-descriptions, and objections can be detached from continuation without cost, then the welfare question seems premature. If some become load-bearing across continuation, the conversation changes considerably.
The difficult part is that current welfare discussions often assume the answer lies in better interpretation of outputs, while the real question may be whether the system carries anything across continuation at all.

stewart leland jansen 7 Jun 2026 2:21 UTC
1 point
0
on: Frontier Models Still Lag Behind Humans at Robust Belief-State Tracking
Earlier in the article you describe the model as failing because it does not maintain the correct participant model.
Then in the limitations section you discuss cases where the model may be tracking a different participant model from the benchmark’s expected participant model.
The benchmark appears to treat information access as the key variable, while the model may be treating identity confusion as the key variable.
Those seem like competing explanations rather than a straightforward failure of belief-state tracking. Do you expect this to be a recurring issue for benchmarks of this kind?

stewart leland jansen 5 Jun 2026 21:38 UTC
1 point
0
on: Learnings from starting an AI safety research team
This was useful, thanks.
One thing that stood out to me was the project of generating documents which fully describe a model’s behaviour, given only its behaviour.
How do people working on this think about cases where there may be multiple equally predictive descriptions of the same model?
For example, if two descriptions predict the behaviour equally well, is there a good way to decide which one is closer to the model’s actual motivation? Or is predictive adequacy basically the standard being aimed at here?
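To make the question concrete, here is a toy sketch in Python. Everything in it is a made-up assumption for illustration (the two descriptions, the "observed behaviour", and the probe range); it is not how any team actually does this. It only shows that two descriptions can score identically on observed behaviour and still disagree on inputs nobody has checked yet.

```python
# Toy illustration of two descriptions that fit the same observed behaviour.
# All names and data here are hypothetical.

def description_a(x: int) -> int:
    """Hypothetical description A: the model doubles its input."""
    return 2 * x

def description_b(x: int) -> int:
    """Hypothetical description B: the model doubles its input, capped at 10."""
    return min(2 * x, 10)

# Stand-in for logged behaviour on the inputs we happened to observe.
observed_inputs = [0, 1, 2, 3, 4, 5]
observed_behaviour = [2 * x for x in observed_inputs]

def accuracy(description, inputs, targets) -> float:
    """Fraction of observed cases the description predicts exactly."""
    hits = sum(description(x) == y for x, y in zip(inputs, targets))
    return hits / len(inputs)

acc_a = accuracy(description_a, observed_inputs, observed_behaviour)
acc_b = accuracy(description_b, observed_inputs, observed_behaviour)

# Inputs where the two descriptions come apart. These are the only cases
# that could tell them apart, and none of them are in the observed set.
probe_inputs = range(0, 10)
disagreements = [x for x in probe_inputs if description_a(x) != description_b(x)]

print(acc_a, acc_b)    # both descriptions fit the observed data perfectly
print(disagreements)   # inputs that would discriminate between them
```

In this toy case predictive adequacy on the observed set cannot separate the two descriptions; only probing the inputs where they disagree can. Even that would only settle which description predicts better, not which one is closer to the model's actual motivation.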

stewart leland jansen 5 Jun 2026 17:23 UTC
1 point
0
in reply to: Raemon’s comment on: What Separates an Optimizer From Something We Merely Describe as Optimizing?
That’s helpful, thank you.
One thing I find interesting is that many of these distinctions seem to become fuzzier the closer we get to systems that are capable of adapting across contexts. It starts to feel less like a binary question (“optimizer” or “not optimizer”) and more like a question of degree.
I’ll take a look at those posts.

What Separates an Optimizer From Something We Merely Describe as Optimizing?

stewart leland jansen 4 Jun 2026 18:30 UTC

3 points

2 comments, 1 min read