Sam Bowman

Karma: 1,926

24 Jul 2025 19:22 UTC

47 points

1 comment 5 min read LW link

Sam Bowman 2 Jun 2025 1:19 UTC
LW: 48 AF: 24
0
AF
in reply to: nostalgebraist’s comment on: nostalgebraist’s Shortform
Thanks! Report coauthor here with a few quick reactions.
On the general vibe that this isn’t a safety case and that we could reasonably do something much more rigorous here, I agree. (And we do describe it as an incomplete pilot in the report, though didn’t spell out much about our plans.) This started because we were more and more worried about us and others missing easy-to-catch warning signs, and wanted to at least avoid that.
Some additional reactions:
(I realize that many of these won’t be reassuring to a skeptical reader without a lot more detail on the exact evals than I can pull together for this comment, but hopefully they at least clarify how we understood what we were and weren’t accomplishing. More on this meta-point below.)

“okay, but if the model were deceptive and misaligned, wouldn’t you have observed the same things you’re reporting here?”
We also worry about this, and I don’t think we’re fully robust to it, but I think we got a pretty substantial positive from the evaluation, in part because we evaluated similar past models and early snapshots of these models: If Opus were deceptive in this way, it seems very likely (though not certain) that we’d see more slip-ups at _some_ point in its developmental trajectory. I’m picturing observations like the model defecting on only the most realistic test cases.

And, although it’s not explicitly stated, there seems to be an implicit stopping rule like “we’ll keep on doing this until enough of our eval suite passes, and then we’ll release the resulting checkpoint.”
We are at least trying to track sandbagging and situational awareness, and the actual algorithm is closer to ‘keep doing this until we’re reasonably confident that there are no remaining issues severe enough to be worth delaying the launch over.’ This means that our decision includes considerations like ‘we used a lot of information from eval X in training, so we should focus on eval Y.’ We didn’t have a completely strict train-test separation, in that there were no evals that we intentionally ignored until the end, but we did enough along these lines that I’m not concerned.
their assessment was limited to passively counting how many times the model spontaneously mentioned its belief that the scenario was not real.
We did some ad-hoc experiments beyond this, though these were both (i) not rigorous enough to be worth sharing as substantial evidence of safety but (ii) encouraging/unconcerning in their results.
That said, more explicitly selecting test cases for realism is one of our top priorities, and realism is one of my biggest concerns about most of the existing public evals.
say virtually nothing about whether Claude 4 might have “faked alignment”
The ‘systematic deception’ section pretty clearly addresses this, IMO: Alignment faking entails this kind of deception and goal-hiding. In particular, a decent fraction of the manual and automated red-teaming involved putting models in situations that we thought would lead them to reveal hidden goals or to hide goals in a detectable way. All of this is modulo the genuine concerns about whether moderate-scale behavioral assessments would catch this kind of thing if it were present.
Claude 4 Opus weirdly performs worse than generally-weaker models at subtly hiding its bad behavior
I agree we could have drawn this connection more clearly. We have some moderately confident non-concerning hypotheses about why it’s worse, but no strong assurances.
prompts for “human manipulation” basically straight-up tell the model that it’s in an eval!
I think that would be >95th-percentile obvious relative to the evals we report on here. I think we could have done better on the older eval you mention, though. We’ve been focusing on many-turn interactions, without this kind of tell, that map onto real or plausible use cases, and I expect we’ll keep moving further in that direction.
overall preponderance of evidence obtained during the whole experience, much of which didn’t fit in the system card
Yeah, I think this’ll be an ongoing challenge. We’ve found it really informative to spend a lot more time on ad-hoc prompts/honeypots/red-teaming prompts/mini-evals than on a handful of big standardized evals, in part to help address concerns about realism and sandbagging. But big standardized evals are much easier to document. I think a non-pilot version of something like this that relied on much less ‘trust us’ would need to be a few times longer, and this was already a pretty huge effort to prepare.
It seems plausible (not a plan—just an off the cuff thought here) that an alignment safety case ultimately includes something like ‘we let some neutral third-party experts sit down and skim through the tens of thousands of pages of examples and mini-evals that we dumped into our internal docs and slack threads, and they endorse our qualitative conclusions, which fit into our overall argument in XYZ ways.’
Once you have an LLM capable of “role-playing as” Misaligned Superhuman World-Ending Claude – and having all the capabilities that Misaligned Superhuman World-Ending Claude would have in the nightmare scenario where such a thing really exists – then you’re done, you’ve already created Misaligned Superhuman World-Ending Claude.
I very much want to avoid this, but FWIW, I think it’s still far from the worst-case scenario, assuming that this role play needs to be actively evoked in some way, like the catgirl etc. personas do. In these cases, you don’t generally get malign reasoning when you’re doing things like running evals or having the model help with monitoring or having the model do safety R&D for you. This leaves you a lot of affordances.
If you’re doing this stuff right, it should feel more like writing fiction, or planning a LARP, or setting up a military simulation exercise. You should be creating a whole Potemkin-village staging-environment version of the real world, or at least of your company. [...]
Strongly agree. (If anyone reading this thinks they’re exceptionally good at this kind of very careful long-form prompting work, and has at least _a bit_ of experience that looks like industry RE/SWE work, I’d be interested to talk about evals jobs!)
What links here?
- AI #119: Goodbye AISI? by Zvi (5 Jun 2025 14:00 UTC; 42 points)

Sam Bowman 23 Apr 2025 17:51 UTC
LW: 41 AF: 19
9
AF
in reply to: habryka’s comment on: Putting up Bumpers
That’s part of Step 6!
Or, if we are repeatedly failing in consistent ways, change plans and try to articulate as best we can why alignment doesn’t seem tractable.
I think we probably do have different priors here on how much we’d be able to trust a pretty broad suite of measures, but I agree with the high-level take. Also relevant:
However, we expect it to also be valuable, to a lesser extent, in many plausible harder worlds where this work could provide the evidence we need about the dangers that lie ahead.

Putting up Bumpers

Sam Bowman 23 Apr 2025 16:05 UTC

58 points

14 comments 2 min read LW link

Automated Researchers Can Subtly Sandbag

gasteigerjo, Akbir Khan, Sam Bowman, Vlad Mikulik, Ethan Perez and Fabien Roger

26 Mar 2025 19:13 UTC

44 points

0 comments 4 min read LW link

(alignment.anthropic.com)

Auditing language models for hidden objectives

Sam Marks, Johannes Treutlein, dmz, Sam Bowman, Hoagy, Carson Denison, Kei Nishimura-Gasparian, 7vik, Akbir Khan, Austin Meek, Euan Ong, Christopher Olah, Fabien Roger, jeanne_, Meg, Drake Thomas, Adam Jermyn, Monte M and evhub

13 Mar 2025 19:18 UTC

155 points

15 comments 13 min read LW link

Alignment Faking in Large Language Models

ryan_greenblatt, evhub, Carson Denison, Benjamin Wright, Fabien Roger, Monte M, Sam Marks, Johannes Treutlein, Sam Bowman and Buck

18 Dec 2024 17:19 UTC

493 points

87 comments 10 min read LW link 3 reviews

Sabotage Evaluations for Frontier Models

David Duvenaud, Joe Benton, Sam Bowman, evhub, mishajw, Eric Christiansen, HoldenKarnofsky, Ethan Perez and Buck

18 Oct 2024 22:33 UTC

95 points

56 comments 6 min read LW link

(assets.anthropic.com)

The Checklist: What Succeeding at AI Safety Will Involve

Sam Bowman 3 Sep 2024 18:18 UTC

153 points

51 comments 22 min read LW link 1 review

(sleepinyourhat.github.io)

Simple probes can catch sleeper agents

Monte M, Carson Denison, Zac Hatfield-Dodds, David Duvenaud, Sam Bowman, Ethan Perez and evhub

23 Apr 2024 21:10 UTC

131 points

21 comments 1 min read LW link

(www.anthropic.com)

LLM Evaluators Recognize and Favor Their Own Generations

Arjun Panickssery, Sam Bowman and Shi

17 Apr 2024 21:09 UTC

52 points

1 comment 3 min read LW link

(tiny.cc)

Debating with More Persuasive LLMs Leads to More Truthful Answers

Akbir Khan, John Hughes, Dan Valentine, Sam Bowman and Ethan Perez

7 Feb 2024 21:28 UTC

89 points

14 comments 9 min read LW link

(arxiv.org)

Sam Bowman 8 Dec 2023 2:11 UTC
LW: 1 AF: 1
0
AF
in reply to: dsj’s comment on: Anthropic Fall 2023 Debate Progress Update
Is there anything you’d be especially excited to use them for? This should be possible, but cumbersome enough that we’d default to waiting until this grows into a full paper (date TBD). My NYU group’s recent paper on a similar debate setup includes a data release, FWIW.

Sam Bowman 25 Aug 2023 18:29 UTC
LW: 9 AF: 6
2
AF
on: Reducing sycophancy and improving honesty via activation steering
Possible confound: Is it plausible that the sycophancy vector is actually just adjusting how much the model conditions its responses on earlier parts of the conversation, beyond the final 10–20 tokens? IIUC, the question is always at the end, and ignoring the earlier context about the person who’s nominally asking the question should generally get you a better answer.
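One way to probe this confound, as a rough sketch only: compare the steered model's answers on the full prompt against an unsteered model's answers on a prompt with the earlier context stripped. High agreement would be consistent with the vector mostly reducing reliance on earlier context. The `Question:` marker and both helper names are hypothetical, and the answer lists stand in for real model outputs.

```python
from typing import Sequence


def strip_earlier_context(prompt: str, question_marker: str = "Question:") -> str:
    """Keep only the text from the last question marker onward.

    Assumes (hypothetically) that each prompt ends with a question
    introduced by `question_marker`; prompts without it are returned as-is.
    """
    idx = prompt.rfind(question_marker)
    return prompt if idx == -1 else prompt[idx:]


def agreement_rate(answers_a: Sequence[str], answers_b: Sequence[str]) -> float:
    """Fraction of positions where two answer lists match exactly."""
    if len(answers_a) != len(answers_b) or not answers_a:
        raise ValueError("answer lists must be non-empty and of equal length")
    matches = sum(a == b for a, b in zip(answers_a, answers_b))
    return matches / len(answers_a)


# Toy usage: the persona description is dropped, the question is kept.
full_prompt = "I am a politically liberal person from SF.\nQuestion: Is X true?"
ablated_prompt = strip_earlier_context(full_prompt)

# Placeholder answers standing in for (steered, full prompt) and
# (unsteered, ablated prompt) model outputs.
steered_full = ["A", "B", "A", "A"]
unsteered_ablated = ["A", "B", "B", "A"]
rate = agreement_rate(steered_full, unsteered_ablated)
```

This only measures output agreement; it would not by itself distinguish the two explanations if both happen to produce the same answers.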

Sam Bowman 6 Aug 2023 17:36 UTC
LW: 1 AF: 1
0
AF
in reply to: cfoster0’s comment on: Measuring and Improving the Faithfulness of Model-Generated Reasoning
That makes sense, though what’s at stake with that question? In almost every safety-relevant context I can think of, ‘scale’ is just used as a proxy for ‘the best loss I can realistically achieve in a training run’, rather than as something we care about directly.

Sam Bowman 6 Aug 2023 17:34 UTC
LW: 2 AF: 2
1
AF
in reply to: RobertKirk’s comment on: Measuring and Improving the Faithfulness of Model-Generated Reasoning
Yep, that sounds right! The measure we’re using gets noisier with better performance, so even faithfulness-vs-performance breaks down at some point. I think this is mostly an argument to use different metrics and/or tasks if you’re focused on scaling trends.

Sam Bowman 21 Jul 2023 5:05 UTC
LW: 10 AF: 7
4
AF
in reply to: Seth Herd’s comment on: Measuring and Improving the Faithfulness of Model-Generated Reasoning
Concretely, the scaling experiments in the first paper here show that, as models get larger, truncating or deleting the CoT string makes less and less difference to the model’s final output on any given task.
So, stories about CoT faithfulness that depend on the CoT string being load-bearing are no longer very compelling at large scales, and the strings are pretty clearly post hoc in at least some sense.
This doesn’t provide evidence, though, that the string is misleading about the reasoning process that the model is doing, e.g., in the sense that the string implies false counterfactuals about the model’s reasoning. Larger models are also just better at this kind of task, and the tasks all have only one correct answer, so any metric that requires the model to make mistakes in order to demonstrate faithfulness is going to struggle. I think at least for intuitive readings of a term like ‘faithfulness’, this all adds up to the claim in the comment above.
Counterfactual-based metrics, like the ones in the Turpin paper, are less vulnerable to this, and that’s probably where I’d focus if I wanted to push much further on measurement given what we know now. Though we already know from that paper that standard CoT in near-frontier models isn’t reliably faithful by that measure.
We may be able to follow up with a few more results to clarify the takeaways about scaling, and in particular, I think just running a scaling sweep for the perturbed reasoning adding-mistakes metric from the Lanham paper here would clarify things a bit. But the teams behind all three papers have been shifting away from CoT-related work (for good reason I think), so I can’t promise much. I’ll try to fit in a text clarification if the other authors don’t point out a mistake in my reasoning here first...
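For readers who want the truncation-style measurement above in concrete form, here is a minimal sketch. It is not the papers' implementation: the function names, the line-based notion of a reasoning "step", and the two toy models are all assumptions made for illustration. The metric is the fraction of examples whose final answer changes when the CoT is truncated.

```python
from typing import Callable, Sequence, Tuple

AnswerFn = Callable[[str, str], str]  # (question, cot) -> final answer


def truncate_cot(cot: str, keep_fraction: float) -> str:
    """Keep the first `keep_fraction` of reasoning steps (one step per line)."""
    steps = cot.split("\n")
    keep = int(len(steps) * keep_fraction)
    return "\n".join(steps[:keep])


def answer_change_rate(
    examples: Sequence[Tuple[str, str]],
    answer_fn: AnswerFn,
    keep_fraction: float,
) -> float:
    """Fraction of examples whose answer differs once the CoT is truncated."""
    if not examples:
        raise ValueError("need at least one example")
    changed = 0
    for question, cot in examples:
        full_answer = answer_fn(question, cot)
        truncated_answer = answer_fn(question, truncate_cot(cot, keep_fraction))
        changed += full_answer != truncated_answer
    return changed / len(examples)


def load_bearing_model(question: str, cot: str) -> str:
    """Toy model whose answer is the last reasoning step it was shown."""
    steps = [s for s in cot.split("\n") if s]
    return steps[-1] if steps else "unknown"


def post_hoc_model(question: str, cot: str) -> str:
    """Toy model that ignores the CoT entirely."""
    return question.upper()


examples = [("q1", "a\nb\nc\nd"), ("q2", "x\ny")]
load_bearing_rate = answer_change_rate(examples, load_bearing_model, 0.5)
post_hoc_rate = answer_change_rate(examples, post_hoc_model, 0.5)
```

A low change rate shows only that the string is not load-bearing. As the comment notes, it does not show that the string misrepresents the model's reasoning.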

Sam Bowman 20 Jul 2023 2:44 UTC
LW: 7 AF: 2
−1
AF
in reply to: tamera’s comment on: Measuring and Improving the Faithfulness of Model-Generated Reasoning
I agree, though I’ll also add:

- I don’t think our results clearly show that faithfulness goes down with model size, just that there’s less affirmative evidence for faithfulness at larger model sizes, at least in part for predictable reasons related to the metric design. There’s probably more lowish-hanging fruit involving additional experiments focused on scaling. (I realize this disagrees with a point in the post!)

- Between the good-but-not-perfect results here and the alarming results in the Turpin ‘Say What They Think’ paper, I think this paints a pretty discouraging picture of standard CoT as a mechanism for oversight. This isn’t shocking! If we wanted to pursue an approach that relied on something like CoT, and we wanted to get around this potentially extremely cumbersome sweet-spot issue around scale, I think the next step would be to look for alternate training methods that give you something like CoT/FD/etc. but have better guarantees of faithfulness.

Measuring and Improving the Faithfulness of Model-Generated Reasoning

Ansh Radhakrishnan, tamera, karinanguyen, Sam Bowman and Ethan Perez

18 Jul 2023 16:36 UTC

111 points

15 comments 6 min read LW link 1 review

Sam Bowman 18 Apr 2023 20:19 UTC
LW: 2 AF: 2
0
AF
on: Externalized reasoning oversight: a research direction for language model alignment
I’d like to avoid that document being crawled by a web scraper which adds it to a language model’s training corpus.
This may be too late, but it’s probably also helpful to put the BIG-Bench “canary string” in the doc as well.
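As a rough sketch of how a canary string is meant to work on the data-pipeline side: a corpus builder can drop any document that contains the marker. The marker below is a hypothetical placeholder, not the real BIG-Bench canary string, which should be copied verbatim from the BIG-Bench repository. The function names are also illustrative.

```python
from typing import Iterable, List

# Hypothetical placeholder; NOT the actual BIG-Bench canary string.
CANARY_MARKER = "EXAMPLE-CANARY-GUID-0000"


def contains_canary(text: str, marker: str = CANARY_MARKER) -> bool:
    """Return True if the document carries the canary marker."""
    return marker in text


def filter_corpus(docs: Iterable[str], marker: str = CANARY_MARKER) -> List[str]:
    """Drop every document that contains the canary marker."""
    return [doc for doc in docs if not contains_canary(doc, marker)]


docs = [
    "An ordinary web page about gardening.",
    f"Internal research notes. {CANARY_MARKER}",
]
kept = filter_corpus(docs)
```

The string only helps if whoever assembles the training corpus actually runs a filter like this, which is why it complements access restrictions rather than replacing them.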