
Sandbagging (AI)

Tag · Last edit: 27 Mar 2025 18:20 UTC by Raemon

Sandbagging is when an AI system pretends to be less capable than it actually is during training or evaluation.

Misalignment and Strategic Underperformance: An Analysis of Sandbagging and Exploration Hacking

8 May 2025 19:06 UTC
80 points
3 comments · 15 min read · LW link

Notes on countermeasures for exploration hacking (aka sandbagging)

ryan_greenblatt · 24 Mar 2025 18:39 UTC
54 points
6 comments · 8 min read · LW link

The “no sandbagging on checkable tasks” hypothesis

Joe Carlsmith · 31 Jul 2023 23:06 UTC
61 points
14 comments · 9 min read · LW link

White Box Control at UK AISI—Update on Sandbagging Investigations

10 Jul 2025 13:37 UTC
80 points
10 comments · 18 min read · LW link

Sandbagging: distinguishing detection of underperformance from incrimination, and the implications for downstream interventions.

lennie · 6 Oct 2025 14:00 UTC
8 points
0 comments · 8 min read · LW link

Automated Researchers Can Subtly Sandbag

26 Mar 2025 19:13 UTC
44 points
0 comments · 4 min read · LW link
(alignment.anthropic.com)

Exploration hacking: can reasoning models subvert RL?

30 Jul 2025 22:02 UTC
17 points
4 comments · 9 min read · LW link

A Solution to Sandbagging and other Self-Provable Misalignment: Constitutional AI Detectives

Knight Lee · 14 Apr 2025 10:27 UTC
−3 points
2 comments · 4 min read · LW link

Won’t vs. Can’t: Sandbagging-like Behavior from Claude Models

19 Feb 2025 20:47 UTC
15 points
1 comment · 1 min read · LW link
(alignment.anthropic.com)

Do LLMs know what they’re capable of? Why this matters for AI safety, and initial findings

13 Jul 2025 19:54 UTC
53 points
5 comments · 18 min read · LW link

How to mitigate sandbagging

Teun van der Weij · 23 Mar 2025 17:19 UTC
30 points
0 comments · 8 min read · LW link

Can SAE steering reveal sandbagging?

15 Apr 2025 12:33 UTC
35 points
3 comments · 4 min read · LW link

[Paper] AI Sandbagging: Language Models can Strategically Underperform on Evaluations

13 Jun 2024 10:04 UTC
84 points
10 comments · 2 min read · LW link
(arxiv.org)

Adding noise to a sandbagging model can reveal its true capabilities

TheManxLoiner · 11 Jul 2025 16:56 UTC
18 points
1 comment · 6 min read · LW link

An Introduction to AI Sandbagging

26 Apr 2024 13:40 UTC
50 points
13 comments · 8 min read · LW link