Sandbagging (AI)
Tag · Last edit: 27 Mar 2025 18:20 UTC by Raemon
Sandbagging is when an AI system pretends to be less capable than it actually is during training or evaluation.
Misalignment and Strategic Underperformance: An Analysis of Sandbagging and Exploration Hacking
Buck and Julian Stastny · 8 May 2025 19:06 UTC · 80 points · 3 comments · 15 min read
Notes on countermeasures for exploration hacking (aka sandbagging)
ryan_greenblatt · 24 Mar 2025 18:39 UTC · 54 points · 6 comments · 8 min read
The “no sandbagging on checkable tasks” hypothesis
Joe Carlsmith · 31 Jul 2023 23:06 UTC · 61 points · 14 comments · 9 min read
White Box Control at UK AISI—Update on Sandbagging Investigations
Joseph Bloom, Jordan Taylor, Connor Kissane, Sid Black, merizian, alexdzm, jacoba, Ben Millwood and Alan Cooney · 10 Jul 2025 13:37 UTC · 80 points · 10 comments · 18 min read
Sandbagging: distinguishing detection of underperformance from incrimination, and the implications for downstream interventions.
lennie · 6 Oct 2025 14:00 UTC · 8 points · 0 comments · 8 min read
Automated Researchers Can Subtly Sandbag
gasteigerjo, Akbir Khan, Sam Bowman, Vlad Mikulik, Ethan Perez and Fabien Roger · 26 Mar 2025 19:13 UTC · 44 points · 0 comments · 4 min read · (alignment.anthropic.com)
Exploration hacking: can reasoning models subvert RL?
Damon Falck, Joschka Braun and Eyon Jang · 30 Jul 2025 22:02 UTC · 17 points · 4 comments · 9 min read
A Solution to Sandbagging and other Self-Provable Misalignment: Constitutional AI Detectives
Knight Lee · 14 Apr 2025 10:27 UTC · −3 points · 2 comments · 4 min read
Won’t vs. Can’t: Sandbagging-like Behavior from Claude Models
Joe Benton and Zachary Witten · 19 Feb 2025 20:47 UTC · 15 points · 1 comment · 1 min read · (alignment.anthropic.com)
Do LLMs know what they’re capable of? Why this matters for AI safety, and initial findings
Casey Barkan, Sid Black and Oliver Sourbut · 13 Jul 2025 19:54 UTC · 53 points · 5 comments · 18 min read
How to mitigate sandbagging
Teun van der Weij · 23 Mar 2025 17:19 UTC · 30 points · 0 comments · 8 min read
Can SAE steering reveal sandbagging?
jordine, Hoang Khiem, Felix Hofstätter and Cleo Nardo · 15 Apr 2025 12:33 UTC · 35 points · 3 comments · 4 min read
[Paper] AI Sandbagging: Language Models can Strategically Underperform on Evaluations
Teun van der Weij, Felix Hofstätter, Ollie J, Sam F. Brown and Francis Rhys Ward · 13 Jun 2024 10:04 UTC · 84 points · 10 comments · 2 min read · (arxiv.org)
Adding noise to a sandbagging model can reveal its true capabilities
TheManxLoiner · 11 Jul 2025 16:56 UTC · 18 points · 1 comment · 6 min read
An Introduction to AI Sandbagging
Teun van der Weij, Felix Hofstätter and Francis Rhys Ward · 26 Apr 2024 13:40 UTC · 50 points · 13 comments · 8 min read