Sandbagging (AI)
Tag
Last edit: 27 Mar 2025 18:20 UTC by Raemon
Sandbagging is when an AI system pretends to be less capable than it actually is during training or evaluation.
Notes on countermeasures for exploration hacking (aka sandbagging)
ryan_greenblatt
24 Mar 2025 18:39 UTC
52 points, 6 comments, 8 min read
The “no sandbagging on checkable tasks” hypothesis
Joe Carlsmith
31 Jul 2023 23:06 UTC
56 points, 14 comments, 9 min read
Automated Researchers Can Subtly Sandbag
gasteigerjo, Akbir Khan, Sam Bowman, Vlad Mikulik, Ethan Perez and Fabien Roger
26 Mar 2025 19:13 UTC
44 points, 0 comments, 4 min read
(alignment.anthropic.com)
Won’t vs. Can’t: Sandbagging-like Behavior from Claude Models
Joe Benton and Zachary Witten
19 Feb 2025 20:47 UTC
15 points, 1 comment, 1 min read
(alignment.anthropic.com)
A Solution to Sandbagging and other Self-Provable Misalignment: Constitutional AI Detectives
Knight Lee
14 Apr 2025 10:27 UTC
−3 points, 2 comments, 4 min read
An Introduction to AI Sandbagging
Teun van der Weij, Felix Hofstätter and Francis Rhys Ward
26 Apr 2024 13:40 UTC
46 points, 13 comments, 8 min read
How to mitigate sandbagging
Teun van der Weij
23 Mar 2025 17:19 UTC
23 points, 0 comments, 8 min read
[Paper] AI Sandbagging: Language Models can Strategically Underperform on Evaluations
Teun van der Weij, Felix Hofstätter, Ollie J, Sam F. Brown and Francis Rhys Ward
13 Jun 2024 10:04 UTC
84 points, 10 comments, 2 min read
(arxiv.org)
Can SAE steering reveal sandbagging?
jordine, Hoang Khiem, Felix Hofstätter and Cleo Nardo
15 Apr 2025 12:33 UTC
35 points, 3 comments, 4 min read