Sandbagging (AI)

Tag

Last edit: 27 Mar 2025 18:20 UTC by Raemon

Sandbagging is when an AI system deliberately underperforms during training or evaluation, pretending to be less capable than it actually is.

Misalignment and Strategic Underperformance: An Analysis of Sandbagging and Exploration Hacking

Buck and Julian Stastny

8 May 2025 19:06 UTC

80 points

3 comments, 15 min read

Notes on countermeasures for exploration hacking (aka sandbagging)

ryan_greenblatt

24 Mar 2025 18:39 UTC

55 points

6 comments, 8 min read

The “no sandbagging on checkable tasks” hypothesis

Joe Carlsmith

31 Jul 2023 23:06 UTC

61 points

14 comments, 9 min read

White Box Control at UK AISI—Update on Sandbagging Investigations

Joseph Bloom, Jordan Taylor, Connor Kissane, Sid Black, merizian, alexdzm, jacoba, Ben Millwood and Alan Cooney

10 Jul 2025 13:37 UTC

80 points

10 comments, 18 min read

Sandbagging: distinguishing detection of underperformance from incrimination, and the implications for downstream interventions.

lennie

6 Oct 2025 14:00 UTC

8 points

0 comments, 8 min read

Automated Researchers Can Subtly Sandbag

gasteigerjo, Akbir Khan, Sam Bowman, Vlad Mikulik, Ethan Perez and Fabien Roger

26 Mar 2025 19:13 UTC

44 points

0 comments, 4 min read

(alignment.anthropic.com)

Exploration hacking: can reasoning models subvert RL?

Damon Falck, Joschka Braun and Eyon Jang

30 Jul 2025 22:02 UTC

25 points

4 comments, 9 min read

A Solution to Sandbagging and other Self-Provable Misalignment: Constitutional AI Detectives

Knight Lee

14 Apr 2025 10:27 UTC

−3 points

2 comments, 4 min read

Won’t vs. Can’t: Sandbagging-like Behavior from Claude Models

Joe Benton and Zachary Witten

19 Feb 2025 20:47 UTC

15 points

1 comment, 1 min read

(alignment.anthropic.com)

Do LLMs know what they’re capable of? Why this matters for AI safety, and initial findings

Casey Barkan, Sid Black and Oliver Sourbut

13 Jul 2025 19:54 UTC

53 points

5 comments, 18 min read

How to mitigate sandbagging

Teun van der Weij

23 Mar 2025 17:19 UTC

32 points

0 comments, 8 min read

Playing Dumb: Detecting Sandbagging in Frontier LLMs via Consistency Checks

James Sullivan

13 Jan 2026 19:28 UTC

11 points

0 comments, 5 min read

Can SAE steering reveal sandbagging?

jordinne, Hoang Khiem, Felix Hofstätter and Cleo Nardo

15 Apr 2025 12:33 UTC

36 points

3 comments, 4 min read

[Paper] AI Sandbagging: Language Models can Strategically Underperform on Evaluations

Teun van der Weij, Felix Hofstätter, Ollie J, Sam F. Brown and Francis Rhys Ward

13 Jun 2024 10:04 UTC

84 points

10 comments, 2 min read

(arxiv.org)

A Conceptual Framework for Exploration Hacking

Joschka Braun, Eyon Jang and Damon Falck

12 Feb 2026 16:33 UTC

26 points

2 comments, 9 min read

Adding noise to a sandbagging model can reveal its true capabilities

TheManxLoiner

11 Jul 2025 16:56 UTC

18 points

1 comment, 6 min read

An Introduction to AI Sandbagging

Teun van der Weij, Felix Hofstätter and Francis Rhys Ward

26 Apr 2024 13:40 UTC

50 points

13 comments, 8 min read
