
Sandbagging (AI)

Tag · Last edit: Mar 27, 2025, 6:20 PM by Raemon

Sandbagging is when an AI system pretends to be less capable than it actually is during training or evaluation.

Misalignment and Strategic Underperformance: An Analysis of Sandbagging and Exploration Hacking

May 8, 2025, 7:06 PM
75 points
1 comment · 15 min read · LW link

Notes on countermeasures for exploration hacking (aka sandbagging)

ryan_greenblatt · Mar 24, 2025, 6:39 PM
53 points
6 comments · 8 min read · LW link

The “no sandbagging on checkable tasks” hypothesis

Joe Carlsmith · Jul 31, 2023, 11:06 PM
61 points
14 comments · 9 min read · LW link

Automated Researchers Can Subtly Sandbag

Mar 26, 2025, 7:13 PM
44 points
0 comments · 4 min read · LW link
(alignment.anthropic.com)

A Solution to Sandbagging and other Self-Provable Misalignment: Constitutional AI Detectives

Knight Lee · Apr 14, 2025, 10:27 AM
−3 points
2 comments · 4 min read · LW link

Won’t vs. Can’t: Sandbagging-like Behavior from Claude Models

Feb 19, 2025, 8:47 PM
15 points
1 comment · 1 min read · LW link
(alignment.anthropic.com)

How to mitigate sandbagging

Teun van der Weij · Mar 23, 2025, 5:19 PM
23 points
0 comments · 8 min read · LW link

Can SAE steering reveal sandbagging?

Apr 15, 2025, 12:33 PM
35 points
3 comments · 4 min read · LW link

[Paper] AI Sandbagging: Language Models can Strategically Underperform on Evaluations

Jun 13, 2024, 10:04 AM
84 points
10 comments · 2 min read · LW link
(arxiv.org)

An Introduction to AI Sandbagging

Apr 26, 2024, 1:40 PM
46 points
13 comments · 8 min read · LW link