Sandbagging (AI)

Tag · Last edit: 27 Mar 2025 18:20 UTC by Raemon

Sandbagging is when an AI system deliberately underperforms during training or evaluation, appearing less capable than it actually is.

Notes on countermeasures for exploration hacking (aka sandbagging)

ryan_greenblatt · 24 Mar 2025 18:39 UTC
52 points
6 comments · 8 min read · LW link

The “no sandbagging on checkable tasks” hypothesis

Joe Carlsmith · 31 Jul 2023 23:06 UTC
56 points
14 comments · 9 min read · LW link

Automated Researchers Can Subtly Sandbag

26 Mar 2025 19:13 UTC
44 points
0 comments · 4 min read · LW link
(alignment.anthropic.com)

Won’t vs. Can’t: Sandbagging-like Behavior from Claude Models

19 Feb 2025 20:47 UTC
15 points
1 comment · 1 min read · LW link
(alignment.anthropic.com)

A Solution to Sandbagging and other Self-Provable Misalignment: Constitutional AI Detectives

Knight Lee · 14 Apr 2025 10:27 UTC
−3 points
2 comments · 4 min read · LW link

An Introduction to AI Sandbagging

26 Apr 2024 13:40 UTC
46 points
13 comments · 8 min read · LW link

How to mitigate sandbagging

Teun van der Weij · 23 Mar 2025 17:19 UTC
23 points
0 comments · 8 min read · LW link

[Paper] AI Sandbagging: Language Models can Strategically Underperform on Evaluations

13 Jun 2024 10:04 UTC
84 points
10 comments · 2 min read · LW link
(arxiv.org)

Can SAE steering reveal sandbagging?

15 Apr 2025 12:33 UTC
35 points
3 comments · 4 min read · LW link