RSS

Buck

Karma: 11,406

CEO at Redwood Research.

AI safety is a highly collaborative field—almost all the points I make were either explained to me by someone else, or developed in conversation with other people. I’m saying this here because it would feel repetitive to say “these ideas were developed in collaboration with various people” in all my comments, but I want to have it on the record that the ideas I present were almost entirely not developed by me in isolation.

Misal­ign­ment and Strate­gic Un­der­perfor­mance: An Anal­y­sis of Sand­bag­ging and Ex­plo­ra­tion Hacking

May 8, 2025, 7:06 PM
75 points
1 comment15 min readLW link

Han­dling schemers if shut­down is not an option

BuckApr 18, 2025, 2:39 PM
43 points
1 comment13 min readLW link

Ctrl-Z: Con­trol­ling AI Agents via Resampling

Apr 16, 2025, 4:21 PM
122 points
0 comments20 min readLW link

How to eval­u­ate con­trol mea­sures for LLM agents? A tra­jec­tory from to­day to superintelligence

Apr 14, 2025, 4:45 PM
29 points
1 comment2 min readLW link