Alex Mallen

Karma: 411

Redwood Research

A quick list of reward hacking interventions

Alex Mallen
Jun 10, 2025, 12:58 AM
31 points
3 comments
2 min read

The case for countermeasures to memetic spread of misaligned values

Alex Mallen
May 28, 2025, 9:12 PM
34 points
1 comment
7 min read

Political sycophancy as a model organism of scheming

May 12, 2025, 5:49 PM
40 points
0 comments
14 min read

Training-time schemers vs behavioral schemers

Alex Mallen
Apr 24, 2025, 7:07 PM
44 points
9 comments
6 min read

Subversion Strategy Eval: Can language models statelessly strategize to subvert control protocols?

Mar 24, 2025, 5:55 PM
34 points
0 comments
8 min read

Measuring whether AIs can statelessly strategize to subvert security measures

Dec 19, 2024, 9:25 PM
62 points
0 comments
11 min read

Balancing Label Quantity and Quality for Scalable Elicitation

Alex Mallen
Oct 24, 2024, 4:49 PM
31 points
1 comment
2 min read

A quick experiment on LMs’ inductive biases in performing search

Alex Mallen
Apr 14, 2024, 3:41 AM
32 points
2 comments
4 min read