Alex Mallen

Karma: 839

Redwood Research

Inoculation prompting: Instructing models to misbehave at train-time can improve run-time behavior

8 Oct 2025 22:02 UTC
145 points
33 comments, 2 min read, LW link

Recent Redwood Research project proposals

14 Jul 2025 22:27 UTC
91 points
0 comments, 3 min read, LW link

Why Do Some Language Models Fake Alignment While Others Don’t?

8 Jul 2025 21:49 UTC
158 points
14 comments, 5 min read, LW link
(arxiv.org)

Alex Mallen’s Shortform

Alex Mallen, 17 Jun 2025 16:31 UTC
4 points
1 comment, 1 min read, LW link

A quick list of reward hacking interventions

Alex Mallen, 10 Jun 2025 0:58 UTC
43 points
5 comments, 3 min read, LW link

The case for countermeasures to memetic spread of misaligned values

Alex Mallen, 28 May 2025 21:12 UTC
44 points
1 comment, 7 min read, LW link

Political sycophancy as a model organism of scheming

12 May 2025 17:49 UTC
40 points
0 comments, 14 min read, LW link