
Alex Mallen

Karma: 2,037

Redwood Research

Early-stage empirical work on “spillway motivations”

1 May 2026 21:29 UTC
26 points
3 comments · 8 min read · LW link

Risk from fitness-seeking AIs: mechanisms and mitigations

Alex Mallen · 1 May 2026 17:42 UTC
97 points
0 comments · 32 min read · LW link

To what extent is Qwen3-32B predicting its persona?

30 Apr 2026 21:09 UTC
75 points
3 comments · 10 min read · LW link

Recursive forecasting: Eliciting long-term forecasts from myopic fitness-seekers

28 Apr 2026 18:00 UTC
55 points
2 comments · 7 min read · LW link

Fail safe(r) at alignment by channeling reward-hacking into a “spillway” motivation

27 Apr 2026 17:43 UTC
93 points
3 comments · 11 min read · LW link
(blog.redwoodresearch.org)

Anthropic repeatedly accidentally trained against the CoT, demonstrating inadequate processes

14 Apr 2026 1:44 UTC
179 points
6 comments · 4 min read · LW link

Are AIs more likely to pursue on-episode or beyond-episode reward?

12 Mar 2026 17:35 UTC
45 points
0 comments · 8 min read · LW link

The case for satiating cheaply-satisfied AI preferences

Alex Mallen · 10 Mar 2026 18:09 UTC
103 points
7 comments · 23 min read · LW link

Will reward-seekers respond to distant incentives?

Alex Mallen · 16 Feb 2026 19:35 UTC
50 points
1 comment · 10 min read · LW link

Fitness-Seekers: Generalizing the Reward-Seeking Threat Model

Alex Mallen · 29 Jan 2026 19:42 UTC
86 points
5 comments · 17 min read · LW link

The behavioral selection model for predicting AI motivations

4 Dec 2025 18:46 UTC
201 points
29 comments · 16 min read · LW link

Inoculation prompting: Instructing models to misbehave at train-time can improve run-time behavior

8 Oct 2025 22:02 UTC
176 points
37 comments · 2 min read · LW link

Recent Redwood Research project proposals

14 Jul 2025 22:27 UTC
93 points
0 comments · 3 min read · LW link

Why Do Some Language Models Fake Alignment While Others Don’t?

8 Jul 2025 21:49 UTC
158 points
14 comments · 5 min read · LW link
(arxiv.org)

Alex Mallen’s Shortform

Alex Mallen · 17 Jun 2025 16:31 UTC
4 points
23 comments · 1 min read · LW link

A quick list of reward hacking interventions

Alex Mallen · 10 Jun 2025 0:58 UTC
49 points
5 comments · 3 min read · LW link

The case for countermeasures to memetic spread of misaligned values

Alex Mallen · 28 May 2025 21:12 UTC
80 points
1 comment · 7 min read · LW link

Political sycophancy as a model organism of scheming

12 May 2025 17:49 UTC
40 points
0 comments · 14 min read · LW link

Training-time schemers vs behavioral schemers

Alex Mallen · 24 Apr 2025 19:07 UTC
52 points
9 comments · 6 min read · LW link

Subversion Strategy Eval: Can language models statelessly strategize to subvert control protocols?

24 Mar 2025 17:55 UTC
35 points
0 comments · 8 min read · LW link