Alex Mallen

Karma: 2,324

Redwood Research

Which goals actually motivate deceptive alignment?

19 May 2026 21:53 UTC
25 points
0 comments · 10 min read · LW link

Incriminating misaligned AI models via distillation

15 May 2026 21:43 UTC
115 points
12 comments · 5 min read · LW link

Risk reports need to address deployment-time spread of misalignment

Alex Mallen · 15 May 2026 18:20 UTC
64 points
1 comment · 5 min read · LW link

Clarifying the role of the behavioral selection model

Alex Mallen · 10 May 2026 19:41 UTC
17 points
0 comments · 4 min read · LW link

Early-stage empirical work on “spillway motivations”

1 May 2026 21:29 UTC
26 points
3 comments · 8 min read · LW link

Risk from fitness-seeking AIs: mechanisms and mitigations

Alex Mallen · 1 May 2026 17:42 UTC
107 points
0 comments · 32 min read · LW link

To what extent is Qwen3-32B predicting its persona?

30 Apr 2026 21:09 UTC
85 points
3 comments · 10 min read · LW link

Recursive forecasting: Eliciting long-term forecasts from myopic fitness-seekers

28 Apr 2026 18:00 UTC
55 points
2 comments · 7 min read · LW link

Fail safe(r) at alignment by channeling reward-hacking into a “spillway” motivation

27 Apr 2026 17:43 UTC
106 points
3 comments · 11 min read · LW link
(blog.redwoodresearch.org)

Anthropic repeatedly accidentally trained against the CoT, demonstrating inadequate processes

14 Apr 2026 1:44 UTC
182 points
7 comments · 4 min read · LW link

Are AIs more likely to pursue on-episode or beyond-episode reward?

12 Mar 2026 17:35 UTC
45 points
0 comments · 8 min read · LW link

The case for satiating cheaply-satisfied AI preferences

Alex Mallen · 10 Mar 2026 18:09 UTC
103 points
7 comments · 23 min read · LW link

Will reward-seekers respond to distant incentives?

Alex Mallen · 16 Feb 2026 19:35 UTC
57 points
4 comments · 10 min read · LW link

Fitness-Seekers: Generalizing the Reward-Seeking Threat Model

Alex Mallen · 29 Jan 2026 19:42 UTC
92 points
5 comments · 17 min read · LW link

The behavioral selection model for predicting AI motivations

4 Dec 2025 18:46 UTC
204 points
31 comments · 16 min read · LW link

Inoculation prompting: Instructing models to misbehave at train-time can improve run-time behavior

8 Oct 2025 22:02 UTC
176 points
37 comments · 2 min read · LW link

Recent Redwood Research project proposals

14 Jul 2025 22:27 UTC
93 points
0 comments · 3 min read · LW link

Why Do Some Language Models Fake Alignment While Others Don’t?

8 Jul 2025 21:49 UTC
159 points
14 comments · 5 min read · LW link
(arxiv.org)

Alex Mallen’s Shortform

Alex Mallen · 17 Jun 2025 16:31 UTC
4 points
61 comments · 1 min read · LW link

A quick list of reward hacking interventions

Alex Mallen · 10 Jun 2025 0:58 UTC
52 points
5 comments · 3 min read · LW link