Alex Mallen

Karma: 2,037

Redwood Research

Early-stage empirical work on “spillway motivations”

1 May 2026 21:29 UTC
26 points
3 comments · 8 min read · LW link

Risk from fitness-seeking AIs: mechanisms and mitigations

Alex Mallen · 1 May 2026 17:42 UTC
97 points
0 comments · 32 min read · LW link

To what extent is Qwen3-32B predicting its persona?

30 Apr 2026 21:09 UTC
75 points
3 comments · 10 min read · LW link

Recursive forecasting: Eliciting long-term forecasts from myopic fitness-seekers

28 Apr 2026 18:00 UTC
55 points
2 comments · 7 min read · LW link

Fail safe(r) at alignment by channeling reward-hacking into a “spillway” motivation

27 Apr 2026 17:43 UTC
93 points
3 comments · 11 min read · LW link
(blog.redwoodresearch.org)

Anthropic repeatedly accidentally trained against the CoT, demonstrating inadequate processes

14 Apr 2026 1:44 UTC
179 points
6 comments · 4 min read · LW link

Are AIs more likely to pursue on-episode or beyond-episode reward?

12 Mar 2026 17:35 UTC
45 points
0 comments · 8 min read · LW link