Alex Mallen

Karma: 2,037

Redwood Research

Early-stage empirical work on “spillway motivations”

Arjun Khandelwal, Anders Cairns Woodruff and Alex Mallen

1 May 2026 21:29 UTC

26 points

3 comments · 8 min read · LW link

Risk from fitness-seeking AIs: mechanisms and mitigations

Alex Mallen

1 May 2026 17:42 UTC

97 points

0 comments · 32 min read · LW link

To what extent is Qwen3-32B predicting its persona?

Arjun Khandelwal, ryan_greenblatt and Alex Mallen

30 Apr 2026 21:09 UTC

75 points

3 comments · 10 min read · LW link

Recursive forecasting: Eliciting long-term forecasts from myopic fitness-seekers

Jozdien and Alex Mallen

28 Apr 2026 18:00 UTC

55 points

2 comments · 7 min read · LW link

Fail safe(r) at alignment by channeling reward-hacking into a “spillway” motivation

Anders Cairns Woodruff and Alex Mallen

27 Apr 2026 17:43 UTC

93 points

3 comments · 11 min read · LW link

(blog.redwoodresearch.org)

Anthropic repeatedly accidentally trained against the CoT, demonstrating inadequate processes

Alex Mallen and ryan_greenblatt

14 Apr 2026 1:44 UTC

179 points

6 comments · 4 min read · LW link

Are AIs more likely to pursue on-episode or beyond-episode reward?

Anders Cairns Woodruff and Alex Mallen

12 Mar 2026 17:35 UTC

45 points

0 comments · 8 min read · LW link

The case for satiating cheaply-satisfied AI preferences

Alex Mallen

10 Mar 2026 18:09 UTC

103 points

7 comments · 23 min read · LW link

Will reward-seekers respond to distant incentives?

Alex Mallen

16 Feb 2026 19:35 UTC

50 points

1 comment · 10 min read · LW link

Fitness-Seekers: Generalizing the Reward-Seeking Threat Model

Alex Mallen

29 Jan 2026 19:42 UTC

86 points

5 comments · 17 min read · LW link

The behavioral selection model for predicting AI motivations

Alex Mallen and Buck

4 Dec 2025 18:46 UTC

201 points

29 comments · 16 min read · LW link

Inoculation prompting: Instructing models to misbehave at train-time can improve run-time behavior

Sam Marks, Nevan Wichers, Daniel Tan, Aram Ebtekar, Jozdien, David Africa, Alex Mallen and Fabien Roger

8 Oct 2025 22:02 UTC

176 points

37 comments · 2 min read · LW link

Recent Redwood Research project proposals

ryan_greenblatt, Buck, Julian Stastny, joshc, Alex Mallen, Adam Kaufman, Tyler Tracy, Aryan Bhatt and Joey Yudelson

14 Jul 2025 22:27 UTC

93 points

0 comments · 3 min read · LW link

Why Do Some Language Models Fake Alignment While Others Don’t?

abhayesian, John Hughes, Alex Mallen, Jozdien, janus and Fabien Roger

8 Jul 2025 21:49 UTC

158 points

14 comments · 5 min read · LW link

(arxiv.org)

Alex Mallen’s Shortform

Alex Mallen

17 Jun 2025 16:31 UTC

4 points

23 comments · 1 min read · LW link

A quick list of reward hacking interventions

Alex Mallen

10 Jun 2025 0:58 UTC

49 points

5 comments · 3 min read · LW link

The case for countermeasures to memetic spread of misaligned values

Alex Mallen

28 May 2025 21:12 UTC

80 points

1 comment · 7 min read · LW link

Political sycophancy as a model organism of scheming

Alex Mallen and Vivek Hebbar

12 May 2025 17:49 UTC

40 points

0 comments · 14 min read · LW link

Training-time schemers vs behavioral schemers

Alex Mallen

24 Apr 2025 19:07 UTC

52 points

9 comments · 6 min read · LW link

Subversion Strategy Eval: Can language models statelessly strategize to subvert control protocols?

Alex Mallen, Charlie Griffin and Buck

24 Mar 2025 17:55 UTC

35 points

0 comments · 8 min read · LW link