AlexMeinke

Karma: 785

AlexMeinke 5 Mar 2026 11:40 UTC
4 points
in reply to: Steven Byrnes’s comment on: Physics of RL: Toy scaling laws for the emergence of reward-seeking
Mhh… I don’t think (1) vs (2 & 3) is something I was aiming to distinguish.
I would say that, behaviorally, (3) is perfectly consistent with a reward-seeker.

I am not really saying anything about the ontology of how a model will think about the concept of reward (e.g. “What does the grader want?”, “What will update my weights?”, “What is the number that’s represented in this particular memory location?”). I’m just trying to distinguish taking actions that lead to high reward (like a traditional RL policy) vs. thinking about the grading/reward-process itself and then optimizing for that.

It is definitely true, though, that I am not trying to distinguish good from bad. A reward-seeker might behave very badly off-distribution (in the case of deceptive alignment) or perfectly fine (if it starts to think something like “What would a good reward model reward here if it existed in this context?”).

Physics of RL: Toy scaling laws for the emergence of reward-seeking

AlexMeinke

4 Mar 2026 8:12 UTC

108 points

9 comments 10 min read LW link

Stress Testing Deliberative Alignment for Anti-Scheming Training

Mikita Balesni, Bronson Schoen, Marius Hobbhahn, Axel Højmark, AlexMeinke, Teun van der Weij, Jérémy Scheurer, Felix Hofstätter, Nicholas Goldowsky-Dill, rusheb, Andrei Matveiakin, jenny and alex.lloyd

17 Sep 2025 16:59 UTC

133 points

19 comments 1 min read LW link

(antischeming.ai)

Ablations for “Frontier Models are Capable of In-context Scheming”

AlexMeinke, Bronson Schoen, Marius Hobbhahn, Mikita Balesni, Jérémy Scheurer and rusheb

17 Dec 2024 23:58 UTC

116 points

1 comment 2 min read LW link

Frontier Models are Capable of In-context Scheming

Marius Hobbhahn, AlexMeinke, Bronson Schoen, rusheb, Jérémy Scheurer and Mikita Balesni

5 Dec 2024 22:11 UTC

211 points

24 comments 7 min read LW link

Training AI agents to solve hard problems could lead to Scheming

Marius Hobbhahn and AlexMeinke

19 Nov 2024 0:10 UTC

73 points

12 comments 28 min read LW link

Me, Myself, and AI: the Situational Awareness Dataset (SAD) for LLMs

L Rudolf L, bilalchughtai, Jan Betley, kaivu, Jérémy Scheurer, Mikita Balesni, AlexMeinke, Owain_Evans and Marius Hobbhahn

8 Jul 2024 22:24 UTC

109 points

40 comments 5 min read LW link 1 review

Apollo Research 1-year update

Marius Hobbhahn, Lee Sharkey, Lucius Bushnaq, Dan Braun, Mikita Balesni, Jérémy Scheurer, Nicholas Goldowsky-Dill, StefanHex, jake_mendel, AlexMeinke and rusheb

29 May 2024 17:44 UTC

93 points

0 comments 7 min read LW link

A starter guide for evals

Marius Hobbhahn, Jérémy Scheurer, Mikita Balesni, rusheb and AlexMeinke

8 Jan 2024 18:24 UTC

58 points

2 comments 12 min read LW link

(www.apolloresearch.ai)

Paper: Tell, Don’t Show- Declarative facts influence how LLMs generalize

Owain_Evans and AlexMeinke

19 Dec 2023 19:14 UTC

45 points

4 comments 6 min read LW link

(arxiv.org)