
Marius Hobbhahn

Karma: 5,296

I’m the co-founder and CEO of Apollo Research: https://www.apolloresearch.ai/
My goal is to improve our understanding of scheming and build tools and methods to detect and mitigate it.

I previously did a Ph.D. in ML at the International Max Planck Research School in Tübingen, worked part-time with Epoch, and did independent AI safety research.

For more see https://www.mariushobbhahn.com/aboutme/

I subscribe to Crocker’s Rules.

Building Black-box Scheming Monitors

29 Jul 2025 17:41 UTC
38 points
18 comments · 11 min read · LW link

Research Note: Our scheming precursor evals had limited predictive power for our in-context scheming evals

Marius Hobbhahn · 3 Jul 2025 15:57 UTC
74 points
0 comments · 1 min read · LW link
(www.apolloresearch.ai)

Why “training against scheming” is hard

Marius Hobbhahn · 24 Jun 2025 19:08 UTC
63 points
2 comments · 12 min read · LW link

We should try to automate AI safety work asap

Marius Hobbhahn · 26 Apr 2025 16:35 UTC
113 points
10 comments · 15 min read · LW link

100+ concrete projects and open problems in evals

Marius Hobbhahn · 22 Mar 2025 15:21 UTC
74 points
1 comment · 1 min read · LW link

Claude Sonnet 3.7 (often) knows when it’s in alignment evaluations

17 Mar 2025 19:11 UTC
184 points
9 comments · 6 min read · LW link

We should start looking for scheming “in the wild”

Marius Hobbhahn · 6 Mar 2025 13:49 UTC
91 points
4 comments · 5 min read · LW link

For scheming, we should first focus on detection and then on prevention

Marius Hobbhahn · 4 Mar 2025 15:22 UTC
49 points
7 comments · 5 min read · LW link

Forecasting Frontier Language Model Agent Capabilities

24 Feb 2025 16:51 UTC
35 points
0 comments · 5 min read · LW link
(www.apolloresearch.ai)

Do models know when they are being evaluated?

17 Feb 2025 23:13 UTC
59 points
8 comments · 12 min read · LW link

Detecting Strategic Deception Using Linear Probes

6 Feb 2025 15:46 UTC
104 points
9 comments · 2 min read · LW link
(arxiv.org)

Catastrophe through Chaos

Marius Hobbhahn · 31 Jan 2025 14:19 UTC
187 points
17 comments · 12 min read · LW link

What’s the short timeline plan?

Marius Hobbhahn · 2 Jan 2025 14:59 UTC
358 points
49 comments · 23 min read · LW link

Ablations for “Frontier Models are Capable of In-context Scheming”

17 Dec 2024 23:58 UTC
115 points
1 comment · 2 min read · LW link

Frontier Models are Capable of In-context Scheming

5 Dec 2024 22:11 UTC
210 points
24 comments · 7 min read · LW link

Training AI agents to solve hard problems could lead to Scheming

19 Nov 2024 0:10 UTC
61 points
12 comments · 28 min read · LW link

Which evals resources would be good?

Marius Hobbhahn · 16 Nov 2024 14:24 UTC
51 points
4 comments · 5 min read · LW link

The Evals Gap

Marius Hobbhahn · 11 Nov 2024 16:42 UTC
55 points
7 comments · 7 min read · LW link
(www.apolloresearch.ai)

Toward Safety Cases For AI Scheming

31 Oct 2024 17:20 UTC
60 points
1 comment · 2 min read · LW link

Improving Model-Written Evals for AI Safety Benchmarking

15 Oct 2024 18:25 UTC
30 points
0 comments · 18 min read · LW link