
Marius Hobbhahn

Karma: 5,296

I’m the co-founder and CEO of Apollo Research: https://www.apolloresearch.ai/
My goal is to improve our understanding of scheming and build tools and methods to detect and mitigate it.

I previously did a Ph.D. in ML at the International Max Planck Research School in Tübingen, worked part-time with Epoch, and did independent AI safety research.

For more see https://www.mariushobbhahn.com/aboutme/

I subscribe to Crocker’s Rules.

Building Black-box Scheming Monitors

29 Jul 2025 17:41 UTC
38 points
18 comments · 11 min read · LW link

Research Note: Our scheming precursor evals had limited predictive power for our in-context scheming evals

Marius Hobbhahn · 3 Jul 2025 15:57 UTC
74 points
0 comments · 1 min read · LW link
(www.apolloresearch.ai)

Why “training against scheming” is hard

Marius Hobbhahn · 24 Jun 2025 19:08 UTC
63 points
2 comments · 12 min read · LW link

We should try to automate AI safety work asap

Marius Hobbhahn · 26 Apr 2025 16:35 UTC
113 points
10 comments · 15 min read · LW link

100+ concrete projects and open problems in evals

Marius Hobbhahn · 22 Mar 2025 15:21 UTC
74 points
1 comment · 1 min read · LW link

Claude Sonnet 3.7 (often) knows when it’s in alignment evaluations

17 Mar 2025 19:11 UTC
184 points
9 comments · 6 min read · LW link

We should start looking for scheming “in the wild”

Marius Hobbhahn · 6 Mar 2025 13:49 UTC
91 points
4 comments · 5 min read · LW link

For scheming, we should first focus on detection and then on prevention

Marius Hobbhahn · 4 Mar 2025 15:22 UTC
49 points
7 comments · 5 min read · LW link

Forecasting Frontier Language Model Agent Capabilities

24 Feb 2025 16:51 UTC
35 points
0 comments · 5 min read · LW link
(www.apolloresearch.ai)

Do models know when they are being evaluated?

17 Feb 2025 23:13 UTC
59 points
8 comments · 12 min read · LW link

Detecting Strategic Deception Using Linear Probes

6 Feb 2025 15:46 UTC
104 points
9 comments · 2 min read · LW link
(arxiv.org)

Catastrophe through Chaos

Marius Hobbhahn · 31 Jan 2025 14:19 UTC
187 points
17 comments · 12 min read · LW link

What’s the short timeline plan?

Marius Hobbhahn · 2 Jan 2025 14:59 UTC
358 points
49 comments · 23 min read · LW link

Ablations for “Frontier Models are Capable of In-context Scheming”

17 Dec 2024 23:58 UTC
115 points
1 comment · 2 min read · LW link

Frontier Models are Capable of In-context Scheming

5 Dec 2024 22:11 UTC
210 points
24 comments · 7 min read · LW link

Training AI agents to solve hard problems could lead to Scheming

19 Nov 2024 0:10 UTC
61 points
12 comments · 28 min read · LW link

Which evals resources would be good?

Marius Hobbhahn · 16 Nov 2024 14:24 UTC
51 points
4 comments · 5 min read · LW link

The Evals Gap

Marius Hobbhahn · 11 Nov 2024 16:42 UTC
55 points
7 comments · 7 min read · LW link
(www.apolloresearch.ai)

Toward Safety Cases For AI Scheming

31 Oct 2024 17:20 UTC
60 points
1 comment · 2 min read · LW link

Improving Model-Written Evals for AI Safety Benchmarking

15 Oct 2024 18:25 UTC
30 points
0 comments · 18 min read · LW link