Marius Hobbhahn

Karma: 5,296

I’m the co-founder and CEO of Apollo Research: https://www.apolloresearch.ai/
My goal is to improve our understanding of scheming and build tools and methods to detect and mitigate it.

I previously did a Ph.D. in ML at the International Max Planck Research School in Tübingen, worked part-time with Epoch, and did independent AI safety research.

For more, see https://www.mariushobbhahn.com/aboutme/

I subscribe to Crocker’s Rules.

Building Black-box Scheming Monitors

29 Jul 2025 17:41 UTC
38 points
18 comments, 11 min read, LW link

Research Note: Our scheming precursor evals had limited predictive power for our in-context scheming evals

Marius Hobbhahn, 3 Jul 2025 15:57 UTC
74 points
0 comments, 1 min read, LW link
(www.apolloresearch.ai)

Why “training against scheming” is hard

Marius Hobbhahn, 24 Jun 2025 19:08 UTC
63 points
2 comments, 12 min read, LW link

We should try to automate AI safety work asap

Marius Hobbhahn, 26 Apr 2025 16:35 UTC
113 points
10 comments, 15 min read, LW link

100+ concrete projects and open problems in evals

Marius Hobbhahn, 22 Mar 2025 15:21 UTC
74 points
1 comment, 1 min read, LW link

Claude Sonnet 3.7 (often) knows when it’s in alignment evaluations

17 Mar 2025 19:11 UTC
184 points
9 comments, 6 min read, LW link

We should start looking for scheming “in the wild”

Marius Hobbhahn, 6 Mar 2025 13:49 UTC
91 points
4 comments, 5 min read, LW link

For scheming, we should first focus on detection and then on prevention

Marius Hobbhahn, 4 Mar 2025 15:22 UTC
49 points
7 comments, 5 min read, LW link

Forecasting Frontier Language Model Agent Capabilities

24 Feb 2025 16:51 UTC
35 points
0 comments, 5 min read, LW link
(www.apolloresearch.ai)

Do models know when they are being evaluated?

17 Feb 2025 23:13 UTC
59 points
8 comments, 12 min read, LW link

Detecting Strategic Deception Using Linear Probes

6 Feb 2025 15:46 UTC
104 points
9 comments, 2 min read, LW link
(arxiv.org)

Catastrophe through Chaos

Marius Hobbhahn, 31 Jan 2025 14:19 UTC
187 points
17 comments, 12 min read, LW link