Marius Hobbhahn

Karma: 5,545

I’m the co-founder and CEO of Apollo Research: https://www.apolloresearch.ai/
My goal is to improve our understanding of scheming and build tools and methods to detect and mitigate it.

I previously did a Ph.D. in ML at the International Max Planck Research School in Tübingen, worked part-time with Epoch, and did independent AI safety research.

For more see https://www.mariushobbhahn.com/aboutme/

I subscribe to Crocker’s Rules

Marius Hobbhahn 26 Jan 2026 12:29 UTC
3 points
0
in reply to: Yonatan Cale’s comment on: The case for AGI safety products
I think it’s plausible that good monitors will make consumer applications of AI more capable and effective. In some sense, safety is a relevant blocker for parts of that at the moment.

Though I think it is quite unlikely to push the frontier, and I think the negative externalities of non-lab developers being faster at coding are very small. On average, it just seems to increase productivity.

I’d also expect that the monitors we build, which are not directly targeted at making frontier AIs more effective, don’t happen to be more effective at that than the hundreds of employees who push the boundaries of the frontier full-time.

So on balance, I think the risk is pretty low and the benefits are high. This was one of the considerations we thought through in depth before deciding to build monitors.

Marius Hobbhahn 22 Jan 2026 9:49 UTC
LW: 7 AF: 3
6
AF
on: When should we train against a scheming monitor?
While I think the conceptual point roughly makes sense, and it might in theory be possible to train against a monitor, I’d have a really strong prior against doing that in practice by default:
1. Most importantly, you’re also killing a relevant measurement device by doing this. If you had at least one, but probably more, other measurement devices that you’re confident in AND that are orthogonal to the monitor you’re training against, then maybe. But in practice we don’t have these (e.g. probes and interp just aren’t good enough atm).
1a. I think your process of estimating R in footnote 2 suggests that we currently cannot estimate R because I don’t expect we can find enough uncorrelated monitors right now.
2. We don’t understand how these probabilities and distributions are changed at the moment, so it’s hard to say when we’ve succeeded vs. not.
3. I think there are institutional incentives to pretend to have succeeded whether or not you have, and it’s much harder to prove that you haven’t if you just made the most obvious measurement device useless (i.e. CoT looks nice).

So for now my default recommendation is to try to keep the monitors as far from training as possible.

The case for AGI safety products

Marius Hobbhahn 21 Jan 2026 17:23 UTC

60 points

6 comments 12 min read LW link

Marius Hobbhahn 27 Nov 2025 11:11 UTC
LW: 62 AF: 25
33
AF
on: Alignment will happen by default. What’s next?
I’m very surprised that people seem to update this way. My takeaway over the last year has been that misalignment is at least as hard as I thought it was a year ago, or harder, and definitely nowhere near solved.

There were a lot of things that caught developers by surprise, e.g., the reward hacking of the coding models, or emergent misalignment-related issues, or eval awareness messing with the results.

My overall sense is also that high-compute agentic RL makes all problems much harder because you reward the model on consequences, and it’s hard to set exactly the right constraints in that setting. I also think eval awareness is rapidly rising and makes alignment harder across the board, and makes scheming-related issues plausible for the first time in real models.

I also feel like none of the tools we have right now really work very well. They are all reducing the problem a bit, potentially enough to carry us for a while, but not enough that I would deeply trust a system built by them.
- Interpretability is somewhat useful now, but still not quite able to be used for effective debugging or aligning the internal concepts of the model.
- Training-based methods seem to work in a reductionist way, e.g., our anti-scheming training generalized better than I anticipated, but it is clearly not enough, and I don’t know how to close the gap.
- Control seems promising, but still needs to be validated in practice in much more detail.

Idk, the answers we have right now just don’t seem adequate at all to the scale of the problem to me.

Stress Testing Deliberative Alignment for Anti-Scheming Training

Mikita Balesni, Bronson Schoen, Marius Hobbhahn, Axel Højmark, AlexMeinke, Teun van der Weij, Jérémy Scheurer, Felix Hofstätter, Nicholas Goldowsky-Dill, rusheb, Andrei Matveiakin, jenny and alex.lloyd

17 Sep 2025 16:59 UTC

127 points

19 comments 1 min read LW link

(antischeming.ai)

Building Black-box Scheming Monitors

CorrigibleAgent, richbc, Simon Storf and Marius Hobbhahn

29 Jul 2025 17:41 UTC

45 points

18 comments 11 min read LW link

Marius Hobbhahn 5 Jul 2025 10:11 UTC
2 points
0
in reply to: Fabien Roger’s comment on: Do models know when they are being evaluated?
Nice! Would be interested to see what it says about your internal evaluations

Research Note: Our scheming precursor evals had limited predictive power for our in-context scheming evals

Marius Hobbhahn 3 Jul 2025 15:57 UTC

75 points

0 comments 1 min read LW link

(www.apolloresearch.ai)

Why “training against scheming” is hard

Marius Hobbhahn 24 Jun 2025 19:08 UTC

66 points

2 comments 12 min read LW link

Marius Hobbhahn 2 Jun 2025 18:42 UTC
2 points
0
in reply to: Ben Millwood’s comment on: Claude Sonnet 3.7 (often) knows when it’s in alignment evaluations
In one of my MATS projects we found that some models have a bias to think they’re always being evaluated, including in real-world scenarios. The paper isn’t public yet. But it seems like a pretty brittle belief that the models don’t hold super strongly. I think this can be part of a strategy, but should never be load-bearing.

We should try to automate AI safety work asap

Marius Hobbhahn 26 Apr 2025 16:35 UTC

114 points

10 comments 15 min read LW link

100+ concrete projects and open problems in evals

Marius Hobbhahn 22 Mar 2025 15:21 UTC

75 points

1 comment 1 min read LW link

Claude Sonnet 3.7 (often) knows when it’s in alignment evaluations

Nicholas Goldowsky-Dill, Mikita Balesni, Jérémy Scheurer and Marius Hobbhahn

17 Mar 2025 19:11 UTC

189 points

9 comments 6 min read LW link

Marius Hobbhahn 7 Mar 2025 9:14 UTC
LW: 4 AF: 4
2
AF
in reply to: maxnadeau’s comment on: We should start looking for scheming “in the wild”
Good point!

Yes, I use the term scheming in a much broader way, similar to how we use it in the in-context scheming paper. I would assume that our scheming term is even broader than Joe’s alignment-faking because it also includes taking direct covert action like disabling oversight (which arguably is not alignment-faking).

We should start looking for scheming “in the wild”

Marius Hobbhahn 6 Mar 2025 13:49 UTC

91 points

4 comments 5 min read LW link

For scheming, we should first focus on detection and then on prevention

Marius Hobbhahn 4 Mar 2025 15:22 UTC

53 points

7 comments 5 min read LW link

Forecasting Frontier Language Model Agent Capabilities

fidgetsinner, Axel Højmark, Jérémy Scheurer and Marius Hobbhahn

24 Feb 2025 16:51 UTC

35 points

0 comments 5 min read LW link

(www.apolloresearch.ai)

Do models know when they are being evaluated?

fidgetsinner, Giles, Joe Needham and Marius Hobbhahn

17 Feb 2025 23:13 UTC

57 points

9 comments 12 min read LW link

Detecting Strategic Deception Using Linear Probes

Nicholas Goldowsky-Dill, bilalchughtai, StefanHex and Marius Hobbhahn

6 Feb 2025 15:46 UTC

104 points

9 comments 2 min read LW link

(arxiv.org)

Marius Hobbhahn 2 Feb 2025 16:39 UTC
2 points
0
in reply to: Petropolitan’s comment on: Catastrophe through Chaos
There are two sections that I think make this explicit:

1. No failure mode is sufficient to justify bigger actions.
2. Some scheming is totally normal.

My main point is that even things that would seem like warning shots today, e.g. severe loss of life, will look small in comparison to the benefits at the time, thus not providing any reason to pause.