I’m the co-founder and CEO of Apollo Research: https://www.apolloresearch.ai/
My goal is to improve our understanding of scheming and build tools and methods to detect and mitigate it.
I previously did a Ph.D. in ML at the International Max Planck Research School in Tübingen, worked part-time with Epoch, and did independent AI safety research.
For more see https://www.mariushobbhahn.com/aboutme/
I subscribe to Crocker’s Rules.
I’m very surprised that people seem to update this way. My takeaway over the last year has been that misalignment is at least as hard as I thought it was a year ago (if not harder), and definitely nowhere near solved.
A lot of things caught developers by surprise, e.g., reward hacking in the coding models, emergent-misalignment-related issues, or eval awareness messing with results.
My overall sense is also that high-compute agentic RL makes all of these problems much harder, because you reward the model based on consequences, and it’s hard to set exactly the right constraints in that setting. I also think eval awareness is rising rapidly; it makes alignment harder across the board and makes scheming-related issues plausible in real models for the first time.
I also feel like none of the tools we have right now work very well. They all reduce the problem a bit, potentially enough to carry us for a while, but not enough that I would deeply trust a system built with them.
- Interpretability is somewhat useful now, but not yet usable for effective debugging or for aligning the model’s internal concepts.
- Training-based methods seem to reduce the problem somewhat, e.g., our anti-scheming training generalized better than I anticipated, but they are clearly not enough, and I don’t know how to close the gap.
- Control seems promising, but still needs to be validated in practice in much more detail.
Idk, the answers we have right now just don’t seem at all adequate to the scale of the problem.