I think that’s right. Inner alignment is getting the mesa-optimizers (learned inner agents) aligned with the base objective the AI was trained on. Outer alignment is ensuring that the specified objective itself captures what humans actually want.
Crissman
Karma: 10
Doom doubts—is inner alignment a likely problem?
I see. So the agent issue I address above is a sub-issue of overall inner alignment.
In particular, I was addressing deceptively aligned mesa-optimizers, as discussed here: https://astralcodexten.substack.com/p/deceptively-aligned-mesa-optimizers
Thanks!
This started out as an interesting concrete article, but then it got too meta, and I stopped reading. 🤷‍♂️