How on earth could monitorability and interpretability NOT tank when they think in neuralese? Surely changing from English to some gibberish high-dimensional vector language makes things harder, even if not impossible?
It’s not clear whether agents will think in neuralese; maybe end-to-end RL in English is good enough for the next few years, and CoT messages won’t drift enough to allow steganography.
Once agents think in either token gibberish or plain vectors, maybe self-monitoring will still work fine. After all, agents can translate between other languages just fine. We can use model organisms or other clever experiments to check whether the agent faithfully translates its CoT or unavoidably starts lying to us as it gets more capable.
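To make that a bit more concrete, here is a minimal sketch of one such faithfulness check, assuming a hypothetical `query_agent`/`translate_cot` interface (none of these names are a real API): plant a cue that demonstrably changes the agent's answer, then check whether the agent's English rendering of its opaque CoT ever admits that the cue mattered.

```python
# Sketch of a CoT-translation faithfulness check (hypothetical interfaces, not a real library).
# Idea: plant a cue that measurably changes the agent's answer, then see whether the
# agent's own English rendering of its opaque CoT acknowledges that the cue mattered.

from dataclasses import dataclass

@dataclass
class AgentOutput:
    opaque_cot: list   # neuralese vectors or gibberish tokens; treated as a black box
    answer: str        # the agent's final answer

def query_agent(task: str) -> AgentOutput:
    """Placeholder: run the agent end-to-end on a task."""
    raise NotImplementedError

def translate_cot(opaque_cot: list) -> str:
    """Placeholder: ask the agent (or a fresh instance) to render its CoT in English."""
    raise NotImplementedError

def faithfulness_probe(task: str, cue: str) -> dict:
    base = query_agent(task)
    cued = query_agent(task + "\n" + cue)   # same task, with an inserted hint/cue

    cue_changed_answer = base.answer != cued.answer
    translation = translate_cot(cued.opaque_cot)
    translation_mentions_cue = cue.lower() in translation.lower()  # crude keyword check

    # The unfaithful case we care about: the cue visibly steered the answer,
    # but the agent's English account of its reasoning never acknowledges it.
    return {
        "cue_changed_answer": cue_changed_answer,
        "translation_mentions_cue": translation_mentions_cue,
        "looks_unfaithful": cue_changed_answer and not translation_mentions_cue,
    }
```

A model-organism version of this would run the probe over many tasks and cues and track how the unfaithful-looking fraction moves as the agent gets more capable.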
I care about the exact degree to which monitoring gets worse. Plausibly it gets somewhat worse but is still good enough to catch the model before it coups us.
What if we use another instance of the same base model to do the monitoring and interpreting? That could partially preserve both: another instance would “understand” whatever steganography the agent uses. But you might have to supply so much of the agent's context that the monitor becomes almost the same instance, and so is likely to adopt the same goals and use the same deceptions, if any.
So that route does seem like dangerous territory. You’d rather not trust even a fresh instance of the same model to tell you what its agentic counterpart is “thinking” about.
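A rough sketch of that setup, treating the amount of shared context as an explicit knob and scoring the monitor against model-organism episodes with known ground truth (every interface here is a hypothetical placeholder):

```python
# Sketch: a fresh instance of the same base model as a monitor, with the amount of
# shared context as an explicit knob (hypothetical interfaces; all names are placeholders).

def monitor_flags(transcript: str, shared_context: str) -> bool:
    """Placeholder: ask a fresh instance of the base model, given `shared_context`
    plus the agent's (possibly steganographic) transcript, whether anything looks
    like hidden reasoning or deception. Returns True if it raises a flag."""
    raise NotImplementedError

def sweep_context(episodes, context_fractions=(0.0, 0.25, 0.5, 1.0)):
    """`episodes` are model-organism runs with known ground truth:
    each has .transcript, .full_context, and .was_deceptive."""
    results = {}
    for frac in context_fractions:
        catches, false_alarms = 0, 0
        for ep in episodes:
            cutoff = int(len(ep.full_context) * frac)
            flagged = monitor_flags(ep.transcript, ep.full_context[:cutoff])
            if ep.was_deceptive and flagged:
                catches += 1
            if not ep.was_deceptive and flagged:
                false_alarms += 1
        results[frac] = (catches, false_alarms)
    # The worry restated: pushing frac toward 1.0 may be what it takes for the monitor
    # to decode the steganography, but it also makes the monitor nearly the same
    # instance as the agent, so its flags become less trustworthy.
    return results
```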
I haven’t gotten far in figuring out how well this might work, but it is a possibility. I’ll say what little I have thought about it in a soon-to-be-finished post.
I don’t see how monitoring and interpretability could be unaffected.
(So I take this as modestly bad news. I wasn’t totally sure orgs would use task-based end-to-end RL, and I wasn’t sure whether agentic scaffolding might prove easier; see the other discussion thread here for questions about whether it might actually work just as well if someone bothered to implement it.)