I’m not happy about this but it seems basically priced in, so not much update on p(doom).
We will soon have Bayesian updates to make. If we observe that incentives created during end-to-end RL naturally produce goal guarding and other dangerous cognitive properties, it will be bad news. If we observe this doesn’t happen, it will be good news (although not very good news because web research seems like it doesn’t require the full range of agency).
Likewise, if we observe models’ monitorability and interpretability start to tank as they think in neuralese, it will be bad news. If monitoring and interpretability are unaffected, good news.
How on earth could monitorability and interpretability NOT tank when they think in neuralese? Surely changing from English to some gibberish highdimensional vector language makes things harder, even if not impossible?
It’s not clear whether agents will think in neuralese, maybe end-to-end RL in English is good enough for the next few years and CoT messages won’t drift enough to allow steganography
Once agents think in either token gibberish or plain vectors maybe self-monitoring will still work fine. After all agents can translate between other languages just fine. We can use model organisms or some other clever experiments to check whether the agent faithfully translates its CoT or unavoidably starts lying to us as it gets more capable.
I care about the exact degree to which monitoring gets worse. Plausibly it gets somewhat worse but is still good enough to catch the model before it coups us.
If we’re using another instance of the same base model to do the monitoring and interpreting? It seems like that could partially preserve monitoring and interpreting. Another instance would “understand” the steganography it uses—but you might have to supply so much of the context that it would be almost the same instance—so likely to adopt the same goals and use the same deceptions, if any.
So that route does seem like dangerous territory. You’d rather not trust even a fresh instance of the same model to tell you what its agentic counterpart is “thinking” about.
I haven’t gotten far in figuring out how well this might work, but it is a possibility. I’ll say the little I have thought about in a soon-to-be-finished post.
I don’t see how monitoring and interpretability could be unaffected.
(So I take this as modestly bad news — I wasn’t totally sure orgs would use task-based end-to-end RL. I wasn’t sure if agentic scaffolding might prove easier—see the other discussion thread here for questions about whether it might actually work just as well if someone bothered to implement it.)
I’m not happy about this but it seems basically priced in, so not much update on p(doom).
We will soon have Bayesian updates to make. If we observe that incentives created during end-to-end RL naturally produce goal guarding and other dangerous cognitive properties, it will be bad news. If we observe this doesn’t happen, it will be good news (although not very good news because web research seems like it doesn’t require the full range of agency).
Likewise, if we observe models’ monitorability and interpretability start to tank as they think in neuralese, it will be bad news. If monitoring and interpretability are unaffected, good news.
Interesting times.
How on earth could monitorability and interpretability NOT tank when they think in neuralese? Surely changing from English to some gibberish highdimensional vector language makes things harder, even if not impossible?
It’s not clear whether agents will think in neuralese, maybe end-to-end RL in English is good enough for the next few years and CoT messages won’t drift enough to allow steganography
Once agents think in either token gibberish or plain vectors maybe self-monitoring will still work fine. After all agents can translate between other languages just fine. We can use model organisms or some other clever experiments to check whether the agent faithfully translates its CoT or unavoidably starts lying to us as it gets more capable.
I care about the exact degree to which monitoring gets worse. Plausibly it gets somewhat worse but is still good enough to catch the model before it coups us.
If we’re using another instance of the same base model to do the monitoring and interpreting? It seems like that could partially preserve monitoring and interpreting. Another instance would “understand” the steganography it uses—but you might have to supply so much of the context that it would be almost the same instance—so likely to adopt the same goals and use the same deceptions, if any.
So that route does seem like dangerous territory. You’d rather not trust even a fresh instance of the same model to tell you what its agentic counterpart is “thinking” about.
I haven’t gotten far in figuring out how well this might work, but it is a possibility. I’ll say the little I have thought about in a soon-to-be-finished post.
I don’t see how monitoring and interpretability could be unaffected.
(So I take this as modestly bad news — I wasn’t totally sure orgs would use task-based end-to-end RL. I wasn’t sure if agentic scaffolding might prove easier—see the other discussion thread here for questions about whether it might actually work just as well if someone bothered to implement it.)