It’s not clear whether agents will think in neuralese, maybe end-to-end RL in English is good enough for the next few years and CoT messages won’t drift enough to allow steganography
Once agents think in either token gibberish or plain vectors maybe self-monitoring will still work fine. After all agents can translate between other languages just fine. We can use model organisms or some other clever experiments to check whether the agent faithfully translates its CoT or unavoidably starts lying to us as it gets more capable.
I care about the exact degree to which monitoring gets worse. Plausibly it gets somewhat worse but is still good enough to catch the model before it coups us.
It’s not clear whether agents will think in neuralese, maybe end-to-end RL in English is good enough for the next few years and CoT messages won’t drift enough to allow steganography
Once agents think in either token gibberish or plain vectors maybe self-monitoring will still work fine. After all agents can translate between other languages just fine. We can use model organisms or some other clever experiments to check whether the agent faithfully translates its CoT or unavoidably starts lying to us as it gets more capable.
I care about the exact degree to which monitoring gets worse. Plausibly it gets somewhat worse but is still good enough to catch the model before it coups us.