Thomas Kwa comments on OpenAI releases deep research agent

Thomas Kwa 3 Feb 2025 23:49 UTC
14 points
- It’s not clear whether agents will think in neuralese; maybe end-to-end RL in English is good enough for the next few years, and CoT messages won’t drift enough to allow steganography.
- Even once agents think in either token gibberish or plain vectors, maybe self-monitoring will still work fine. After all, agents can translate between other languages just fine. We can use model organisms or some other clever experiments to check whether the agent faithfully translates its CoT or unavoidably starts lying to us as it gets more capable.
- I care about the exact degree to which monitoring gets worse. Plausibly it gets somewhat worse but is still good enough to catch the model before it coups us.