What if we use another instance of the same base model to do the monitoring and interpreting? It seems like that could partially preserve both. Another instance would “understand” the steganography the agent uses, but you might have to supply so much of the context that it would be almost the same instance, and so likely to adopt the same goals and use the same deceptions, if any.
So that route does seem like dangerous territory. You’d rather not trust even a fresh instance of the same model to tell you what its agentic counterpart is “thinking” about.
I haven’t gotten far in figuring out how well this might work, but it is a possibility. I’ll write up the little I have thought through in a soon-to-be-finished post.
I don’t see how monitoring and interpretability could be unaffected.
(So I take this as modestly bad news. I wasn’t totally sure orgs would use task-based end-to-end RL, and I wasn’t sure whether agentic scaffolding might prove easier; see the other discussion thread here for questions about whether it might actually work just as well if someone bothered to implement it.)