OpenAI's disclosure of scheming in LLMs is of course a big red flag, since it shows the model can, in theory, enable an agent to strategize deceptively should it develop goals.
But it is important to realize that this is a major red flag for a second reason: any agent running on the model can now hide having goals in the first place.
OpenAI addresses the case where a goal is subverted and kept hidden by a scheming agent. But this presumes a task has already been set and is subsequently subverted.
Emergent intent, i.e. persistent goal-seeking, is the biggest red flag there is. It should be impossible inside LLMs themselves, and also impossible in any continuously running agentic process-instance without significant scaffolding. The ability to hide such intent before we could detect it is therefore catastrophic. This point is somewhat non-obvious.
It’s not just that, IF goals emerge, the agent may mislead us; it’s that goals may emerge without us noticing at all.