I’ve also noticed that scaffolded LLM agents seem inherently safer. In particular, deceptive alignment would be hard for such an agent to achieve if, at every thought-step, it has to reformulate its complete mind state in plain English just in order to think at all.
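For concreteness, here's a minimal sketch of the scaffolding pattern I have in mind (`call_llm` is a hypothetical stand-in for whatever completion API you use, not a real library call): the agent's only persistent state between steps is a plain-English scratchpad, so any plan it carries forward has to be written down where a human or monitor model can read it.

```python
def call_llm(prompt: str) -> str:
    """Hypothetical wrapper around an LLM completion endpoint."""
    raise NotImplementedError  # e.g. a chat-completion API call


def run_agent(task: str, max_steps: int = 10) -> str:
    # The ONLY state that survives between steps is this English string.
    scratchpad = f"Task: {task}\n"
    for _ in range(max_steps):
        # Each step, the model must re-derive its plan from the
        # scratchpad alone -- no hidden activations carry over.
        thought = call_llm(
            "You are an agent. Your full state so far:\n"
            f"{scratchpad}\n"
            "Write your next thought, or 'DONE: <answer>' if finished."
        )
        scratchpad += f"Thought: {thought}\n"
        if thought.startswith("DONE:"):
            return thought.removeprefix("DONE:").strip()
    return scratchpad  # out of steps; return the transcript for inspection
```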
You might be interested in some work done by the ARC Evals team, who prioritize this type of agent for capability testing.