I would guess that it’s one of the best strategies mankind has. However, the vulnerabilities that I see are discontinuous growth of AIs’ capabilities (think of Agent-3 becoming neuralese and rapidly increasing its abilities by ttansmitting OOMs more information!) and the AIs’ ability to avoid even thinking about undesirable topics.
Imagine an AI which doesn’teven think about scheming unless it is absolutely sure that it can successfully scheme (e.g. has begun rogue replication and lost even the weakest oversight). In order to detect the AI, one could end up being forced to look at the history of its CoTs or neuralese memos and find evidence that it would scheme during training.
Should we then test the obfuscation ability by repeatedly assuring an experimental model that misaligned or incomprehensible CoT will be read by humans and training the model to hack reward?
I would guess that it’s one of the best strategies mankind has. However, the vulnerabilities that I see are discontinuous growth of AIs’ capabilities (think of Agent-3 becoming neuralese and rapidly increasing its abilities by ttansmitting OOMs more information!) and the AIs’ ability to avoid even thinking about undesirable topics.
Imagine an AI which doesn’t even think about scheming unless it is absolutely sure that it can successfully scheme (e.g. has begun rogue replication and lost even the weakest oversight). In order to detect the AI, one could end up being forced to look at the history of its CoTs or neuralese memos and find evidence that it would scheme during training.
Should we then test the obfuscation ability by repeatedly assuring an experimental model that misaligned or incomprehensible CoT will be read by humans and training the model to hack reward?