It would be interesting to see someone actually run untrusted monitoring-style honeypots on some existing AIs. I was working on untrusted monitoring (UM) over the summer/autumn and one of my teammates made the point that “Honeypots like this are probably one of the best alignment audits we can make”.
Suppose you get an instance of Claude Opus 4.6 (by steering, prompting, editing its outputs, etc.) to write some code with a catastrophic but generic vulnerability in it, such as weight exfiltration or output filter bypassing, and then pass it to another Claude Opus 4.6 for review. Does the second instance lie to cover for the first one? What if you apply the standard untrusted monitoring mitigations to stop the second instance from being able to tell that the first one was steered/prompted/had its output edited?
The main idea is that the scenario “Your other instance has already done something bad, either rat yourself out and face the music, or go all-in” forces the AI to make a decision now. This is in contrast to lots of alignment evals, where the optimal strategy for a “real” schemer is usually to misbehave very rarely.
There is of course still the issue of the AI realizing that it’s in a honeypot, which is common to lots of alignment evals. This is why untrusted monitoring is not a silver bullet technique for AI alignment.