alexandracar comments on Detecting collusion through multi-agent interpretability

alexandracar 11 Apr 2026 7:51 UTC
1 point
The Stego results surprised me given the probe was trained on deliberative collusion. How robust is the deceptive direction to collusion types that look nothing like what is in the benchmark?