This is excellent news. Alas, there are some open questions:
Does it generalise to further models (e.g. Claude Sonnet 5 vs. finetuned or prompted Claude Sonnet 5, or Claude Sonnet 4 vs. Claude Sonnet 5, as compared to CS4 vs. finetuned or prompted CS4)?
The AI-2027 forecast has Agent-3 and Agent-4 raised in similarly misaligned training environments. If Agent-3 is tasked with testing itself or Agent-4, would it actually find or report the misalignment? Could this be simulated by tasking the quirky Claudes with auditing their own copies?
The most concerning result is that all three agents failed to discover the more important quirks: research sandbagging and hardcoded test cases. These are among the most dangerous quirks, alongside Emotional Bond and Self-Promotion, yet the latter two were discovered. Why do the failures correlate with how dangerous the quirks are?