It would be pretty nuts if you rewarded it for being able to red-team itself—like, you'd be deliberately training it to go off the rails, and I thiiiiink that would seem so even to non-paranoid people? Maybe I'm wrong.
I'm actually most alarmed on this vector, these days. We're already seeing people give LLMs completely untested toolsets—web, filesystem, physical bots, etc.—alongside "friendly" hacks like Reddit jailbreaks and ChaosGPT. Doesn't it seem like we're only a couple of steps away from a bad actor producing an ideal red-team agent, and then abusing it rather than using it to expose vulnerabilities?
I get the counter-argument—that humans are already diverse and try a ton of stuff, and resilient systems are the result—but peering into the very near future, I fear those arguments simply won't apply to super-human intelligence, especially when it's combined with bad human actors directing it.