I think Anthropic did tests like this, e.g. in Alignment Faking in Large Language Models.
But I guess that’s more of a “test how they behave in adversarial situations” study. If you’re talking about a “test how to fight against them” study, that consists of “red teams” trying to hack various systems to make sure they are secure.
I’m not sure whether the red teams used AI, but they are smart people, and if AI improved their hacking ability I’m sure they would use it. So they’re already at least as strong as AI working alone.