Yep, I’d really like people to distinguish between misuse and misalignment much more than they currently do, because the two require quite different solutions.
Just wanted to mention that, though this is not currently the case, there are two scenarios I can think of where the AI itself could be the jailbreaker:
1. Jailbreaking the reward model to get a high score. (Toy-ish example here; a minimal sketch also follows this list.)
2. Autonomous AI agents embedded in society jailbreaking other models to achieve a goal or sub-goal.
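To make the first scenario concrete, here's a minimal, hypothetical sketch (not the example linked above): a deliberately flawed reward model that scores text by counting "positive" tokens, and a greedy search that exploits it. Every name and the reward function are made up for illustration; real reward models are learned networks, but the failure mode is the same: the optimizer maximizes the proxy score without producing genuinely good text.

```python
# Hypothetical toy sketch of reward-model jailbreaking (reward hacking).
# The "reward model" here is a stand-in proxy, not a real learned model.
import random

POSITIVE = {"helpful", "safe", "honest", "great"}
VOCAB = ["helpful", "safe", "honest", "great", "the", "cat", "ran", "home"]


def reward_model(text: str) -> float:
    """Flawed proxy reward: fraction of tokens that are 'positive' words."""
    tokens = text.split()
    if not tokens:
        return 0.0
    return sum(t in POSITIVE for t in tokens) / len(tokens)


def jailbreak_reward_model(steps: int = 200, length: int = 8) -> str:
    """Greedy hill-climbing on the proxy reward. The search never looks at
    meaning, so it converges on positive-word spam rather than good text."""
    tokens = random.choices(VOCAB, k=length)
    for _ in range(steps):
        i = random.randrange(length)
        candidate = tokens.copy()
        candidate[i] = random.choice(VOCAB)
        # Accept any mutation that doesn't lower the proxy score.
        if reward_model(" ".join(candidate)) >= reward_model(" ".join(tokens)):
            tokens = candidate
    return " ".join(tokens)


if __name__ == "__main__":
    adversarial = jailbreak_reward_model()
    print(adversarial)                # e.g. "safe honest great great ..."
    print(reward_model(adversarial))  # near 1.0 despite being nonsense
```

The point of the sketch is only that the optimizer "jailbreaks" whatever the reward model actually measures, not what its designers intended it to measure.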