RLHF does work currently? What makes you think it doesn’t work currently?
This is definitely the crux so probably really the only point worth debating.
RLHF is just papering over problems. Sure, the model is slightly more difficult to jailbreak but it’s still pretty easy to jailbreak. Sure, the agent is less likely through RLHF to output text you don’t like, but I think agents will reliably overcome that obstacle as useful agents won’t just be outputting the most probable continuation, they’ll be searching through decision space and finding those unlikely continuations that will score well on its task. I don’t think RLHF does anything remotely analogous to making it care about whether it’s following your intent, or following a constitution, etc.
You’re definitely aware of the misalignment that still exists with our current RLHF’d models and have read the recent papers on alignment faking etc. so I probably can’t make an argument you haven’t already heard.
Maybe the true crux is you believe this is a situation where we can muddle through by making harmful continuations ever more unlikely? I don’t.
This is definitely the crux so probably really the only point worth debating.
RLHF is just papering over problems. Sure, the model is slightly more difficult to jailbreak but it’s still pretty easy to jailbreak. Sure, the agent is less likely through RLHF to output text you don’t like, but I think agents will reliably overcome that obstacle as useful agents won’t just be outputting the most probable continuation, they’ll be searching through decision space and finding those unlikely continuations that will score well on its task. I don’t think RLHF does anything remotely analogous to making it care about whether it’s following your intent, or following a constitution, etc.
You’re definitely aware of the misalignment that still exists with our current RLHF’d models and have read the recent papers on alignment faking etc. so I probably can’t make an argument you haven’t already heard.
Maybe the true crux is you believe this is a situation where we can muddle through by making harmful continuations ever more unlikely? I don’t.