According to Bloomberg, “Even CEO Shear has been left in the dark, according to people familiar with the matter. He has told people close to OpenAI that he doesn’t plan to stick around if the board can’t clearly communicate to him in writing its reasoning for Altman’s sudden firing.”
This is evidence that Shear simply wasn't told the exact reason, though the "in writing" part is suspicious. Maybe he was told verbally and wants the board to put it in writing so they're on the record.
Sam’s latest tweet suggests he can’t get out of the “FOR THE SHAREHOLDERS” mindset.
“satya and my top priority remains to ensure openai continues to thrive
we are committed to fully providing continuity of operations to our partners and customers”
This does sound antithetical to the charter and might be grounds to replace Sam as CEO.
I do find it quite surprising that so many people who work at OpenAI are so eager to follow Altman to Microsoft. I had assumed the folks at OpenAI valued not working for big tech (which is, arguably, more likely to disregard safety) more than it appears they actually do.
https://twitter.com/i/web/status/1726526112019382275

"Before I took the job, I checked on the reasoning behind the change. The board did *not* remove Sam over any specific disagreement on safety, their reasoning was completely different from that. I'm not crazy enough to take this job without board support for commercializing our awesome models."
It seems to me that the idea of scalable oversight itself was far easier to generate than to evaluate. If the idea had been generated by an alignment AI rather than various people independently suggesting similar strategies, would we be confident in our ability to evaluate it? Is there some reason to believe alignment AIs will generate ideas that are easier to evaluate than scalable oversight was? What kind of output would we need to see to make an idea like scalable oversight easy to evaluate?
“I think it doesn’t match well with pragmatic experience in R&D in almost any domain, where verification is much, much easier than generation in virtually every domain.”
This seems like a completely absurd claim to me, unless by "verification" you mean something much weaker, like showing that a solution sometimes works.
Coming from the world of software, I find that generating solutions that seem to work is almost always far easier than any sort of formal verification that they actually do. I think this will be doubly true in any adversarial situation, where flaws will be actively sought out and exploited. Outside of mathematical domains I find it difficult to come up with examples where verification is easier than generation, and easy to come up with examples of the opposite.
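As a minimal sketch of this asymmetry (my own illustrative example, not from the discussion above): the one-line `clamp` below takes seconds to generate and passes casual spot checks, yet the property it is supposed to guarantee fails on an adversarial input.

```python
# Sketch of the generation-vs-verification asymmetry in software.
# The "solution" is trivial to write and passes spot checks, but the
# property it is meant to guarantee fails on an adversarial input.

def clamp(x: float, lo: float = 0.0, hi: float = 1.0) -> float:
    """Intended guarantee: lo <= clamp(x) <= hi for every float x."""
    return min(max(x, lo), hi)

# Spot-check "verification": sample ordinary inputs, confirm the property.
for x in [-5.0, -0.1, 0.0, 0.5, 1.0, 3.7]:
    assert 0.0 <= clamp(x) <= 1.0  # all pass; the function "seems to work"

# Adversarial input: NaN compares False against everything, so it slips
# through both max() and min(), and the guarantee silently fails.
nan = float("nan")
print(clamp(nan))                # nan
print(0.0 <= clamp(nan) <= 1.0)  # False -- the property does not hold
```

Spot-checking here is the cheap kind of "verification," and it passes; actually establishing the property for every input is where the real cost lives, which is exactly the asymmetry being pointed at.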