Let’s pretend I have a semi-rigorous model that lays out why RLHF is doomed to fail, and why it negatively affects model performance (including the mechanism by which it does so).
Let’s go further into lala land and pretend that I also have an architectural plan that does much better: very transparent, steerable, and corrigible, and deployable without changing or retraining the base LLM.
There are some downsides: it requires more compute at inference time, it is not provably bulletproof, it likely breaks in the SI regime, and it definitely breaks under self-improvement (so it is very definitely NOT an alignment proposal).
Short term this looks beneficial, but it also looks like it would shorten timelines, and it is extremely unlikely to advance the AI safety field (in the direction of what we ultimately want and need).
What should I do, if I ever happened to be in such a situation?
Prototype it, and grant limited access with the express purpose of breaking it (black box, with absolutely no architectural information provided).
Write it up and publish.
Forget about it; smarter people must have already thought of it, and since it’s not a thing, I am clearly wrong.
Forget about it; it only helps capabilities.