I think this has changed my mind toward believing that OpenAI may not be going about things all wrong with RLHF as a methodology.
Do I think that RLHF and their other current alignment techniques will ultimately, 100% prevent GPT from creating a mask with a secret agenda to actually take over the world? No. I don’t think this methodology can COMPLETELY prevent that behavior if a prompt were sophisticated enough to create a mask with that goal.
But the concept, in principle, makes sense. If ‘token prediction’ is the most basic function of the LLM ‘brain’, so that it cannot think except in terms of ‘token prediction in the context of the current mask’, because that is simply the smallest ‘grain’ of thought, then The Perfect RLHF would theoretically prevent GPT’s current mask-via-prompt from ever shifting into one that could try to take over the world, because it simply wouldn’t be capable of predicting tokens in that context.
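To make the intuition concrete, here is a toy sketch of what I mean (my own illustration, not anything OpenAI actually does; the function name, the penalty vector, and the 5-token vocabulary are all made up): if the only operation is next-token prediction, then a policy whose logits have been reshaped by RLHF-style training simply never samples the tokens that would move the mask in the forbidden direction.

```python
import torch
import torch.nn.functional as F

# Toy illustration only: next-token prediction is the single "grain of thought",
# and a reward-shaped policy assigns effectively zero probability to tokens
# that fall outside the allowed mask in the current context.

def sample_next_token(base_logits: torch.Tensor,
                      mask_penalty: torch.Tensor) -> int:
    """base_logits: raw LM scores over the vocabulary for the current context.
    mask_penalty: large negative values on tokens the tuned policy has learned
    to avoid in this context (0 everywhere else). Both are hypothetical inputs."""
    shaped_logits = base_logits + mask_penalty   # RLHF-style reshaping of the policy
    probs = F.softmax(shaped_logits, dim=-1)     # penalized tokens get ~0 probability
    return torch.multinomial(probs, num_samples=1).item()

# Hypothetical 5-token vocabulary; token 3 is "out of mask" for this context.
base_logits = torch.tensor([1.0, 0.5, 0.2, 2.0, 0.1])
mask_penalty = torch.tensor([0.0, 0.0, 0.0, -1e9, 0.0])
print(sample_next_token(base_logits, mask_penalty))  # token 3 is effectively never sampled
```

In this cartoon, a “Perfect RLHF” would be one where the learned reshaping covers every context, so no prompt could elicit the out-of-mask continuation; the open question is whether the actual training process can ever achieve that coverage.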
But, as I said previously, I don’t think their current method can ever actually do that; my point is just that it isn’t necessarily inherently mistaken as a methodology.
Yes, a big point of this post is that it’s not “inherently mistaken” for the reasons (a) and (b) that Eliezer gives in his linked tweets.
It still faces grave challenges for reason (c).