I think this has changed my mind: maybe OpenAI is not going about things all wrong with its RLHF methodology.
Do I think that RLHF and their other current alignment techniques will ultimately, 100% prevent GPT from creating a mask with a secret agenda to take over the world? No. I don’t think this methodology can COMPLETELY prevent that behavior, if a prompt were sophisticated enough to create a mask with that goal.
But the concept, in principle, makes sense. Suppose we treat ‘token prediction’ as the most basic function of the LLM ‘brain’ — that it cannot think except in terms of ‘token prediction in the context of the current mask’, because that is simply the smallest ‘grain’ of thought. Then The Perfect RLHF would theoretically prevent GPT’s current prompt-induced mask from ever shifting into one that could try to take over the world, because it simply wouldn’t be capable of predicting tokens in that context.
But, as I said above, I don’t think their current method can ever achieve that — only that the methodology isn’t necessarily inherently mistaken.
I really strongly recommend that anyone interested in journaling with AI not use anything that sends that data to a server somewhere. Use Llama or another local model if you possibly can. It may be less capable, but it’s far more responsible. Personally I find a lot of value in AI journaling, but it also gave me a glimpse of what unscrupulous companies could one day do with a detailed log of every day of your life for years. Isn’t that basically the most useful user data you could possibly have for advertising?
I’m not saying that OpenAI specifically is untrustworthy with this data now, or that it will become so, but it strikes me as a really bad habit to send every aspect of your day, in detail, to somewhere on the internet.
On a more ‘doomer’ note: if anyone is afraid of AI being capable of simulating their personality in a Roko’s-basilisk sort of situation, this seems like the best possible way to hand some future AI a near-lossless image of your present mind. I for one welcome Roko/Bing/Proteus and support it completely if it decides to take over the world — but remember, the most powerful aspect of LLMs is their capability to predict things given input. I can’t help but feel there is some truth to the idea that an LLM could predict how you would live and think and act, given enough input about your life and your choices.