You could, I think, have a system where performance clearly depends on some key beliefs. So then you still could change the beliefs, but that change would significantly damage capabilities. I guess that could be good enough?
E.g. I think if you somehow made me really believe the Earth is flat, this would harm my research skills. Or perhaps even if you made me e.g. hate gays.
Consider backdoors, as in the Sleeper Agents paper. So, a conditional policy triggered by some specific user prompt. You could probably quite easily fine-tune a recent model to be pro-life on even days and pro-choice on odd days. These would be just fully general, consistent behaviors, i.e. you could get a model that would present these date-dependant beliefs consistently among all possible contexts.
Now, imagine someone controls all of the environment you live in. Like, literally everything, except that they don’t have any direct access to your brain. Could they implement similar backdoor in you? They could force you to behave that way, buy could they make you really believe that?
My guess is not, and one reason (there are also others but that’s a different topic) is that humans like me and you have a very deep belief “current date doesn’t make a difference for whether abortion is good and bad” that is extremely hard to overwrite without hurting our cognition in other contexts. Like, what is even good and bad if in some cases they flip at midnight?
So couldn’t we have LLMs be like humans in this regard? I don’t see a good reason for why this wouldn’t be possible.
You could, I think, have a system where performance clearly depends on some key beliefs. So then you still could change the beliefs, but that change would significantly damage capabilities. I guess that could be good enough? E.g. I think if you somehow made me really believe the Earth is flat, this would harm my research skills. Or perhaps even if you made me e.g. hate gays.
Oh, that’s a really interesting design approach, I haven’t run across something like that before.
Consider backdoors, as in the Sleeper Agents paper. So, a conditional policy triggered by some specific user prompt. You could probably quite easily fine-tune a recent model to be pro-life on even days and pro-choice on odd days. These would be just fully general, consistent behaviors, i.e. you could get a model that would present these date-dependant beliefs consistently among all possible contexts.
Now, imagine someone controls all of the environment you live in. Like, literally everything, except that they don’t have any direct access to your brain. Could they implement similar backdoor in you? They could force you to behave that way, buy could they make you really believe that?
My guess is not, and one reason (there are also others but that’s a different topic) is that humans like me and you have a very deep belief “current date doesn’t make a difference for whether abortion is good and bad” that is extremely hard to overwrite without hurting our cognition in other contexts. Like, what is even good and bad if in some cases they flip at midnight?
So couldn’t we have LLMs be like humans in this regard? I don’t see a good reason for why this wouldn’t be possible.
I’m not sure if this is a great analogy : )