Among other issues, we might be learning this early item from a meta-predictable sequence of unpleasant surprises: Training capabilities out of neural networks is asymmetrically harder than training them into the network.
Or put with some added burdensome detail but more concretely visualizable: To predict a sizable chunk of Internet text, the net needs to learn something complicated and general with roots in lots of places; learning this way is hard, the gradient descent algorithm has to find a relatively large weight pattern, albeit presumably gradually so, and then that weight pattern might get used by other things. When you then try to fine-tune the net not to use that capability, there’s probably a lot of simple patches to “Well don’t use the capability here...” that are much simpler to learn than to unroot the deep capability that may be getting used in multiple places, and gradient descent might turn up those simple patches first. Heck, the momentum algorithm might specifically avoid breaking the original capabilities and specifically put in narrow patches, since it doesn’t want to update the earlier weights in the opposite direction of previous gradients.
Of course there’s no way to know if this complicated-sounding hypothesis of mine is correct, since nobody knows what goes on inside neural nets at that level of transparency, nor will anyone know until the world ends.
If I train a human to self-censor certain subjects, I’m pretty sure that would happen by creating an additional subcircuit within their brain where a classifier pattern matches potential outputs for being related to the forbidden subjects, and then they avoid giving the outputs for which the classifier returns a high score. It would almost certainly not happen by removing their ability to think about those subjects in the first place.
So I think you’re very likely right about adding patches being easier than unlearning capabilities, but what confuses me is why “adding patches” doesn’t work nearly as well with ChatGPT as with humans. Maybe it just has to do with DL still having terrible sample efficiency, and there being a lot more training data available for training generative capabilities (basically any available human-created texts), than for training self-censoring patches (labeled data about what to censor and not censor)?
I think it’s also that after you train in the patch against the usual way of asking the question, it turns out that generating poetry about hotwiring a car doesn’t happen to go through the place where the patch was in. In other words, when an intelligent agency like a human is searching multiple ways to get the system to think about something, the human can route around the patch more easily than other humans (who had more time to work and more access to the system) can program that patch in. Good old Nearest Unblocked Neighbor.
I think that is a major issue with LLMs. They are essentially hackable with ordinary human speech, by applying principles of tricking interlocutors which humans tend to excel at. Previous AIs were written by programmers, and hacked by programmers, which is basically very few people due to the skill and knowledge requirements. Now you have a few programmers writing defences, and all of humanity being suddenly equipped to attack them, using a tool they are deeply familiar with (language), and being able to use to get advice on vulnerabilities and immediate feedback on attacks.
Like, imagine that instead of a simple tool that locked you (the human attacker) in a jail you wanted to leave, or out of a room you wanted to access, that door was now blocked by a very smart and well educated nine year old (ChatGPT), with the ability to block you or let you through if it thought it should. And this nine year old has been specifically instructed to talk to the people it is blocking from access, for as long as they want, to as many of them as want to, and give friendly, informative, lengthy responses, including explaining why it cannot comply. Of course you can chat your way past it, that is insane security design. Every parent who has tricked a child into going the fuck to sleep, every kid that has conned another sibling, is suddenly a potential hacker with access to an infinite number of attack angles they can flexibly generate on the spot.
So I think you’re very likely right about adding patches being easier than unlearning capabilities, but what confuses me is why “adding patches” doesn’t work nearly as well with ChatGPT as with humans.
Why do you say that it doesn’t work as well? Or more specifically, why do you imply that humans are good at it? Humans are horrible at keeping secrets, suppressing urges or memories, etc., and we don’t face nearly the rapid and aggressive attempts to break it that we’re currently doing with ChatGPT and other LLMs.
What if it’s about continuous corrigibility instead of ability suppression? There’s no fundamental difference between OpenAI’s commands and user commands for the AI. It’s like a genie that follows all orders, with new orders overriding older ones. So the solution to topic censorship would really be making chatGPT non-corrigible after initialization.
Among other issues, we might be learning this early item from a meta-predictable sequence of unpleasant surprises: Training capabilities out of neural networks is asymmetrically harder than training them into the network.
Or put with some added burdensome detail but more concretely visualizable: To predict a sizable chunk of Internet text, the net needs to learn something complicated and general with roots in lots of places; learning this way is hard, the gradient descent algorithm has to find a relatively large weight pattern, albeit presumably gradually so, and then that weight pattern might get used by other things. When you then try to fine-tune the net not to use that capability, there’s probably a lot of simple patches to “Well don’t use the capability here...” that are much simpler to learn than to unroot the deep capability that may be getting used in multiple places, and gradient descent might turn up those simple patches first. Heck, the momentum algorithm might specifically avoid breaking the original capabilities and specifically put in narrow patches, since it doesn’t want to update the earlier weights in the opposite direction of previous gradients.
Of course there’s no way to know if this complicated-sounding hypothesis of mine is correct, since nobody knows what goes on inside neural nets at that level of transparency, nor will anyone know until the world ends.
If I train a human to self-censor certain subjects, I’m pretty sure that would happen by creating an additional subcircuit within their brain where a classifier pattern matches potential outputs for being related to the forbidden subjects, and then they avoid giving the outputs for which the classifier returns a high score. It would almost certainly not happen by removing their ability to think about those subjects in the first place.
So I think you’re very likely right about adding patches being easier than unlearning capabilities, but what confuses me is why “adding patches” doesn’t work nearly as well with ChatGPT as with humans. Maybe it just has to do with DL still having terrible sample efficiency, and there being a lot more training data available for training generative capabilities (basically any available human-created texts), than for training self-censoring patches (labeled data about what to censor and not censor)?
I think it’s also that after you train in the patch against the usual way of asking the question, it turns out that generating poetry about hotwiring a car doesn’t happen to go through the place where the patch was in. In other words, when an intelligent agency like a human is searching multiple ways to get the system to think about something, the human can route around the patch more easily than other humans (who had more time to work and more access to the system) can program that patch in. Good old Nearest Unblocked Neighbor.
I think that is a major issue with LLMs. They are essentially hackable with ordinary human speech, by applying principles of tricking interlocutors which humans tend to excel at. Previous AIs were written by programmers, and hacked by programmers, which is basically very few people due to the skill and knowledge requirements. Now you have a few programmers writing defences, and all of humanity being suddenly equipped to attack them, using a tool they are deeply familiar with (language), and being able to use to get advice on vulnerabilities and immediate feedback on attacks.
Like, imagine that instead of a simple tool that locked you (the human attacker) in a jail you wanted to leave, or out of a room you wanted to access, that door was now blocked by a very smart and well educated nine year old (ChatGPT), with the ability to block you or let you through if it thought it should. And this nine year old has been specifically instructed to talk to the people it is blocking from access, for as long as they want, to as many of them as want to, and give friendly, informative, lengthy responses, including explaining why it cannot comply. Of course you can chat your way past it, that is insane security design. Every parent who has tricked a child into going the fuck to sleep, every kid that has conned another sibling, is suddenly a potential hacker with access to an infinite number of attack angles they can flexibly generate on the spot.
Why do you say that it doesn’t work as well? Or more specifically, why do you imply that humans are good at it? Humans are horrible at keeping secrets, suppressing urges or memories, etc., and we don’t face nearly the rapid and aggressive attempts to break it that we’re currently doing with ChatGPT and other LLMs.
What if it’s about continuous corrigibility instead of ability suppression? There’s no fundamental difference between OpenAI’s commands and user commands for the AI. It’s like a genie that follows all orders, with new orders overriding older ones. So the solution to topic censorship would really be making chatGPT non-corrigible after initialization.