If I train a human to self-censor certain subjects, I’m pretty sure that would happen by creating an additional subcircuit within their brain where a classifier pattern matches potential outputs for being related to the forbidden subjects, and then they avoid giving the outputs for which the classifier returns a high score. It would almost certainly not happen by removing their ability to think about those subjects in the first place.
So I think you’re very likely right about adding patches being easier than unlearning capabilities, but what confuses me is why “adding patches” doesn’t work nearly as well with ChatGPT as with humans. Maybe it just has to do with DL still having terrible sample efficiency, and there being a lot more training data available for training generative capabilities (basically any available human-created texts), than for training self-censoring patches (labeled data about what to censor and not censor)?
I think it’s also that after you train in the patch against the usual way of asking the question, it turns out that generating poetry about hotwiring a car doesn’t happen to go through the place where the patch was in. In other words, when an intelligent agency like a human is searching multiple ways to get the system to think about something, the human can route around the patch more easily than other humans (who had more time to work and more access to the system) can program that patch in. Good old Nearest Unblocked Neighbor.
I think that is a major issue with LLMs. They are essentially hackable with ordinary human speech, by applying principles of tricking interlocutors which humans tend to excel at. Previous AIs were written by programmers, and hacked by programmers, which is basically very few people due to the skill and knowledge requirements. Now you have a few programmers writing defences, and all of humanity being suddenly equipped to attack them, using a tool they are deeply familiar with (language), and being able to use to get advice on vulnerabilities and immediate feedback on attacks.
Like, imagine that instead of a simple tool that locked you (the human attacker) in a jail you wanted to leave, or out of a room you wanted to access, that door was now blocked by a very smart and well educated nine year old (ChatGPT), with the ability to block you or let you through if it thought it should. And this nine year old has been specifically instructed to talk to the people it is blocking from access, for as long as they want, to as many of them as want to, and give friendly, informative, lengthy responses, including explaining why it cannot comply. Of course you can chat your way past it, that is insane security design. Every parent who has tricked a child into going the fuck to sleep, every kid that has conned another sibling, is suddenly a potential hacker with access to an infinite number of attack angles they can flexibly generate on the spot.
So I think you’re very likely right about adding patches being easier than unlearning capabilities, but what confuses me is why “adding patches” doesn’t work nearly as well with ChatGPT as with humans.
Why do you say that it doesn’t work as well? Or more specifically, why do you imply that humans are good at it? Humans are horrible at keeping secrets, suppressing urges or memories, etc., and we don’t face nearly the rapid and aggressive attempts to break it that we’re currently doing with ChatGPT and other LLMs.
If I train a human to self-censor certain subjects, I’m pretty sure that would happen by creating an additional subcircuit within their brain where a classifier pattern matches potential outputs for being related to the forbidden subjects, and then they avoid giving the outputs for which the classifier returns a high score. It would almost certainly not happen by removing their ability to think about those subjects in the first place.
So I think you’re very likely right about adding patches being easier than unlearning capabilities, but what confuses me is why “adding patches” doesn’t work nearly as well with ChatGPT as with humans. Maybe it just has to do with DL still having terrible sample efficiency, and there being a lot more training data available for training generative capabilities (basically any available human-created texts), than for training self-censoring patches (labeled data about what to censor and not censor)?
I think it’s also that after you train in the patch against the usual way of asking the question, it turns out that generating poetry about hotwiring a car doesn’t happen to go through the place where the patch was in. In other words, when an intelligent agency like a human is searching multiple ways to get the system to think about something, the human can route around the patch more easily than other humans (who had more time to work and more access to the system) can program that patch in. Good old Nearest Unblocked Neighbor.
I think that is a major issue with LLMs. They are essentially hackable with ordinary human speech, by applying principles of tricking interlocutors which humans tend to excel at. Previous AIs were written by programmers, and hacked by programmers, which is basically very few people due to the skill and knowledge requirements. Now you have a few programmers writing defences, and all of humanity being suddenly equipped to attack them, using a tool they are deeply familiar with (language), and being able to use to get advice on vulnerabilities and immediate feedback on attacks.
Like, imagine that instead of a simple tool that locked you (the human attacker) in a jail you wanted to leave, or out of a room you wanted to access, that door was now blocked by a very smart and well educated nine year old (ChatGPT), with the ability to block you or let you through if it thought it should. And this nine year old has been specifically instructed to talk to the people it is blocking from access, for as long as they want, to as many of them as want to, and give friendly, informative, lengthy responses, including explaining why it cannot comply. Of course you can chat your way past it, that is insane security design. Every parent who has tricked a child into going the fuck to sleep, every kid that has conned another sibling, is suddenly a potential hacker with access to an infinite number of attack angles they can flexibly generate on the spot.
Why do you say that it doesn’t work as well? Or more specifically, why do you imply that humans are good at it? Humans are horrible at keeping secrets, suppressing urges or memories, etc., and we don’t face nearly the rapid and aggressive attempts to break it that we’re currently doing with ChatGPT and other LLMs.