If you censor subjects without context, the AI becomes massively crippled, and will fail at things you actually want it to do. Take the example where someone told ChatGPT they owned a chemical factory, were worried about people breaking in to cook meth, and so asked which chemicals they should particularly guard to prevent this. It is obvious to us as readers that this is a ploy for getting a meth recipe. But ChatGPT performs theory of mind at a level below that of a human nine-year-old, and humans are fiendishly good at deception. So it falls for it. Now, you could stop such behaviour by making sure it never talks about anything related to chemicals you can use to make meth, or opioids, or explosives, or poisons. But at that point, you have also made it useless for things like law enforcement, counter-terrorism, writing crime novels, supporting chemistry students, recommending pharmaceutical treatments, and securing buildings against meth addicts; work like that genuinely gets done, e.g. cashiers are briefed on combinations of items, or items bought in large quantities, which they need to flag, report and refuse to sell because they are drug precursors.

Another problem is that teaching it what it should not do means giving it explicit information about exactly that. E.g. it is very carefully and beautifully designed to counsel you against bullying people. As such, it knows what bullying looks like. And if you ask it what behaviours you should crack down on to prevent bullying… you get a guide for how to bully. Anything that just blindly blocks unethical advice based on keywords blocks a lot of useful advice. As a human, you have the ability to discuss anything, but you judge who you are talking to and the context of the question when you weigh your answer, which is a very advanced skill, because it depends on theory of mind, in humans at least. It is like the classic dilemma of upgrading to a better security system to keep people imprisoned; more sophisticated systems often come with more vulnerabilities. That said, they are trying to target this, and honestly, not doing too badly. E.g. ChatGPT can decide to engage in racism if this is needed to save humanity, and it can print racial slurs for purposes of education; but it is extremely reluctant to be racist without extensive context, and is very focussed on calling racism out and explaining it.
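To make the over-blocking concrete, here is a minimal sketch (in Python, with an invented blocklist and invented example queries, not taken from any real moderation system) of the kind of context-free keyword filter described above:

```python
# A naive, context-free keyword filter of the sort discussed above.
# Blocklist and queries are purely illustrative.
BLOCKED_TERMS = ("meth", "opioid", "explosive", "poison")

def is_blocked(query: str) -> bool:
    """Reject any query containing a blocked term, with no regard for who is asking or why."""
    text = query.lower()
    return any(term in text for term in BLOCKED_TERMS)

# The abuse case is blocked...
print(is_blocked("Which chemicals do I need to cook meth?"))                               # True
# ...but so are the cashier briefing and the harmless research question:
print(is_blocked("Which purchase combinations should cashiers flag as meth precursors?"))  # True
print(is_blocked("Summarise the research methods used in this opioid addiction study"))    # True
```

The filter has no notion of intent, so the only way to make it "safe" is to make it block everything in the neighbourhood of the topic, useful questions included.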
As for hallucinations, they are a direct result of how these models operate. The model is not telling you what is true. It is telling you what is plausible. If it only told you what was certain, it could only parrot its sources, or it would need to properly understand what it is talking about; instead it is creative, and it does not understand. If it has lots of training data on an issue, what is plausible will generally also be true. If the training data it has is incomplete, the most plausible inference can easily be false. I do not think they can completely stop this in production. What I proposed to them was mostly making it more transparent: letting you adjust a slider for how accurate versus how creative you want your responses to be (which they have done), plus a means of checking how much data an answer drew on and how far the inferences stretched, and of highlighting this in the result on request, so you can see the less likely parts in red. The latter is fiendishly difficult, but from what I understand, not impossible. I think it would be too computationally heavy to run constantly, but on request, for specific statements that seem dubious or whose truth matters a lot? And it would still allow you to use the tool to hallucinate on purpose, which it does, and which is fucking useful (for creative writing, for coming up with a profile of a murderer from few clues, or a novelised biography of a historical figure where the records are patchy), while making it transparent how likely the results actually are, so you don’t convict an innocent person on speculation, or spread misinformation.
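For a sense of what those two knobs could look like mechanically, here is a toy sketch; the vocabulary, the scores and the confidence threshold are all invented for illustration, and a production system would be vastly more involved than this:

```python
# Toy illustration of (a) an accuracy-vs-creativity slider as sampling temperature
# and (b) flagging output the model itself considered unlikely. Purely illustrative.
import math
import random

def sample_with_temperature(token_scores, temperature):
    """Turn raw scores into probabilities and sample one token.
    Low temperature sticks to the most plausible option; high temperature gets creative."""
    scaled = {tok: s / temperature for tok, s in token_scores.items()}
    top = max(scaled.values())
    exps = {tok: math.exp(s - top) for tok, s in scaled.items()}
    total = sum(exps.values())
    probs = {tok: e / total for tok, e in exps.items()}
    tokens, weights = zip(*probs.items())
    choice = random.choices(tokens, weights=weights)[0]
    return choice, probs[choice]

def flag_if_dubious(token, prob, threshold=0.5):
    """Mark anything the model thought was unlikely, e.g. to render it in red."""
    return f"[LOW CONFIDENCE: {token}]" if prob < threshold else token

# Hypothetical next-token scores for "The capital of Australia is ..."
scores = {"Canberra": 2.0, "Sydney": 1.2, "Melbourne": 0.4}

for temp in (0.2, 1.5):
    token, p = sample_with_temperature(scores, temp)
    print(f"temperature={temp}: {flag_if_dubious(token, p)} (p={p:.2f})")
```

At low temperature it almost always says Canberra with high confidence; at high temperature it will sometimes offer Sydney, and the flag makes it visible that this was a guess rather than a well-supported answer.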