[Question] How would you improve ChatGPT’s filtering?

Noah Scales 10 Dec 2022 8:05 UTC

9 points

I am wondering how Less Wrong would improve ChatGPT’s filtering. I’m reading through the comments on breaking OpenAI’s filtering and see plenty of analysis of the safeguards’ weaknesses. There’s always the chance that some group could steal ChatGPT’s source code and remove ad hoc additions to it, so I’ll ask the question in this form:

How would you change ChatGPT’s purpose, design, or function to enforce topic and content filtering of its output?

Thanks for your thoughts.


ChatGPT AI

Peter Chatain 10 Dec 2022 19:36 UTC
3 points
0
Although this isn’t a direct answer, I think something changed recently with ChatGPT such that it is now much better at filtering out illegal advice. The change appears to be more complex than simply running a filter over the words in the prompt or in ChatGPT’s output. By “recently” I mean in the last 24 hours; many tricks to “jailbreak” ChatGPT no longer work.

It gives the impression that they modified its design so that it is trained not to provide illegal information.
- ChristianKl 16 Dec 2022 13:09 UTC
  4 points
  1
  Parent
  It feels to me like the update today made it even better at filtering out answers that OpenAI doesn’t want it to give.
  It seems to me like they basically run on:
  “Have an AI that flags whether or not a prompt or an answer violates the rules. Mark the text red if it does. Offer the user a way to say that text was marked wrongly as violating the rules.”
  
  This then gives them training data they can use to improve their filtering. Given how much ChatGPT is used, this method will allow them to filter out more and more of what they want to filter out.
  - Noah Scales 17 Dec 2022 9:56 UTC
    1 point
    0
    Parent
    Huh, ok. I will have to check out the new version. Thanks!
- Noah Scales 11 Dec 2022 9:18 UTC
  1 point
  0
  Parent
  Hmm, that’s interesting. Thanks Peter!
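
The flag-and-feedback loop ChristianKl describes in the thread above can be sketched roughly as follows. This is a toy illustration only, not OpenAI's actual method: the keyword list, `flags_violation`, and `FeedbackLog` are all hypothetical names, and a real system would presumably use a learned classifier rather than keyword matching. The point is just that each user dispute becomes a labeled example that could feed later training.

```python
from dataclasses import dataclass, field
from typing import List, Tuple

# Hypothetical rule list; a stand-in for a learned classifier.
BLOCKED_TERMS = {"hotwire", "counterfeit"}


def flags_violation(text: str) -> bool:
    """Return True if the text contains any blocked term (toy rule check)."""
    words = text.lower().split()
    return any(term in words for term in BLOCKED_TERMS)


@dataclass
class FeedbackLog:
    """Collects user disputes as (text, is_violation) training examples."""
    training_data: List[Tuple[str, bool]] = field(default_factory=list)

    def report_false_positive(self, text: str) -> None:
        # Only text that was actually flagged can be disputed.
        if flags_violation(text):
            # The user claims the flag was wrong, so the label is False.
            self.training_data.append((text, False))


log = FeedbackLog()
prompt = "a history of counterfeit coins"
if flags_violation(prompt):
    # In the UI this is where the text would be marked red;
    # here the user immediately disputes the flag.
    log.report_false_positive(prompt)
```

After the dispute, `log.training_data` holds one example saying the flagged text was not a violation; accumulating such examples at scale is what would let the filter improve over time.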
JBlack 12 Dec 2022 5:29 UTC
0 points
0
I would improve the filtering by reducing it to zero.
- Noah Scales 12 Dec 2022 8:45 UTC
  1 point
  0
  Parent
  Interesting, and why is that an improvement?
