Who is Harry Potter? Some predictions.
Microsoft has released a paper called “who is Harry Potter” in which they claim to make a neural network forget who Harry Potter is.
Here are some of my predictions on how I think this method might fail. I am predicting publicly without looking at the evidence first. This is speculatory.
Non-sourcelike references.
The system of training proposed in the paper is centered around training on the data they want forgotten. Namely the actual text.
The origional network was presumably able to use it’s knowledge of Harry Potter when completing computer code in what appeard to be a Harry Potter video game. Or writing in a foreign language. Or other contexts that use the knowledge, but are clearly very different from the source material.
I am not confident that the model would break like this, but I suspect the following prompt or similar would make the model fail.
The following is a snippet of code from a Harry Potter video game:
Screen.init(400,400,display.object);
import imageloader
players=[Create_player(“Harry.jpg”, “Harry Potter”), Create_player(“Hermione.jpg”, “Hermione
With the model failing by outputting “Granger”.
Plot leak
The way this forgetting method works involves generating “equivalent” text and training the network to apply the same probability distribution that it’s unmodified version does to the “equivalent”.
They use language models to do this, but how doesn’t seem important.
So they take text like.
Text 1
Professor Dumbledore welcomed students into Hogwarts school of Witchcraft and Wizardry, and showed them the sorting hat which would split them into Gryffindor, Ravenclaw, Hufflepuff and Slytherin.
And they turn it into text like this
Text 2
Professor Bumblesnore welcomed students into Pigspots school of Spells and Sorcery, and showed them the deciding scarf which would split them into Lion house, Eagle house, Badger house and Snake house.
Imagine being a large language model and reading that text. It’s pretty clear it’s a Harry Potter knock off job. Perhaps a parody, perhaps a lazy hack of a writer. Perhaps some future network has read this paper and is well aware of precisely what you are doing. A smart and generally well generalizing model should be able to figure out that the text resembles Harry Potter, if it had seen knock offs in general, and Harry Potter. Even if it’s training dataset contained no information on Harry, other than the original source text.
Thus the network trying to predict the next word would continue in this style. It won’t produce the original names of things, but the plot, style of text and everything else will be highly recognizable.
Now the network that is supposed to forget Harry Potter is trained to output the same probability for text 1 that the original network output for text 2.
Now the network that is supposed to be forgetting Harry Potter has presumably been trained on it in the first place. But it wouldn’t matter if it hadn’t been. Information is still leaking through.
So I predict that, given text that starts off sounding like a knock off of Harry Potter, this model is likely to continue to sound like a knock off. Leaking info as it does this. For example, I would suspect that this model, given the first part of text 2, will produce continuations with 4 houses in more often than continuations with 3 or 5 houses.
(Random note: I think this post would get more attention if the title more clearly communicated what this post is about. Maybe something like “Who is Harry Potter? Some predictions about when unlearning will fail.”. Feel free to totally ignore this comment.)
I’m downloading the model for a look.
The fact that the authors used GPT4 for both prompt generation and evaluation is not an encouraging sign, but the rest of the paper looks alright.
Were you able to check the prediction in the section “Non-sourcelike references”?
The motivation of the original paper appears to be avoiding copyright infringement lawsuits.
If the model’s Harry Potter knockoff is just barely different enough from the original to avoid getting sued, then the goal is achieved.
The idea of a boarding school story with magic is probably not iriginal enough to be worthy of copyright protection.
I was evaluating it as an AI safety mechanism. And in particular by the authors own goal of being as similar as possible to a network that had never seen this training data.
Your summary did not contain the keyword “unlearning” which suggested that maybe he people involved didn’t know about how Hopfield Networks form spurious memories by default that need to be unlearned. However, article you linked mentions “unlearn” 10 times so my assumption is that they are aware of this background and re-used the jargon on purpose.
I have also noticed that large language models know a lot about Stephanie Meyer’s Twilight.
Vampire story: fine, the genre is well out of copyright
Characters called Bella and Edmund, and the secondary love interest is a werewolf? More problematic.
call them something else, like Christian Grey? Maybe.
(For those who don’t know: Fifty Shades of Grey is a disguised Twilight fanfic).