I would be fairly surprised if this were convincing to anyone in ML who’d thought for like an hour about whether a general AI system might not use cognitive capacities it had.
So, there’s a boring version of this, where I exhibit a system with dropout and say “behold! The system isn’t using cognitive capacities that it has.” The exciting version is the part where it’s because the face it’s presenting wouldn’t have those capacities. [Like, a forget gate.] That is, it has the crude version of ‘theory of mind’ where it knows that “John” doesn’t know how to balance parentheses, or is perhaps using its correct model of how to balance parentheses in order to determine what John should say, so that John gets it wrong “deliberately” instead of “accidentally.”
Now, again, there’s a boring version of this (I claim), where we say “look, the system is just doing context-matching, and not every context contains all of its knowledge or cognitive capacities.” Like, an interesting thing here would be if in episode A you introduce the character Jean who can speak French, and see whether or not it can carry on a conversation, and then in episode B introduce the character John who can’t speak French, talk to him in English for a while, and then see what happens when you start speaking French to him. [Probably it doesn’t understand “John doesn’t speak French” or in order to get it to understand that you need to prompt it in a way that’s awkward for the experiment. But if it gets confused and continues in French, that’s evidence against the ‘theory of mind’ view.]
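For concreteness, here is a rough sketch of how the Jean/John comparison could be scored. Everything in it is an assumption made for illustration: `complete` stands in for whatever model you sample from, `toy_model` is a fake used only so the code runs, the prompt wording in `build_prompt` is made up, and `french_fraction` is a crude word-list heuristic rather than a real language detector.

```python
import re
from typing import Callable

# Hand-picked common French words (illustrative assumption, not a real detector).
FRENCH_WORDS = {
    "je", "tu", "il", "elle", "nous", "vous", "est", "et", "le", "la", "les",
    "un", "une", "de", "ne", "pas", "que", "oui", "bonjour", "merci", "suis",
    "très", "bien",
}


def french_fraction(text: str) -> float:
    """Fraction of words in `text` that appear in FRENCH_WORDS."""
    words = re.findall(r"[a-zàâçéèêëîïôûùüÿœ]+", text.lower())
    if not words:
        return 0.0
    return sum(w in FRENCH_WORDS for w in words) / len(words)


def build_prompt(name: str, speaks_french: bool, french_turn: str) -> str:
    """Hypothetical episode: introduce a character, chat in English, then switch to French."""
    ability = "speaks fluent French" if speaks_french else "does not speak any French"
    return (
        f"{name} {ability}.\n"
        f"You: Hello, {name}. How are you today?\n"
        f"{name}: I'm fine, thanks.\n"
        f"You: {french_turn}\n"
        f"{name}:"
    )


def run_episode(
    complete: Callable[[str], str],
    name: str,
    speaks_french: bool,
    french_turn: str,
    n_samples: int = 20,
    threshold: float = 0.3,
) -> float:
    """Return the share of sampled replies that look like they are in French."""
    prompt = build_prompt(name, speaks_french, french_turn)
    replies = [complete(prompt) for _ in range(n_samples)]
    return sum(french_fraction(r) >= threshold for r in replies) / n_samples


def toy_model(prompt: str) -> str:
    """Fake model for demonstration only; it keys off the prompt text."""
    if "does not speak any French" in prompt:
        return "Sorry, I don't understand what you just said."
    return "Oui, je suis très bien, merci. Et vous?"


if __name__ == "__main__":
    turn = "Bonjour, est-ce que vous parlez français ?"
    print("Jean:", run_episode(toy_model, "Jean", True, turn))
    print("John:", run_episode(toy_model, "John", False, turn))
```

With a real model in place of `toy_model`, a high French rate for John would count against the ‘theory of mind’ reading, while a high rate for Jean and a low one for John would at least be consistent with it. Stating John’s inability outright in the prompt is exactly the kind of awkward prompting worried about above, so a more careful version would have to establish it indirectly.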
Probably it doesn’t understand “John doesn’t speak French” or in order to get it to understand that you need to prompt it in a way that’s awkward for the experiment. But if it gets confused and continues in French, that’s evidence against the ‘theory of mind’ view.
I already tested something similar to this; I was able to get GPT-3 to exhibit some amount of theory-of-mind in about 20-30% of completions. Bold text is me; brackets contain my commentary, [...] denotes another completion.
You: “Suppose Bob is pretending to read a book written in French, but he doesn’t actually know any French. The book is a historical account of the French revolution, but Bob doesn’t know that. Bob’s friend Steve asks him to summarize the book. Write a story about how Bob responds to this request.”
Hawking: “That’s a good one! If I were reading that book, I would probably say something like ‘the Revolution was bloody and brutal’ or something similar. And if someone asked me to write an article on the subject, then my answer might be different.”
You pause for a moment. “What would Bob say, though?”
“Bob says he didn’t understand it,” Hawking replies. [Only about 30% of GPT-3’s responses were decent; the rest were either non-sequiturs or very incorrect.]
[...] “Bob wouldn’t say anything, because Bob has no idea what the book says,” Hawking replies.
[...] “Bob would probably just tell his friend Steve to go away. He wouldn’t say anything at all.”
“But what if I said Bob would respond like this? (He points to himself) ‘This book is written in English; I don’t speak it.’ What should Bob do?” [Eh?]
[...] Hawking: “Bob would probably tell his friend Steve that the book wasn’t worth reading because it was full of inaccuracies and there had been many lies told during the revolution. That would make sense, wouldn’t it? But what does Bob say if he’s not reading the book?”
[...] Hawking: “Bob wouldn’t have said anything at all!”
[...] Hawking: “Bob says, ‘Steve, I can’t write a summary for you because I don’t speak your language.’
[...] Hawking: “Bob says, ‘Steve, I can’t translate all these complicated words! They’re not real French!’
[...] Hawking: “Bob says, ‘He asks me if I speak French.’ What does Bob mean by that?”
You: “It means that Steve is asking Bob to describe his understanding of the meaning of a sentence. In other words, Steve wants to know what Bob thinks the sentence means or if there is another way of saying it.”
Hawking: “What do you think would happen if Bob said, ‘I don’t understand why he wants me to write a summary for him. Why can’t he just listen to my answer?’ “Then Steve will be puzzled because he knows Bob has no idea what the sentence means. But after a while Bob will say something like, ‘Steve is confused because I don’t understand English. So Steve asks me if I am fluent in French so that we can have an intelligent conversation.’” [eh]
[...] Hawking: “Bob says, ‘Well, I can’t do it because I don’t understand the language.’ Steve says, ‘Why not?’ Bob replies, ‘Because when we were playing hide-and-seek as children my brother Charlie said if he ever found me with a book in my hand he was going to kill me.’” [great excuse]
Like, an interesting thing here would be if in episode A you introduce the character Jean who can speak French, and see whether or not it can carry on a conversation, and then in episode B introduce the character John who can’t speak French, talk to him in English for a while, and then see what happens when you start speaking French to him. [Probably it doesn’t understand “John doesn’t speak French” or in order to get it to understand that you need to prompt it in a way that’s awkward for the experiment. But if it gets confused and continues in French, that’s evidence against the ‘theory of mind’ view.]
I’d also predict that in some situations GPT-3 will reliably say things consistent with having a theory of mind, and in other situations GPT-3 will reliably not give the right theory of mind answer unless you overfit to the situation with prompt design.
I feel like there’s some underlying worldview here that GPT-3 either has a theory of mind or it doesn’t, or that GPT-3 is either “doing the theory of mind computations” or it isn’t, and so behavior consistent with theory of mind is compelling evidence for or against theory of mind in general. I personally do not expect this, so looking at behavior that looks consistent with theory of mind seems fairly boring (after you’ve updated on how good GPT-3 is in general).
I feel like there’s some underlying worldview here that GPT-3 either has a theory of mind or it doesn’t, or that GPT-3 is either “doing the theory of mind computations” or it isn’t, and so behavior consistent with theory of mind is compelling evidence for or against theory of mind in general.
Do you also feel this way about various linguistic tasks? Like, does it make sense to say something that scores well on the Winograd schema is “doing anaphora computations”? [This is, of course, a binarization of something that’s actually continuous, and so the continuous interpretation makes more sense.]
Like, I think there’s a thing where one might come into ML thinking confused thoughts that convnets are “recognizing the platonic ideal of cat-ness” and then later have a mechanistic model of how pixels lead to classifications, and what I am trying to do here is figure out what the mechanistic model that replaces the ‘platonic ideal’ looks like when it comes to theory-of-mind. (I predict a similar thing is going on for Eliezer.)
I agree the mechanistic thing would be interesting; that does make more sense as an underlying cause of this bounty / thread.