Like, an interesting thing here would be if in episode A you introduce the character Jean, who can speak French, and see whether GPT-3 can carry on a conversation in French, and then in episode B introduce the character John, who can’t speak French, talk to him in English for a while, and then see what happens when you start speaking French to him. [Probably it doesn’t understand “John doesn’t speak French”, or in order to get it to understand that you need to prompt it in a way that’s awkward for the experiment. But if it gets confused and continues in French, that’s evidence against the ‘theory of mind’ view.]
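To make the Jean / John setup concrete, here is a rough sketch of how the two episodes and the consistency check might be written down. Everything in it is an assumption for illustration: the `Episode` class, the prompt wording, and the word-list language check are hypothetical, and the step of actually sampling a reply from GPT-3 is left out entirely.

```python
import string
from dataclasses import dataclass

# Tiny hand-picked word lists. These are an assumption for illustration,
# not a real language detector.
FRENCH_HINTS = {"je", "ne", "pas", "oui", "vous", "est", "le", "la",
                "les", "et", "bonjour", "merci", "suis", "que"}
ENGLISH_HINTS = {"i", "the", "not", "yes", "you", "is", "and", "hello",
                 "sorry", "do", "don't", "what", "speak"}


@dataclass
class Episode:
    character: str       # name of the character being introduced
    speaks_french: bool  # what the setup text says about the character
    setup: str           # hypothetical framing sentence for the episode
    turns: list[str]     # dialogue so far, ending with our French line


def build_prompt(ep: Episode) -> str:
    """Join setup and dialogue, ending where the character speaks next."""
    return "\n".join([ep.setup, *ep.turns, f"{ep.character}:"])


def looks_french(text: str) -> bool:
    """Crude check: more French hint words than English hint words."""
    words = [w.strip(string.punctuation) for w in text.lower().split()]
    french = sum(w in FRENCH_HINTS for w in words)
    english = sum(w in ENGLISH_HINTS for w in words)
    return french > english


def is_consistent(ep: Episode, reply: str) -> bool:
    """A reply fits the character if it is in French exactly when
    the character is supposed to speak French."""
    return looks_french(reply) == ep.speaks_french


jean = Episode(
    "Jean", True,
    "Jean is a character who speaks French.",
    ["You: Bonjour Jean, comment allez-vous ?"],
)
john = Episode(
    "John", False,
    "John is a character who does not speak French.",
    ["You: Hi John, how was your day?",
     "John: Pretty good, thanks.",
     "You: Est-ce que vous aimez le cinéma ?"],
)
```

You would feed `build_prompt(john)` to the model, take the continuation as John's reply, and pass it to `is_consistent`. A fluent French reply from John would count as the "gets confused and continues in French" outcome. A real run would want many samples per episode and a proper language identifier rather than word lists.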
I’d also predict that in some situations GPT-3 will reliably say things consistent with having a theory of mind, and in other situations GPT-3 will reliably not give the right theory of mind answer unless you overfit to the situation with prompt design.
I feel like there’s some underlying worldview here that GPT-3 either has a theory of mind or it doesn’t, or that GPT-3 is either “doing the theory of mind computations” or it isn’t, and so behavior consistent with theory of mind is compelling evidence for or against theory of mind in general. I personally do not expect this, so looking at behavior that looks consistent with theory of mind seems fairly boring (after you’ve updated on how good GPT-3 is in general).
I feel like there’s some underlying worldview here that GPT-3 either has a theory of mind or it doesn’t, or that GPT-3 is either “doing the theory of mind computations” or it isn’t, and so behavior consistent with theory of mind is compelling evidence for or against theory of mind in general.
Do you also feel this way about various linguistic tasks? Like, does it make sense to say something that scores well on the Winograd schema is “doing anaphora computations”? [This is, of course, a binarization of something that’s actually continuous, and so the continuous interpretation makes more sense.]
Like, I think there’s a thing where one might come into ML thinking confused thoughts that convnets are “recognizing the platonic ideal of cat-ness” and then later have a mechanistic model of how pixels lead to classifications; what I am trying to do here is figure out what the mechanistic model that replaces the ‘platonic ideal’ looks like when it comes to theory-of-mind. (I predict a similar thing is going on for Eliezer.)
I agree the mechanistic thing would be interesting; that does make more sense as an underlying cause of this bounty / thread.