It seems relevant that you had to prompt ChatGPT specifically to notice the error, by telling it that an error existed, and even which paragraph the error was located in. For a system whose primary use-case is generative, it’s important that the system not just have the ability to identify consistency failures, but to avoid generating them in the first place. If the system itself generates stories with inconsistencies, even if it can then point out those inconsistencies when prompted (albeit with substantial amounts of handholding along the way), it seems reasonable to maintain that the system in some sense doesn’t “grok” the distinction in question.
Incidentally, you can give ChatGPT a completely fine and consistent story, tell it to spot an inconsistency, and it’ll happily confabulate one for you. Naturally, it has a bias for plausible-sounding errors, and so if this tendency to prefer plausible errors coincides with a story in which an actual inconsistency exists, it’s quite likely that its response will point out the real inconsistency, since that’s a more plausible error than a confabulated one. In some sense, you could argue that this means it just got lucky (though not entirely lucky, of course, since it obviously needs to be able to recognize the real inconsistency as in some sense “more plausible” than any confabulated ones).
And on the flip side of the coin, there are some domains in which “board vision” is so hard that ChatGPT simply flails around; a key example, again, is chess. If you give it a corrupted PGN file with illegal moves and ask it to identify the first illegal move (with the caveat that the first illegal move occurs appreciably far into the game, so it doesn’t e.g. fall within opening theory, which ChatGPT has memorized), it basically never identifies the correct move, and never, ever gives the correct explanation for why the move is illegal.
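For contrast, this check is trivial for a program that maintains an explicit board state. Here’s a minimal sketch using the python-chess library, assuming the moves have already been extracted as a plain list of SAN strings (a real PGN would also need headers and annotations handled):

```python
# Minimal sketch: find the first illegal move by replaying the game against
# an explicit board state. Assumes the python-chess library and a plain list
# of SAN moves rather than a full PGN file.
import chess

def first_illegal_move(san_moves):
    board = chess.Board()
    for ply, san in enumerate(san_moves, start=1):
        try:
            # Parses the SAN against the current position; raises a ValueError
            # subclass if the move is illegal, invalid, or ambiguous.
            board.push_san(san)
        except ValueError as err:
            return ply, san, str(err)
    return None  # no illegal moves found

# Illustrative example: Bxe5 is illegal here, since neither white bishop can reach e5.
corrupted = ["e4", "e5", "Nf3", "Nc6", "Bxe5"]
print(first_illegal_move(corrupted))
```

The point isn’t that the check is hard to code; it’s that the program carries an explicit, updatable model of the position, which is precisely the thing “board vision” is gesturing at and the thing ChatGPT appears to lack.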
You could argue that these are all merely signs that ChatGPT’s understanding hasn’t reached the same level as that of humans (or more generally, an entity with what Marcello calls “board vision”). But it’s also possible to model the situation such that the thing Marcello is calling “board vision” and the thing ChatGPT has are appreciably distinct from each other, in a way that isn’t a mere difference of degree—so that e.g. training bigger models doesn’t necessarily fix the issue. Certainly, my preferred usage of the word “grok” doesn’t usually make room for someone to “grok” something while still consistently erring unless handheld; if a human did that, I’d simply say they didn’t grok the topic in question, and I don’t really see a good reason to alter that standard for ChatGPT.
(Oh, and w.r.t. the point about “training bigger models”: since Sydney/Bing seems like a big deal these days, it seems worth saying explicitly that everything I’ve seen from Sydney remains consistent with the idea that it still, nonetheless, lacks what Marcello is calling “board vision”—at least if you buy the frame under which “board vision” is basically binary; you either have it or you don’t.)
When you say “board vision”, what you are really saying is that the model needs some kind of mental representation of the world. For example, on a whiteboard, you could draw a crude picture of “the world” with stick figures for the people in it, then “the apocalypse” as some crude drawing of something bad at about the same scale as the world, and then “the world” again, now with no people in it.
Notably, this works extremely well for humans. I have found it basically impossible to express even a modestly complex idea without a tool like this. Humans just fail on verbal descriptions above a certain level of complexity, even when communicating with humans at statistically unlikely intelligence levels. Humans only have “board vision” for the narrow domains they are experts in; in those domains they don’t need a whiteboard.
So you need some type of schema in which a large class of hypotheses can be represented (images are probably not a good way; you need a graph structure), and then the model would need to generate its outputs in multiple passes, where it constructs this representation, then constructs text based on the original prompt plus the representation, and so on. A rough sketch of what that loop might look like is below.
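Concretely, one very rough, hypothetical shape this could take: a first pass extracts a small graph of entities and facts from a draft, a second pass checks the graph for contradictions, and the model only finalizes text once the representation is clean. Everything named below (the `WorldGraph` schema, the `generate` and `extract_graph` callables) is illustrative, not any existing API:

```python
# Rough sketch of a multi-pass "representation first" generation loop.
# The graph schema and the generate()/extract_graph() callables are
# placeholders for whatever extraction/generation machinery you actually have.
from dataclasses import dataclass, field

@dataclass
class WorldGraph:
    # entity -> set of asserted facts, e.g. "the world" -> {"contains people"}
    facts: dict = field(default_factory=dict)

    def assert_fact(self, entity, fact):
        self.facts.setdefault(entity, set()).add(fact)

    def contradictions(self):
        # Toy check: a fact and its explicit negation asserted about the same entity.
        found = []
        for entity, facts in self.facts.items():
            for f in facts:
                if f.startswith("not ") and f[4:] in facts:
                    found.append((entity, f[4:]))
        return found

def generate_with_representation(prompt, generate, extract_graph, max_passes=3):
    """generate(prompt) -> text; extract_graph(text) -> WorldGraph. Both hypothetical."""
    draft = generate(prompt)
    for _ in range(max_passes):
        graph = extract_graph(draft)
        problems = graph.contradictions()
        if not problems:
            return draft
        # Re-generate, conditioning on the original prompt plus the detected problems.
        draft = generate(
            f"{prompt}\n\nFix these contradictions: {problems}\n\nDraft:\n{draft}"
        )
    return draft
```

The design intent is just that consistency gets checked against an explicit structured representation between passes, rather than being left implicit in a single left-to-right generation.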