They explicitly look for their test problems in the training set, and find very few examples, suggesting that the model really is learning “how to do addition”; further, when it is incorrect, it tends to make mistakes like “forgetting to carry a 1”.
Content:
Or is it learning how to imitate doing addition? (And the training set had people making those mistakes so it copies them.)
Sometimes people mention 2+2=4 when discussing truth, and then don’t bring up arithmetic again. (If the training data doesn’t include ‘pure’ arithmetic contexts, then assuming a given context is pure would itself require some basis.)
Style:
“and say they about these issues what you’d expect.” should presumably read “and they say about these issues what you’d expect”?
Or is it learning how to imitate doing addition? (And the training set had people making those mistakes so it copies them.)
The arithmetic section says that they checked the corpus and found that most of the arithmetic problems they tested do not appear in it at all:
Out of 2,000 addition problems we found only 17 matches (0.8%) and out of 2,000 subtraction problems we found only 2 matches (0.1%), suggesting that only a trivial fraction of the correct answers could have been memorized.
If it’s not memorizing/learning from explicit examples (but mostly from numbers used in normal ways and analytic writing), there can hardly be many explicit examples of simple arithmetic which are wrong, either, so ‘imitating errors’ seems unlikely.
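(For concreteness, the check they describe amounts to a string search over the training text. Here is a minimal sketch of that kind of test; the corpus path, the problem distribution, and the exact search pattern are illustrative stand-ins, not the paper’s actual procedure:)

```python
import random
import re

def count_matches(problems, corpus_path):
    """Count how many test problems occur verbatim in the training text."""
    with open(corpus_path, encoding="utf-8") as f:
        corpus = f.read()
    return sum(
        1
        for a, b in problems
        if re.search(rf"\b{a}\s*\+\s*{b}\s*=", corpus)
    )

# 2,000 random 3-digit addition problems, mirroring the scale of the paper's check.
random.seed(0)
problems = [(random.randint(100, 999), random.randint(100, 999)) for _ in range(2000)]
# count_matches(problems, "training_corpus.txt")  # "training_corpus.txt" is hypothetical
```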
One critique is that GPT-3 still takes far too long to “identify” a task: why does it need 50 examples of addition in order to figure out that what it should do is addition? Why isn’t 1 sufficient? It’s not like there are a bunch of other conceptions of “addition” that need to be disambiguated.
See Gwern’s comment.

To update on this: I think that actually, there are a lot of ‘additions’ which are ambiguous in GPT-3, because of the bizarreness of the BPE representation of numbers. I’ve discussed how empirically arithmetic & other tasks improve when avoiding BPEs, but perhaps more useful is to look at the task of ‘addition’ on BPEs directly. Nostalgebraist helpfully provides some tables of BPEs/numbers: https://nostalgebraist.tumblr.com/post/620663843893493761/bpe-blues
Check it out: 2500 is the four-digit chunk ‘2500’. 3500 is the digits ‘35’ followed by the digits ‘00’. And 4500 is the digit ‘4’ followed by the digits ‘500’.
As we head further into 4-digit numerals, we start seeing 3-chunk ones eventually. The first 3-chunk numeral is (place your bets…) 4761 = “ 4” + “76” + “1” (did you guess it?). The next is 4791, then 4861, 4862, 4863, then 4881, and so on in another inscrutable integer sequence.
Unlike 2-chunking, though, 3-chunking is consistent about where to split. It’s always first digit + middle two + last digit. This holds across the whole range from 4761, the first 4-digit / 3-chunk number, to 9984, the last 4-digit / 3-chunk number. Among 4-digit numbers overall, 2.0% are 1 chunk, 95.7% are 2 chunks, and 2.4% are 3 chunks.
… got that?
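One can check the chunking directly. A quick sketch, assuming the tiktoken library (its ‘gpt2’ encoding is, to my knowledge, the same BPE vocabulary GPT-3 reuses from GPT-2); the last line computes the chunk-count distribution, which can be compared against the percentages quoted above:

```python
from collections import Counter

import tiktoken  # assumed available; "gpt2" is the BPE vocabulary GPT-2/GPT-3 use

enc = tiktoken.get_encoding("gpt2")

def chunks(n):
    """Return the BPE chunks of numeral n, with a leading space as in running text."""
    return [enc.decode([tok]) for tok in enc.encode(f" {n}")]

for n in (2500, 3500, 4500, 4761):
    print(n, chunks(n))

# Distribution of chunk counts across all 4-digit numerals.
dist = Counter(len(chunks(n)) for n in range(1000, 10000))
print({k: f"{100 * v / 9000:.1f}%" for k, v in sorted(dist.items())})
```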
GPT-3 isn’t learning ‘arithmetic’, it’s learning ‘arithmetic on 2-chunking’, ‘arithmetic on 2-chunking when overflowing to 3-chunking’, ‘arithmetic on 3-chunking’, ‘arithmetic on 3-chunking overflowing to 1-3 chunks’...
Seems like a plausible argument for having > 1 example, but surely 5 would be enough. Why does it keep improving after 5?
The space of all possible algorithms one could run on three-digit-addition strings like “218+375” seems rather vast. Could it be that what GPT-3 is doing is something like:
1. generating a large bunch of candidate algorithms, and
2. estimating the likelihoods of those algorithms given the examples, and
3. doing something like a noisy/weak Bayesian update, and
4. executing one of the higher-posterior algorithms, or some “fuzzy combination” of them?
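Concretely, the kind of thing I have in mind might look like this toy sketch; the candidate set, the priors, and the noise model are all invented for illustration:

```python
import math

# Toy hypothesis space of string-manipulation "algorithms" (all invented here).
def true_add(a, b):
    return str(a + b)

def concat(a, b):
    return str(a) + str(b)

def no_carry_add(a, b):
    # digit-wise addition that "forgets to carry the 1"
    da, db = str(a).zfill(3), str(b).zfill(3)
    return str(int("".join(str((int(x) + int(y)) % 10) for x, y in zip(da, db))))

def subtract(a, b):
    return str(a - b)

# (candidate, prior): a crude stand-in for a simplicity prior.
candidates = [(true_add, 0.3), (concat, 0.3), (no_carry_add, 0.2), (subtract, 0.2)]

# Few-shot examples: (operand, operand, observed completion).
examples = [(218, 375, "593"), (101, 202, "303"), (450, 270, "720")]

def posterior(candidates, examples, noise=0.1):
    """Noisy 'Bayesian' update: a matching example multiplies in (1 - noise),
    a mismatch multiplies in noise, so no candidate is ever fully ruled out."""
    scores = [
        prior * math.prod((1 - noise) if fn(a, b) == out else noise
                          for a, b, out in examples)
        for fn, prior in candidates
    ]
    total = sum(scores)
    return [s / total for s in scores]

for (fn, _), p in zip(candidates, posterior(candidates, examples)):
    print(f"{fn.__name__:>13}: {p:.4f}")
```

Each additional consistent example sharpens the posterior further, which would at least be consistent with performance continuing to improve well past the first few examples.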
Obviously this is just wild, vague speculation; but to me it intuitively seems like it would at least sort of answer your question. What do you think? (Could GPT-3 be doing something like the above?)
(To a human, it might feel like [the actual algorithm for addition] is a glaringly obvious candidate. But, on something like a noisy simplicity prior over all possible string-manipulation algorithms, [the actual algorithm for addition] maybe starts looking like just one of the more conspicuous needles in a haystack?)
That seems far too structured to me—I seriously doubt GPT-3 is doing anything like “generating a large bunch of candidate algorithms”, though maybe it has learned heuristics that approximate this sort of computation.