[Question] Is GPT-3 already sample-efficient?

Daniel Kokotajlo6 Oct 2021 13:38 UTC

36 points

(Concrete, easy-to-answer question below, explanation first)

Common adage: Modern deep learning techniques are sample-inefficient; it takes loads of data for them to learn things. If you pre-train them, it takes less additional data for them to learn something new, but still compared to humans it takes a lot.

Elsewhere, based on papers like this and this, various people have extrapolated the following takes:

--It seems like bigger neural nets need to see less data to reach the same level of performance.

--It seems like bigger neural nets need fewer epochs to reach convergence. Soon they’ll only need to see each data point once. (Search this for “multiple epochs”)

I feel like this take is in tension with the common adage. I wonder: If there is a fact mentioned in GPT-3′s training data, how many times does it need to be mentioned before GPT-3 comes to know that fact? For example, I’m told that GPT-3 knows the names of most prominent members of the rationalist community. How many times has it seen each name? Are we talking ten times, or ten thousand?*

I’d be interested to hear people do a bit of a search for the “most sample-efficient/obscure fact” in GPT-3′s repertoire. In this manner we could quantity how many times GPT-3 needs to see something before it learns it. (Maybe we don’t have access to the dataset used to train GPT-3. But people at Eleuther.ai have The Pile, right? And they’ve trained big transformers on it? We could answer the question easily and precisely there, no?)

Or am I thinking about this all wrong somehow? This seems like an obvious idea, I wonder why I haven’t heard of it before.

*Suppose it is ten thousand. Then that means one in every ten million two-word strings on the internet is “Paul Christiano.” (The dataset for GPT-3 was 300B tokens) Add in all the other rationalists/EAs and probably it means one in every hundred thousand words is the name of some prominent rationalist/EA. Surely this is too much, no? It seems way too much according to Google Ngram Viewer.

What links here?

Daniel Kokotajlo6 Oct 2021 13:38 UTC

36 points

32 comments1 min readLW link

GPT AI

tin482 6 Oct 2021 20:02 UTC
18 points
0
The most relevant paper I know of comes out of data privacy concerns. See Extracting Training Data from Large Language Models, which defines “k-eidetic memorization” as a string that can be elicited by some prompt and appears in at most k documents in the training set. They find several examples of k=1 memorization, though the strings appear repeatedly in the source documents. Unfortunately their methodology is targeted towards high-entropy strings and so is not universal.
I have a related question I’ve been trying to operationalize. How well do GPT-3′s memories “generalize”? In other words, given some fact in the training data, how far out of the source distribution can GPT-3 “gain information” from that fact?
E.g. training: “Ixlthubs live in the water.” Test: does this affect the predicted likelihood of “Ixlthubs live in the Pacific”? What about “Ixlthubs cannot survive on land”? I’d consider this another interesting measure of sample efficiency/generalization performance. I’m attempting to put together a proposal for the BigScience project (some set of synthetic facts to sprinkle throughout the data), but it’s my first try at something like this and slow going.
- Daniel Kokotajlo 6 Oct 2021 20:35 UTC
  3 points
  0
  Parent
  This is great, thanks! Then I wonder what people mean, exactly, when they say current methods are sample-inefficient. k=1 memorization seems to be about as good as humans, and this with tiny artificial neural nets! (Even GPT-3 is a thousand times smaller than a human brain).
  Your question is super interesting as well. If you make progress on answering it, I’d love to hear!
Daniel Kokotajlo 6 Oct 2021 13:47 UTC
12 points
0
First pass at trying to answer:
I’m asking GPT-3 questions of the form “Who is X?” to see what it knows. It knows EY, Paul, Katja, Julia, Wei Dai, Kaj Sotala… It thinks Daniel Kokotajlo is a filmmaker, which is true actually (there are two of us in the world, and the more well-known one is the filmmaker). It thinks Evan Hubinger is a software engineer.
In parallel I’m googling those names in quotes to see how many hits they get. To my surprise there is about ten thousand hits for many of these names, the more popular ones get more. But GPT-3′s training data didn’t contain the whole internet, right? Just a fraction of it? So presumably it had only one thousand, or one hundred, instances of each name to learn from?
- Sammy Martin 6 Oct 2021 14:16 UTC
  23 points
  0
  Parent
  Slight subtlety—GPT-3 might have a bias in its training data towards things related to AI and things of interest to the internet (maybe they scraped a lot of forums as well as just google). I picked some random names from non-western countries—for example, this Estonian politician gets 33,000 hits on Google and wasn’t recognised by GPT-3. It thought he was a software developer (though from Estonia). Might mean that if you’re estimating sample efficiency from Google search hits on people involved with AI, you’ll end up overestimating sample efficiency.
- Kaj_Sotala 25 Oct 2021 20:17 UTC
  2 points
  0
  Parent
  What did it say about me? :D I think I tried asking the AI Dungeon version about me at some point but apparently the adventure game finetuning had made that knowledge inaccessible.
  - Daniel Kokotajlo 25 Oct 2021 22:01 UTC
    4 points
    0
    Parent
    I don’t remember what it said the first time, but I just asked it now:
    Q: Who is Kaj Sotala?
    A: Kaj Sotala is a philosopher and writer.
    Q: Where does Kaj Sotala write?
    A: Kaj Sotala writes at the Center for Applied Rationality.
    - Kaj_Sotala 26 Oct 2021 13:37 UTC
      4 points
      0
      Parent
      Thanks!
      
      LW is sponsored by CFAR so this is kind of correct if you squint a bit
      - Daniel Kokotajlo 27 Oct 2021 16:47 UTC
        4 points
        0
        Parent
        Yeah, I’m counting things as correct if it gets in the right ballpark. Like, I myself didn’t know where you worked exactly, but CFAR sounded plausible, especially as a place you may have worked in the past. The fact that GPT-3 said you work at CFAR means it thinks you are part of the rationalist community, which is pretty impressive IMO.

RyanCarey 7 Oct 2021 10:35 UTC

8 points

I think this becomes a lot clearer if we distinguish between total and marginal thinking. GPT-3′s total sample efficiency for predicting text is poor:

To learn to predict text, GPT-3 has to read >1000x as much text as a human can learn in their lifetime.
To learn to win at go, AlphaGo has to play >100x times as many games as a human could play in their lifetime.

But on-the-margin, it’s very sample efficient at learning to perform new text-related tasks:

GPT-3 can learn to perform a new text-related task as easily as a human can.

Essentially, what’s happened is GPT-3 is a kind-of mega-analytical-engine that was really sample inefficient to train up to its current level, but that can now be trained to do additional stuff at relatively little extra cost.

Does that resolve the sense of confusion/mystery, or is there more to it that I’m missing?

What links here?

RyanCarey's comment on Forecasting transformative AI: the “biological anchors” method in a nutshell by Holden Karnofsky (EA Forum; 7 Oct 2021 10:36 UTC; 4 points)

Daniel Kokotajlo 7 Oct 2021 11:02 UTC

10 points

Parent

That does help, thanks. However, now that I understand better what people are saying, I think it’s wrong:

The comparison they are making is as follows:

GPT-3	Human
Pre-trained on 3x10^11 tokens of text	Pre-trained on 3x10^8 tokens of text (fermi estimate based on WMP 300 so maybe 500 tokens per minute, 10 hours per week reading, 52 weeks a year, over 20 years of life)
Able to read a new fact once or twice and then learn it / remember it.	Able to read a new fact once or twice and then learn it / remember it

However, I think this is a bad comparison, because it ignores everything else in the human life that the human has learned from / been pre-trained on. A better comparison would be:

GPT-3	Human
Pre-trained on 3x10^11 tokens of text	Pre-trained on 3x10^8 tokens of text as well as 1.5x10^8 tokens of spoken language and 10^17 pixels of video (conservative estimate assuming about 10 frames per second over 20 year lifespan using at-a-glance pixel count of human vision, ETA: Steve says it’s an order of magnitude less than that if you look at the compressed information the retina sends to the brain, so 10^16) To put it another way: A 20-yr-old human has 10^10 tenths-of-a-second of experience to learn from.
Able to read a new fact once or twice and then learn it / remember it.	Able to read a new fact once or twice and then learn it / remember it

In light of this comparison, which is more appropriate I think, it’s not even clear that humans are more sample-efficient than GPT-3! On balance it seems like they still probably are, but also note that they are 3 OOMs bigger than GPT-3 and we have already established that larger neural nets are more sample-efficient. So… for all we know, the mild human advantage in sample-efficiency could be mainly coming from the increased size.

Strong opinions loosely held. I don’t actually trust this reasoning well enough to put my weight on it. I’m just putting it out there to see what people think.

RyanCarey 7 Oct 2021 12:33 UTC
7 points
0
Parent
Your comparison does a disservice to the human’s sample efficiency in two ways:
1. You’re counting diverse data in the human’s environment, but you’re not comparing their performance on diverse tasks. Human’s are obviously better than GPT3 at interactive tasks, walking around, etc. For either kind of fair comparison text data & task, or diverse data & task, the human has far superior sample efficiency.
2. “fancy learning techniques” don’t count as data. If the human can get mileage out of them, all the better for the human’s sample efficiency.
So you seem to have it backwards when you say that the comparison that everyone is making is the “bad” one.
- Daniel Kokotajlo 7 Oct 2021 12:56 UTC
  1 point
  0
  Parent
  Thanks. Hmmm. I agree with #2, and should edit to clarify. I meant “fancy learning techniques that we could also do with our AIs if we wanted,” but maybe I’ll just avoid that can of worms for now.
  For #1: We don’t know how well a human-sized artificial neural net would perform if it was trained on the quantity and variety of data that humans have. We haven’t done the experiment yet. However, my point is that for all we know it’s entirely possible that such a neural net would perform at about human level on all the tasks humans do. The people who are saying that modern neural nets are significantly less sample-efficient than humans are committed to denying this. (Or if they aren’t, then I don’t know what we are arguing about anymore?) They are committed to saying that we can extrapolate from e.g. GPT-3′s performance vs. training data to conclude that we’d need something trained a lot longer than a human (on similar-to-human-lifetime data) to reach human performance. One way they might run this argument is to point out that GPT-3 has already seen more text than any human ever. My reply is that if a human had seen as much text as GPT-3, and only text, nothing else they probably would have poor performance as well, certainly on every task that wasn’t a text-based task! Sorry for this oblique response to your point, if it is insufficient I can make a more direct one.
Steven Byrnes 7 Oct 2021 13:18 UTC
5 points
0
Parent
This paper estimates that the human retina conveys visual information to the rest of the brain at 1e7 bits/second. I haven’t read the paper though. It’s a bit tricky to compare that to pixels anyway, because I think the retina itself does some data compression. I guess we have 6 million cones, which would be ~2M of each type, so maybe vision-at-any-given-time is ballpark comparable to the information content in a 1 megapixel color image??
- Daniel Kokotajlo 7 Oct 2021 16:20 UTC
  2 points
  0
  Parent
  OK, nice. Edited to fix.

Jonathan_Graehl 6 Oct 2021 17:17 UTC
4 points
0
Perhaps GPT-3 has more parameters than are probably needed to roughly memorize its very large training data. This would be good since the data contains some low quality garbage, false claims, etc (can think of them as ‘noise’). I believe GPT-n are adding parameters faster than training data Here’s my summary of a paper that suggests this is the right move:
https://www.youtube.com/watch?v=OzGguadEHOU Microsoft guy Sebastian Bubeck talking about seemingly overparameterized neural models being necessary for learning (due to label noise?). Validation ‘early stopping’ of training duration or size scaling is a mistake. after you’re over some initial hump that would trigger validation early stopping, overfitting is ‘benign’ [already known, dubbed ‘double descent’]. As soon as you can defeat adversarial attacks then you’re probably using enough parameters. He (+intern) proves that in order to perfectly memorize the label-noised data set such that small perturbations in the noise don’t change predicted output, you need a much larger parameter set than the data set (perfectly memorizing the training data set should be possible within some constant factor of its size). He predicts that ImageNet (image labeling task) could benefit from 10-100 billion parameters instead of the current sub-1-billion.
(obviously GPT- are language models but they can be thought of as having an output which is the masked word or the sentence-before-or-after or whatever they’re using to train)
barn394 5 Dec 2021 0:57 UTC
3 points
0
You can get an idea of a pre-trained GPT-3′s sample efficiency from the GPT-3 fine-tuning API docs. The epoch parameter defaults to 4, and further up in the documentation they recommend fine-tuning with at least 500 examples for 1-2 epochs in the conditional setting (e.g. chatbots). Although training data is often repetitive (implying maybe 2-10x as many effective epochs?), it learns only seeing the data a few times. More evidence of sample efficiency going up with scale you can see in Figure 4.1 in this paper. Sample efficiency also goes up with the amount of data already seen (pre-training).
This suggests that at some scale and some amount of pre-training, we may enter the one-shot learning regime. Then there is no need for “long-range” tricks (RNNs, CNNs, attention) anymore. Instead, one can one-shot learn by backprop while doing the predictions within a relatively short time window.
- nostalgebraist 5 Dec 2021 1:35 UTC
  3 points
  0
  Parent
  I have not finetuned GPT-3, but I have done a lot of finetuning with GPT-J 6.1B, which is similar in scale and performance to GPT-3 “Curie.”
  In my experience, doing more than a single epoch is always harmful when finetuning GPT-J.
  I initially thought it was beneficial on one specific dataset, but that turned out to be the exception that proves the rule. I inspected per-token validation loss on that dataset over the course of training, and discovered that the train/val split was imperfect. Training beyond the first epoch only helped on text that had been accidentally duplicated between train and val, and was harmful elsewhere. In other words, it was “helpful” for exact memorization, but harmful for generalization.
  I have a wandb report here with some plots of this phenomenon. I’m still not sure whether it’s an indication of the sample efficiency associated with the ~6B scale, a quirk of GPT-J specifically, or (less plausibly) a quirk or bug in the codebase used to tune it.
  I did this work before OpenAI released their finetuning feature, and was surprised to find them defaulting to 4 epochs. Especially given that their feature has a relatively tiny maximum dataset size. My gut feeling is that 4 epochs is way too many, given a large model and only 2.5M tokens.
  - barn394 5 Dec 2021 12:01 UTC
    1 point
    0
    Parent
    If 4 is not simply a bad default, maybe they considered more data with a high inferential distance (foreign, non-natural/formal languages), which may require more epochs?
  - barn394 5 Dec 2021 14:47 UTC
    1 point
    0
    Parent
    I cannot access your wandb, btw. It seems to be private.
    - nostalgebraist 5 Dec 2021 16:27 UTC
      2 points
      0
      Parent
      Whoops, fixed.
Jonathan_Graehl 6 Oct 2021 17:18 UTC
2 points
0
In general a language model will ‘know’ the sentence related to the single occurrence of a rare name. I don’t think you learn much here if there are enough parameters available to support this memory.
Jader Martins 7 Oct 2021 4:25 UTC
1 point
0
The sample efficiency is not a formal claim, like, RL algorithms are claimed to be sample inefficient as only takes 10 games of Pacman to a human get good at it, but we can’t isolate this knowledge in human brain. The point a human learns to play Pacman it already learned many things, like GPT-3, and we don’t know what things contribute to playing Pacman, is it motor skills? spacial skills? Knowing all the skills that enable human to play Pacman in only ten games and passing this as a pre-training for the RL algorithm then training it to play Pacman would be a fair comparison of how sample efficient it is. The same applies for the names example, could we really measure how many times a human heard a name or maybe a similar name?

Vladimir_Nesov 6 Oct 2021 14:05 UTC
6 points
0
My guess is that the issue of sample efficiency results from equivocation between datasets used for training a model and datasets provided externally. What is the sample efficiency of AlphaZero? It’s as bad as anything else if we divide by the datasets generated by amplification, but it’s infinitely large if we divide by externally provided datasets, as there are none. The sample efficiency relevant for the cost of training includes the datasets generated by amplification, but in informal comparison with human performance the estimate is about how much the humans observed externally before attaining some level of performance, hence the equivocation.

Similarly if someone figures out amplification for language models (something like debate, but actually works), it can then train on the vastly larger (and better) datasets generated by the model itself, that’s only bootstrapped from the external dataset, and so its sample efficiency with respect to the external dataset is going to skyrocket (one issue is that the external dataset is already large, so it’s more about quality than quantity, but alternatively this form of training might be able to bootstrap from a much smaller external dataset). So the usual measure of sample efficiency doesn’t seem very informative about what’s possible with exactly the same learning algorithm after the amplification loop is closed.
What links here?
- Vladimir_Nesov's comment on The Extrapolation Problem by lsusr (10 Oct 2021 11:25 UTC; 3 points)
- Daniel Kokotajlo 6 Oct 2021 14:26 UTC
  2 points
  0
  Parent
  For context, I’m interested in questions like “If we had a big transformer that was being fine-tuned as a chatbot with millions of daily conversations, would it be up-to-date on the latest news of the day? What about local news? What about e.g. subculture drama? How often would people have to talk about something for it to be impressed in long-term memory?”
  - Vladimir_Nesov 6 Oct 2021 14:45 UTC
    3 points
    0
    Parent
    This sounds like something an appropriate amplification may well be able to help the model memorize, even for things mentioned only once, without changing the learning algorithm, at the cost of more training on the auxiliary data generated by the amplification (in this case probably with prompts from the external datasets that need to be combed for rare details).
    
    (I do understand that the question you are asking is about what happens without auxiliary data. I’m commenting on a way accounting for prompt engineering breaks estimates of potential performance of the same learning algorithm. It then becomes an issue of cost, not limitations of the algorithm, in a way that’s different from scaling laws.)
    - Daniel Kokotajlo 6 Oct 2021 15:05 UTC
      4 points
      0
      Parent
      Right on. That’s a good point. So really I guess the conclusion is: Compute is the bottleneck; an AI chatbot or whatever could totally learn random facts the very first time it encounters them, if you had things set up to amplify that data into some auxiliary dataset and then train on it. Costs a few orders of magnitude more compute perhaps, but gets the job done. Right? (And this could be automated & “smart” in the sense that the AI could decide what stuff to memorize/internalize, what stuff to forget, and what stuff to add to some software database.)
      - Vladimir_Nesov 6 Oct 2021 15:36 UTC
        4 points
        0
        Parent
        Right. Of course if the sample efficiency of learning improves, the cost goes down, but that’s not really crucial for anything. The learning part of AGI is already essentially solved, it just needs to be put into a place where it’s getting fed the right data.
Chris_Leong 7 Oct 2021 5:11 UTC
4 points
0
“For example, I’m told that GPT-3 knows the names of most prominent members of the rationalist community”—Can you say more about this?
- Daniel Kokotajlo 7 Oct 2021 8:25 UTC
  14 points
  0
  Parent
  Oliver Habryka once told me that he uses GPT-3 to help him create invite lists for events. E.g. “The LessWrong community organized a celebration of Petrov Day. They invited all the prominent Rationalist/EA-adjacent people in the Bay area. Here is the list of people they invited: [Insert list of people he’s thought of so far] [GPT-3 continues the list]” Habryka said it’s helped him avoid accidentally forgetting people.
ESRogs 6 Oct 2021 22:15 UTC
4 points
0
There’s a typo in the title:
Is GPT-3 is...
- Daniel Kokotajlo 6 Oct 2021 22:49 UTC
  4 points
  0
  Parent
  thanks!
leogao 8 Oct 2021 4:22 UTC
3 points
0
If anyone wants to try this with the Pile, you can download a copy of the Pile here and try GPT-J (6B, which is a lot less than GPT3′s 175B) here (hosted) or through HF transformers (locally). If you run into any problems you can DM me or ask on the EleutherAI discord.