Maybe I am misunderstanding something, but to me it is very intuitive that there is a big jump from the embedding output to the first transformer block output. The embedding is trained by backpropagation from the same next-word loss, so it makes sense to see all intermediate representations as representations of the prediction we are trying to make, i.e. of the next word.
But the embedding is a prediction of the next word based on only a single word: the word being embedded. So that prediction is by necessity very bad (BPE ensures this, IIUC, because tokens that would always follow one another get merged into a single token).
The first transformer block integrates hundreds of words of context into the prediction; that’s where the big jump comes from.
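To make that concrete, here is a minimal sketch of what I mean (my own, not from the post): decode the bare embedding of a single token straight through the output head and see what "next word" it encodes. I'm using the HuggingFace GPT-2 implementation; `transformer.wte`, `transformer.ln_f` and `lm_head` are how that library exposes the embedding, final layer norm and output head, and projecting through `ln_f` then `lm_head` is my assumption about how the post's decoding trick works:

```python
# Sketch: decode the raw embedding of one token straight through GPT-2's output
# head, skipping all transformer blocks, to see the context-free "prediction" it encodes.
# Attribute names follow the HuggingFace transformers GPT-2 implementation.
import torch
from transformers import GPT2Tokenizer, GPT2LMHeadModel

tok = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2").eval()

ids = tok(" the", return_tensors="pt").input_ids         # a single token, no context at all
with torch.no_grad():
    emb = model.transformer.wte(ids)                      # embedding output only, no blocks
    logits = model.lm_head(model.transformer.ln_f(emb))   # project straight to the vocabulary
print(tok.decode([int(logits[0, -1].argmax())]))          # the bigram-ish guess this encodes
```

Whatever token that prints, the point is that it can only be a context-free guess; the first block is the earliest place where attention can mix in the rest of the prompt.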
Great post! I had trouble wrapping my head around the "inconsistency" in the first paper, but now I think I get it. TL;DR in my own words:
There are three regimes of increasing information uptake, ordered by how cheap they are in terms of compute (rough numbers in the sketch after the list):
- Increasing sampling efficiency by increasing model size —> this runs into diminishing returns because sampling efficiency has a hard upper bound. —> context window increase?
- Accessing more information by training over more unique samples —> will run into diminishing returns when unique data runs out. —> multi-modal data?
- Extracting more information by running over the same samples several times —> this intuitively crashes sampling efficiency because you can only learn the information not already extracted in earlier passes. —> prime candidate for active learning?
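A rough numerical sketch of how I read the first paper's combined fit L(N, D), just to see regimes 1 and 2 in numbers. The exponents and constants below are quoted from memory and may be off; only the functional form matters for the argument:

```python
# Sketch of the combined scaling-law fit L(N, D) from the first paper, with
# constants quoted from memory (treat them as approximate, not authoritative).
ALPHA_N, ALPHA_D = 0.076, 0.095      # model-size and data-size exponents
N_C, D_C = 8.8e13, 5.4e13            # critical parameter count / token count

def loss(n_params, n_tokens):
    """Approximate test loss in nats/token for a model of n_params trained on n_tokens."""
    return ((N_C / n_params) ** (ALPHA_N / ALPHA_D) + D_C / n_tokens) ** ALPHA_D

# Regime 1: fixed data budget, growing model. Returns diminish once the data term dominates.
for n_params in (1e9, 1e10, 1e11, 1e12):
    print(f"N={n_params:.0e}, D=3e11 tokens -> L ~ {loss(n_params, 3e11):.2f}")
```

The same function shows regime 2: hold N fixed and grow the token count, and the loss flattens out once the model-size term dominates instead.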
I had also missed the implication of the figure in the second paper that shows that GPT-3 is already very close to optimal sampling efficiency. So it seems that pure text models will only see another order of magnitude increase in parameters or so.
If you are looking for inspiration for another post about this topic: Gwern mentions the human level of language modeling, and Steve Omohundro also alludes to the loss that would signify human level, but I don’t really understand either the math or where the numbers come from. It would be very interesting to me to see an explanation of the "human level loss" to put the scaling laws in perspective. Of course I assume that a "human level" LM would have very different strengths and weaknesses compared to a human, but still.