That would be interesting if true. I thought that pipelining doesn’t help with latency. Can you expand?
Generically, pipelining increases throughput without lowering latency. Say you want to compute f(x) where f is a NN. Each stage of your pipeline processes e.g. one of the NN layers. Then stage N has to wait for all earlier stages to finish before it can compute the output of layer N. That's why the latency to compute f(x) stays high.
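To make this concrete, here's a toy back-of-the-envelope sketch (my own illustrative numbers, assuming a fixed per-stage time and ignoring communication overhead): the first output still has to traverse every stage, so per-request latency is unchanged, but once the pipeline is full a new output emerges every stage-time.

```python
# Toy model of pipeline parallelism: num_stages stages, each taking
# stage_time seconds per input. Purely illustrative, no real NN.

def pipeline_stats(num_stages, stage_time, num_inputs):
    # The first input must pass through every stage before any
    # output appears, so latency is the full pipeline depth.
    latency = num_stages * stage_time
    # In steady state, one output emerges per stage_time.
    total_time = latency + (num_inputs - 1) * stage_time
    return latency, num_inputs / total_time

def sequential_stats(num_stages, stage_time, num_inputs):
    # One device runs all stages for one input at a time.
    latency = num_stages * stage_time
    return latency, num_inputs / (num_inputs * latency)

lat_p, thr_p = pipeline_stats(num_stages=4, stage_time=0.1, num_inputs=1000)
lat_s, thr_s = sequential_stats(num_stages=4, stage_time=0.1, num_inputs=1000)
# Per-request latency is identical (0.4 s) in both cases, but pipelined
# throughput approaches 4x the sequential rate as num_inputs grows.
```

So with a large batch the 4-stage pipeline nearly quadruples throughput, while a single request sees no speedup at all, which is the training-vs-inference asymmetry at issue here.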
NB, GPT-3 used pipelining for training (in combination with model and data parallelism), and the large GPT-3 still has higher latency than the smaller models in the OA API.
Perhaps what you meant is that latency will be high, but that this isn't a problem as long as you have high throughput. That's basically true for training. But this post is about inference, where latency matters a lot more.
(It depends on the application of course, but the ZeRO Infinity approach can make your model so slow that you don’t want to interact with it in real time, even at GPT-3 scale)