Reasons compute may not drive AI capabilities growth

How long it will be before humanity is capable of creating general AI is an important factor in discussions of the importance of doing AI alignment research, as well as discussions of which research avenues have the best chance of success. One frequently discussed model for estimating AI timelines is that AI capabilities progress is essentially driven by growing compute capabilities. For example, the OpenAI article on AI and Compute presents a compelling narrative, showing a trend of well-known results in machine learning using exponentially more compute over time. This is an interesting model because, if valid, we can do some quantitative forecasting, due to somewhat smooth trends in compute metrics which can be extrapolated. However, I think there are a number of reasons to suspect that AI progress is driven more by engineer and researcher effort than by compute.

I think there's a spectrum of models between:

  • We have an abundance of ideas that aren't worth the investment to try out yet. Advances in compute capability unlock progress by making it economically feasible to research more expensive techniques. We'll be able to create general AI soon after we have enough compute to do it.

  • Research proceeds at its own pace and makes use of as much compute as is convenient, to save researcher time on optimization and achieve flashy results. We'll be able to create general AI once we come up with all the right ideas behind it, and either:

    • We'll already have enough compute to do it

    • We won't have enough compute and we'll start optimizing, invest more in compute, and possibly start truly being bottlenecked on compute progress.

My research hasn't pointed too solidly in either direction, but below I discuss a number of the reasons I've thought of that might point towards compute not being a significant driver of progress right now.

There are many ways to train more efficiently that aren't widely used

Starting in October 2017, the Stanford DAWNBench contest challenged teams to come up with the fastest and cheapest ways to train neural nets to solve certain tasks.

The most interesting was the ImageNet training time contest. The baseline entry took 10 days and cost $1112; less than one year later the best entries (all by the fast.ai team) were down to 18 minutes for $35, 19 minutes for $18, or 30 minutes for $14[^1]. This is ~800x faster and ~80x cheaper than the baseline.

Some of this was just using more and better hardware: the winning team used 128 V100 GPUs for 18 minutes and 64 for 19 minutes, versus eight K80 GPUs for the baseline. However, substantial improvements were made even on the same hardware. The training time on a p3.16xlarge AWS instance with eight V100 GPUs went down from 15 hours to 3 hours in 4 months. The training time on a single Google Cloud TPU went down from 12 hours to 3 hours as the Google Brain team tuned their training and incorporated ideas from the fast.ai team. An even larger improvement was seen recently on the CIFAR10 contest, with times on a p3.2xlarge improving by 60x; the accompanying blog series still mentions multiple improvements left on the table due to effort constraints, and its author speculates that many of the optimizations would also improve the ImageNet times.

The main techniques used for fast training were all known techniques: progressive resizing, mixed precision training, removing weight decay from batchnorms, scaling up batch size in the middle of training, and gradually warming up the learning rate. They just required engineering effort to implement and weren't already implemented in the library defaults.
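
To make one of these concrete, here's a minimal PyTorch sketch (my own illustration, not the DAWNBench code) of removing weight decay from batchnorm and bias parameters by putting them in a separate parameter group. The model and hyperparameter values are placeholders:

```python
import torch
import torch.nn as nn

# Placeholder model, just to have some batchnorm and conv parameters to sort.
model = nn.Sequential(nn.Conv2d(3, 16, 3), nn.BatchNorm2d(16), nn.ReLU())

decay, no_decay = [], []
for p in model.parameters():
    # Batchnorm scales/shifts and biases are 1-D tensors; give them no weight decay.
    (no_decay if p.ndim == 1 else decay).append(p)

optimizer = torch.optim.SGD(
    [{"params": decay, "weight_decay": 5e-4},
     {"params": no_decay, "weight_decay": 0.0}],
    lr=0.1, momentum=0.9)
```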

Similarly, the improvement due to scaling from eight K80s to many machines with V100s was partially hardware but also required lots of engineering effort to implement: using mixed precision fp16 training (required to take advantage of the V100 Tensor Cores), efficiently using the network to transfer data, implementing the techniques required for large batch sizes, and writing software for supervising clusters of AWS spot instances.
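
For illustration, here's roughly what mixed precision training looks like using PyTorch's current automatic mixed precision utilities. This is a sketch of the general technique, not the code those teams actually wrote (they had to manage fp32 master weights and loss scaling by hand):

```python
import torch

# Loss scaler that keeps fp16 gradients from underflowing.
scaler = torch.cuda.amp.GradScaler()

def train_step(model, optimizer, loss_fn, x, y):
    optimizer.zero_grad()
    with torch.cuda.amp.autocast():   # run the forward pass in fp16 where it's safe
        loss = loss_fn(model(x), y)
    scaler.scale(loss).backward()     # backprop the scaled loss
    scaler.step(optimizer)            # unscale gradients and update fp32 master weights
    scaler.update()                   # adjust the loss scale for the next step
```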

These results seem to show that it's possible to train much faster and cheaper by applying knowledge and sufficient engineering effort. Interestingly, not even a team at Google Brain working to show off TPUs initially had all the code and knowledge required to get the best available performance, and had to gradually work for it.

I would suspect that in a world where we were bottlenecked hard on training times, these techniques would be more widely known and applied, with implementations of them readily available in every major machine learning library. Interestingly, in postscripts to both of his articles on how fast.ai managed to achieve such fast times, Jeremy Howard notes that he doesn't believe large amounts of compute are required for important ML research, and that many foundational discoveries were made with little compute.

[^1]: Using spot/preemptible instance pricing instead of the on-demand pricing the benchmark page lists, due to the much lower prices and the lack of need for on-demand instances given the short training times. The authors of the winning solution wrote software to effectively use spot instances and actually used them for their tests. It may seem unfair to use spot prices for the winning solution but not for the baseline, but a lot of the improvement in the contest came from actually using all the available techniques for faster/cheaper training despite the inconvenience: they had to write software to easily use spot instances, and their training times were short enough that spot instances were viable without fancy software to automatically transfer training to new machines.

Hyperparameter grid searches are inefficient

I've heard hyperparameter grid searches mentioned as a reason why ML research needs way more compute than it would appear based on the training time of the models used. However, I can also see the use of grid searches as evidence of an abundance of compute rather than a scarcity.

As far as I can tell it's possible to find hyperparameters much more efficiently than with a grid search, it just takes more human time and engineering implementation effort. There's a large literature of more efficient hyperparameter search methods, but as far as I can tell they aren't very popular (I've never heard of anyone using one in practice, and all the open source implementations of these kinds of things I can find have few GitHub stars).
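
As a toy illustration of why grid search is a weak baseline: with the same budget of 16 trials, a grid only ever tries 4 distinct values of each hyperparameter, while a random search tries 16 distinct values of each. The search ranges and budget below are made up:

```python
import random

# Grid search: 4 learning rates x 4 weight decays = 16 trials,
# but only 4 distinct values of each hyperparameter get tested.
grid_trials = [(lr, wd)
               for lr in (1e-4, 1e-3, 1e-2, 1e-1)
               for wd in (1e-5, 1e-4, 1e-3, 1e-2)]

# Random search: the same 16 trials, but 16 distinct values of each
# hyperparameter, sampled log-uniformly over the same ranges.
random_trials = [(10 ** random.uniform(-4, -1),    # learning rate
                  10 ** random.uniform(-5, -2))    # weight decay
                 for _ in range(16)]
```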

Researcher Leslie Smith also has a number of papers with little-used ideas on principled approaches to choosing and searching for optimal hyperparameters with much less effort, including a fast automatic procedure for finding optimal learning rates. This suggests that it's possible to trade hyperparameter search time for more engineering, human decision-making and research effort.
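
Here's a rough sketch of the idea behind a learning-rate range test: run a short training pass while exponentially increasing the learning rate, record the loss, and pick a rate somewhat below where the loss blows up. This is my paraphrase of the general procedure, not Smith's exact algorithm:

```python
import math
import torch

def lr_range_test(model, optimizer, loss_fn, data_iter,
                  lr_min=1e-7, lr_max=1.0, steps=100):
    # In practice you'd run this on a throwaway copy of the model,
    # since the weights get modified along the way.
    lrs, losses = [], []
    for step in range(steps):
        # Exponentially sweep the learning rate from lr_min to lr_max.
        lr = lr_min * (lr_max / lr_min) ** (step / (steps - 1))
        for group in optimizer.param_groups:
            group["lr"] = lr
        x, y = next(data_iter)
        optimizer.zero_grad()
        loss = loss_fn(model(x), y)
        loss.backward()
        optimizer.step()
        lrs.append(lr)
        losses.append(loss.item())
        if math.isnan(losses[-1]) or losses[-1] > 4 * min(losses):
            break  # stop once the loss diverges
    return lrs, losses
```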

There's also likely room for improvement in how we factor the hyperparameters we use, so that they're more amenable to separate optimization. For example, L2 regularization is usually used in place of weight decay because the two are theoretically equivalent, but this paper points out that with Adam they are not: using true weight decay causes Adam to surpass the more popular SGD with momentum in practice, and weight decay is also the better-behaved hyperparameter, since the optimal weight decay is more independent of the learning rate than the optimal L2 regularization strength is.
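
As a sketch of the distinction, with a toy model and made-up hyperparameter values: adding an L2 penalty to the loss lets Adam's adaptive per-parameter scaling distort the effective regularization, while decoupled weight decay shrinks the weights directly in the update step (recent PyTorch versions ship this as AdamW):

```python
import torch
import torch.nn as nn

# Toy model, just to have parameters to regularize.
model = nn.Linear(10, 1)

# Option 1: L2 regularization folded into the loss. This is effectively what
# the `weight_decay` argument of plain Adam does through the gradients, and
# Adam's adaptive scaling then changes the effective strength per parameter.
def loss_with_l2(loss, params, l2=1e-4):
    return loss + l2 * sum((p ** 2).sum() for p in params)

# Option 2: decoupled weight decay, applied directly to the weights.
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3, weight_decay=1e-2)
```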

All of this suggests that most researchers might be operating with an abundance of cheap compute relative to their problems, which leads them not to invest the effort required to optimize their hyperparameters more efficiently, and instead to do so haphazardly or with grid searches.

The types of compute we need may not improve very quickly

Improvements in computing hardware are not uniform, and there are many different hardware attributes that can be bottlenecks for different things. AI progress may rely on one or more of these that don't end up improving quickly, becoming bottlenecked on the slowest one rather than experiencing exponential growth.

Machine learning accelerators

Modern machine learning is largely composed of large operations that are either directly matrix multiplies or can be decomposed into them. It's also possible to train using much lower precision than full 32-bit floating point using some tricks. This allows the creation of specialized training hardware like Google's TPUs and Nvidia Tensor Cores. A number of other companies have also announced they're working on custom accelerators.

The first generation of specialized hardware delivered a large one-time improvement, but we can also expect continuing innovation in accelerator architecture. There will likely be sustained innovations in training with different number formats and architectural optimizations for faster and cheaper training. I expect this is the area where our compute capability will grow the most, but it may flatten like CPUs have once we figure out enough of the easily discoverable improvements.

CPUs

Reinforcement learning simulations like the OpenAI Five DOTA bot, and various physics playgrounds, often use CPU-heavy serial simulations. OpenAI Five uses 128,000 CPU cores and only 256 GPUs. At current Google Cloud preemptible prices the CPUs cost 5-10x more than the GPUs in total. Improvements in machine learning training ability will still leave the large cost of the CPUs. If the use of expensive simulations that run best on CPUs becomes an important part of training advanced agents, progress may become bottlenecked on CPU cost.
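
A back-of-the-envelope version of that comparison, using preemptible prices that are my rough assumptions rather than OpenAI's published numbers:

```python
# Illustrative preemptible prices; the exact figures are assumptions.
cpu_cores = 128_000
gpus = 256
price_per_core_hour = 0.01   # assumed preemptible price per vCPU-hour, USD
price_per_gpu_hour = 0.74    # assumed preemptible price per V100-hour, USD

cpu_cost = cpu_cores * price_per_core_hour   # ~$1,280 per hour
gpu_cost = gpus * price_per_gpu_hour         # ~$190 per hour
print(cpu_cost / gpu_cost)                   # roughly 7x, within the 5-10x range cited
```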

Additionally, improvement in CPU compute costs may be slowing. Cloud CPU costs only decreased 45% from 2012 to 2017, and performance per dollar for buying the hardware only improved 2x. Google Cloud Compute prices have only dropped 25% from 2014-2018, although the introduction of preemptible prices at 30% of full price in 2016 was a big improvement, and that decreased to 20% of full price in 2017.

GPU/accelerator memory

Another scarce resource is memory on the GPU/accelerator used for training. The memory must be large enough to store all the model parameters, the input, the gradients, and other optimization parameters.
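
As a rough sketch of why parameter count alone understates the requirement, here's a simplified estimate for a 100M-parameter model trained with Adam in fp32 (activation memory, which often dominates and depends heavily on architecture and batch size, is ignored here):

```python
# Simplified accounting of per-device training memory for a 100M-parameter model.
params = 100e6
bytes_per_float = 4

weights    = params * bytes_per_float       # the model itself
gradients  = params * bytes_per_float       # one gradient per parameter
adam_state = 2 * params * bytes_per_float   # Adam keeps two moment estimates per parameter

fixed_cost_gb = (weights + gradients + adam_state) / 1e9
print(fixed_cost_gb)  # ~1.6 GB before counting activations and inputs
```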

This is one of the most frequent limits I see referenced in machine learning papers nowadays. For example, the new large BERT language model can only be trained properly on TPUs with their 64GB of RAM. The Glow paper needs to use gradient checkpointing and an alternative to batchnorm so that they can use gradient accumulation, because only a single sample of gradients fits on a GPU.

However, there are ways to address this limitation that aren't frequently used. Glow already uses the two best ones, gradient checkpointing and gradient accumulation, but did not implement an optimization they mention that would make the amount of memory the model takes constant in the number of layers instead of linear, likely because it would be difficult to engineer into existing ML frameworks. The BERT implementation uses none of the techniques because they just use a TPU with enough memory; in fact, a reimplementation of BERT implemented 3 such techniques and got it to fit on a GPU. Thus it still seems that in a world with less RAM these results might still have happened, just with more difficulty or with smaller demonstration models.
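
For reference, gradient accumulation itself is simple; the engineering cost is mostly in fitting it cleanly into an existing training loop and framework. A minimal sketch, not Glow's or BERT's actual code:

```python
import torch

# Run several small "micro-batches" and only step the optimizer after their
# gradients have been summed, trading wall time for memory.
accumulation_steps = 8

def train_epoch(model, optimizer, loss_fn, loader):
    optimizer.zero_grad()
    for i, (x, y) in enumerate(loader):
        loss = loss_fn(model(x), y) / accumulation_steps  # average over the virtual batch
        loss.backward()                                   # gradients accumulate in .grad
        if (i + 1) % accumulation_steps == 0:
            optimizer.step()
            optimizer.zero_grad()
```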

Interestingly, the maximum available RAM per device barely changed from 2014 through 2017 with the NVIDIA K80's 24GB, but then shot up in 2018 to 48GB with the RTX 8000, as well as the 64GB TPU v2 and 128GB TPU v3. This is probably due both to demand for larger device memories for machine learning training and to the availability of high-capacity HBM memory. It's unclear to me if this rapid rise will continue or if it was mostly a one-time change reflecting new demands for the largest possible memories reaching the market.

It's also possible that per-device memory will cease to be a constraint on model size due to faster hardware interconnects that allow sharing a model across the memory of multiple devices, as Intel's Nervana and Tensorflow Mesh plan to do. It also seems likely that techniques for splitting models across devices to fit in memory, like the original AlexNet did, will become more popular. The fact that we don't split models across devices like AlexNet anymore may itself be evidence that we're not very constrained by RAM, but I'm not sure.
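
A minimal sketch of one simple way to split a model across two devices, layer-wise rather than the channel-wise split the original AlexNet used. The device names and layer sizes are placeholders, and it assumes two CUDA devices are available:

```python
import torch
import torch.nn as nn

class SplitModel(nn.Module):
    def __init__(self):
        super().__init__()
        # Each half of the model lives in a different device's memory.
        self.first = nn.Linear(1024, 4096).to("cuda:0")
        self.second = nn.Linear(4096, 10).to("cuda:1")

    def forward(self, x):
        x = torch.relu(self.first(x.to("cuda:0")))
        return self.second(x.to("cuda:1"))  # only activations cross between devices
```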

Limited ability to exploit parallelism

As discussed extensively in a new paper from Google Brain, there seems to be a limit on how much data parallelism, in the form of larger batch sizes, we can currently extract out of a given model. If this constraint isn't worked around, the wall time to train models could stall even if compute power continues to grow.

However, the paper mentions that various things like model architecture and regularization affect this limit, and I think it's pretty likely that techniques to increase the limit will continue to be discovered, so that it isn't a bottleneck. A newer paper by OpenAI finds that more difficult problems also tolerate larger batch sizes. Even if the limit remains, increasing compute would allow training more different models in parallel, potentially just meaning that more parameter search and evolution gets layered on top of the training. I also suspect that just using ever-larger models may allow use of more compute without increasing batch sizes.

At the moment, it seems that we know how to train effectively with batch sizes large enough to saturate large clusters, for example this paper about training ImageNet in 7 minutes with a 64k batch size. But this requires extra tuning and implementing some tricks, even just to train on mid-size clusters, so as far as I know only a small fraction of all machine learning researchers regularly train on large clusters (anecdotally; I'm uncertain about this).
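
One of the standard tricks such papers rely on is the linear scaling rule with warmup: scale the learning rate in proportion to the batch size and ramp up to it gradually. A sketch with placeholder values, not the recipe from that specific paper:

```python
# Linear scaling rule: scale the learning rate with the batch size,
# and warm up to the scaled rate over the first few epochs.
base_lr = 0.1
base_batch_size = 256
batch_size = 65536          # the "64k" batch size
warmup_epochs = 5

scaled_lr = base_lr * batch_size / base_batch_size   # 25.6 here

def lr_at(epoch):
    # Linearly warm up, then hold; a real schedule would also decay later.
    if epoch < warmup_epochs:
        return scaled_lr * (epoch + 1) / warmup_epochs
    return scaled_lr
```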

Conclusion

These all seem to point towards compute being abundant and ideas being the bottleneck, but not solidly. For the points about training efficiency and grid searches, this could just be an inefficiency in ML research, and all the major AGI progress will be made by a few well-funded teams at the boundaries of modern compute that have solved these problems internally.

Vaniver commented on a draft of this post that it's interesting to consider the case where training time is the bottleneck rather than ideas, but massive engineering effort is highly effective at reducing training time. In this case an increase in investment in AI research that leads to hiring more engineers to apply techniques to speed up training could lead to rapid progress. This world might also lead to more sizable differences in capabilities between organizations, if large, somewhat serial software engineering investments are required to make use of the most powerful techniques, rather than a well-funded newcomer being able to just read papers and buy all the necessary hardware.

The course of various compute hardware attributes seems uncertain, both in terms of how fast they'll progress and whether or not we'll need to rely on anything other than special-purpose accelerator speed. Since the problem is complex with many unknowns, I'm still highly uncertain, but all of these points did move me to varying degrees in the direction of continuing compute growth not being a driver of dramatic progress.

Thanks to Vaniver and Buck Shlegeris for discussions that led to some of the thoughts in this post.