Width-wise parallelism could help, but its communication cost scales unfavorably: it grows quadratically as we grow the NN’s width, and then quadratically again when we try to reduce latency by reducing the number of neurons per GPU.
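To make that concrete, here is a toy cost model (my own simplification with made-up numbers, not a real profiler run): assume each layer’s neurons are sharded k per GPU, every GPU must gather the rest of the layer’s inputs all-to-all, and per-message overhead dominates. Doubling the width, or halving the neurons per GPU, then roughly quadruples the per-layer message count:

```python
# Toy communication-cost model for width-wise sharding of one fully-connected layer.
# Assumptions (mine, purely illustrative): k neurons per GPU, all-to-all exchange of
# activations each layer, and per-message overhead dominating over raw bytes.

def comm_cost(width: int, neurons_per_gpu: int) -> dict:
    gpus = width // neurons_per_gpu
    messages = gpus * (gpus - 1)                       # all-to-all message count per layer
    activations_in = gpus * (width - neurons_per_gpu)  # activation values received, summed over GPUs
    return {"gpus": gpus, "messages": messages, "activations_received": activations_in}

base = comm_cost(width=4096, neurons_per_gpu=512)
wider = comm_cost(width=8192, neurons_per_gpu=512)    # 2x width        -> ~4x messages
faster = comm_cost(width=4096, neurons_per_gpu=256)   # 1/2 neurons/GPU -> ~4x messages

for name, cost in [("base", base), ("2x width", wider), ("1/2 neurons per GPU", faster)]:
    print(name, cost)
```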
Incidentally, the latency cost of width vs depth is something I’ve thought might explain why the brain/body allometric scaling laws are so unfavorable and what all that expensive brain matter does given that our tiny puny little ANNs seem capable of so much: everything with a meaningful biological brain, from ants to elephants, suffers from hard (fatal) latency requirements. You are simply not allowed by Nature or Darwin to take 5 seconds to compute how to move your legs.* (Why was Gato 1 so small and so unimpressive in many ways? Well, they kept it small because they wanted it to run in realtime for a real robot. A much wider Transformer could’ve still met the deadline… but cost a lot more parameters and training than usual by going off the optimal scaling curves.) It does not matter how many watts or neurons you save by using a deep skinny network, if after 10 layers have fired with another 100 to go to compute the next action to take, you’ve been eaten by a stupider but faster-thinking predator.
So a biological brain might be forced deep into an unfavorable point on the width-vs-depth tradeoff—which might be extremely expensive—in order to meet its subset of robotics-related deadlines, as it were.
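A back-of-the-envelope version of that tradeoff, assuming a round ~10 ms per serial biological processing stage and a made-up ~200 ms survival deadline (both numbers purely illustrative):

```python
# Toy depth-vs-latency arithmetic; the per-stage delay and the deadline are
# assumptions for illustration, not measured constants.
MS_PER_STAGE = 10   # assumed serial delay per biological processing stage
DEADLINE_MS = 200   # hypothetical "react before you are eaten" budget

for depth in (10, 50, 110):
    latency_ms = depth * MS_PER_STAGE
    verdict = "meets the deadline" if latency_ms <= DEADLINE_MS else "eaten"
    print(f"depth {depth:3d}: {latency_ms:5d} ms -> {verdict}")
```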
* With a striking counterexample, in both tininess of brain and largeness of latency, being Portia. What is particularly striking to me is not that it is so intelligent while being so tiny, but that this seems to be directly due to its particular ecological niche: there are very few creatures out there who need extremely flexible intelligent behavior but are also allowed minutes or hours to plan many of their actions… but Portia is one of them, as it is a stealthy predator attacking static prey. The prey also don’t generally have much memory, nor can they just leave their web, so a Portia can try again if the first trick didn’t work. So Portia spiders are allowed to do things like spend hours circumnavigating a web to strike a prey spider from the right direction, or gradually test out mimicry until they find the right cue to trick the prey spider. So it’s fascinating to see that in this highly unusual niche, it is possible for a tiny biological brain to execute extremely slow but intelligent strategies. It suggests that if latency were not a problem, biological brains could be far more intelligent, we would not need such architecturally-huge biological brains to reach human-level performance, and there would no longer be any paradox of why highly-optimized human brains seem to need so many parameters to do the same things as tiny ANNs.
Apropos of very low-latency LLMs and revisiting this topic a little: what does this imply about DRL robotics, rather than animals? Will DRL NNs have to have brains as big as a human’s in order to run superhuman humanoid robots?
One possible implication is that Portia-like NNs are possible for robotics in general. Robotics may be quite ‘easy’ in that sense.
It is striking that when we look at NN parameter/FLOPS-counts, we generally do not see ‘large’ robotics, vision, or sound models, but LLMs; the largest pure-vision models like PaLI-X are <100b parameters, the largest robotics models are usually <10b, with Gato 1’s ~1b having been, if anything, unusually large because of all the other stuff it was doing. (I’m very behind on the robotics literature, so maybe there are now much larger 100b-parameter models as they move into the ‘foundation model’ multi-modal/task scaling paradigm, but I’d bet that there are still none >1,000b.) Even sound/image/video generative models, which would be expected to be much larger than necessary for robotics tasks, are often still small enough to run on a single consumer GPU. And these are usually trained with scaling laws now, so these are compute-optimal sizes and it is not just that they are wildly under-parameterized (the way almost all models were pre-2020).
So, if robotics is intrinsically easy, but animals do not show this because their latency requirements force them into misleadingly expensive brains, the implication is that we can do robotics by lifting the limitations of biological brains: being forced to learn in realtime, in the real world, one animal at a time, without any sharing.
We should be able to train deep but small NNs in silico: turning all animal problems into Portia problems, if you will, pausing the simulation to let the NNs think & act for as long as necessary to plan the right action, only then letting time flow to see what happens, and resetting to try again.
We remove all burdens of wallclock time or caloric consumption or childhood development while training these teacher models into powerful general robotic controllers, and only then use them to optimize low-latency controllers. The wider low-latency student models will be easier to train when they simply must imitate the teacher in a supervised-learning setting instead of doing RL from scratch, so their final size should be a lot more favorable. (If nothing else, the student models can’t ‘die’ if they make a mistake like breaking a latency constraint, so this learning setting is far easier than an animal’s task.)
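A minimal sketch of that pipeline, in toy form (my own framing, with a trivial environment and a linear ‘student’ standing in for a real wide low-latency net):

```python
# Minimal sketch: a slow "teacher" deliberates in a paused simulator, then a fast
# low-latency "student" imitates it with ordinary supervised learning, no RL needed.
import random

class PausedSim:
    """Toy environment whose clock only advances on step(); between steps the
    policy may 'think' for as long as it likes -- the Portia trick."""
    def __init__(self):
        self.state = random.uniform(0.0, 2.0)
    def observe(self):
        return self.state
    def step(self, action):
        self.state += action          # wall-clock time passes only here

def slow_teacher(obs):
    """Stand-in for an expensive deep/serial planner: brute-force search over
    candidate actions with an unlimited thinking budget (the sim is paused)."""
    candidates = [a / 100 for a in range(-100, 101)]
    return min(candidates, key=lambda a: abs((obs + a) - 1.0))   # steer state toward 1.0

# Phase 1: the paused-sim teacher turns the control problem into a supervised dataset.
dataset = []
for _ in range(50):
    sim = PausedSim()
    for _ in range(4):
        obs = sim.observe()
        act = slow_teacher(obs)
        dataset.append((obs, act))
        sim.step(act)

# Phase 2: fit a one-multiply-one-add student (stand-in for a wide, shallow,
# low-latency net) by closed-form least squares on the teacher's actions.
n = len(dataset)
mean_o = sum(o for o, _ in dataset) / n
mean_a = sum(a for _, a in dataset) / n
cov = sum((o - mean_o) * (a - mean_a) for o, a in dataset)
var = sum((o - mean_o) ** 2 for o, _ in dataset)
w = cov / var
b = mean_a - w * mean_o

def fast_student(obs):
    return w * obs + b                # one multiply and one add: no deep serial stack

print(f"student policy: action ~= {w:.2f} * obs + {b:.2f}")   # recovers ~ (1.0 - obs)
print("fast_student(0.3) ->", round(fast_student(0.3), 2))
```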
On a related note, it is also striking how far down in size LLMs can be pushed. You can get good reasoning out of tiny billion-parameter LLMs trained hard enough on high-quality-enough data, and the ‘densifying experience curve’ is steady and rapid (halving period of ~4 months), so we can expect that at some point we may have superhuman reasoning LLMs in the billion or sub-billion parameter range… which are just very, very ignorant, perhaps even more ignorant than you or me, of all the real-world knowledge & text that a proper LLM has. We can’t train those from scratch, but we can train trillion-parameter LLMs to suck in all the text in the world, and then exhale training data for small fast cheap models.
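A sketch of that ‘exhale’ step, with a placeholder standing in for whatever trillion-parameter teacher you’d actually call (hypothetical names and a canned response, nothing from a real API):

```python
# Hedged sketch of "big model exhales training data for a small model".
# `big_model_generate` is a hypothetical stand-in, not a real library call.
import json

def big_model_generate(prompt: str) -> str:
    # Placeholder for a call to a trillion-parameter teacher LLM (API or local);
    # returns a canned string here so the sketch runs end to end.
    return "Problem: ...\nStep 1: ...\nStep 2: ...\nAnswer: ..."

SEED_TOPICS = ["long division", "unit conversions", "short proofs"]   # illustrative only

def exhale(topics, n_per_topic=3, out_path="distill.jsonl"):
    """Have the big teacher 'exhale' a synthetic reasoning corpus as JSONL,
    ready to be used as ordinary finetuning data for a small student LLM."""
    with open(out_path, "w") as f:
        for topic in topics:
            for _ in range(n_per_topic):
                prompt = f"Write a {topic} problem and a careful step-by-step solution."
                f.write(json.dumps({"prompt": prompt,
                                    "completion": big_model_generate(prompt)}) + "\n")

exhale(SEED_TOPICS)
```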
So it seems that Moravec’s Paradox remains undefeated: as difficult as we find abstract intellectual capabilities like doing math or reasoning, so difficult that we struggle even to write them down to train LLMs on, and so difficult to train on that we need giant gigawatt datacenters just to get started, they are not intrinsically difficult and, in the long run, do not require big expensive NNs.
The effectiveness of weight sharing (and parameter compression in general) diminishes as you move the domain from physics (simple rules/patterns tiled over all of space/time) up to language/knowledge (downstream facts/knowledge that are far too costly to rederive from simulation).
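A quick parameter count makes the physics end of that spectrum vivid (sizes picked arbitrarily for illustration): a 3×3 convolution reuses one tiny kernel across every position, while an unshared fully-connected map over the same feature map pays for every connection separately.

```python
# Parameter counts: shared local rule (conv) vs. no sharing (dense), arbitrary sizes.
H, W, C_IN, C_OUT, K = 256, 256, 64, 64, 3

conv_params = K * K * C_IN * C_OUT + C_OUT                          # one kernel, shared everywhere
dense_params = (H * W * C_IN) * (H * W * C_OUT) + H * W * C_OUT     # every connection its own weight

print(f"shared (3x3 conv):          {conv_params:,}")
print(f"unshared (fully-connected): {dense_params:,}")
print(f"compression from weight sharing: ~{dense_params // conv_params:,}x")
```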
BNNs can’t really take advantage of weight sharing so much, so ANNs that are closer to physics should be much smaller parameter-wise, for the same compute and capability. Which is what we observe for lower-level sensor/motor modalities.
Good points here.
Btw, I sometimes think back to how well your 3-year-old comments on this post have aged.
It might at this point just be underinvestment in robotics compared to other AI.
Admittedly, Gato didn’t have positive transfer, unlike all the other robotic elements.