gwern comments on Inference cost limits the impact of ever larger models

gwern 7 Feb 2025 21:37 UTC
16 points
1
Apropos of very low-latency LLMs and revisiting this topic a little: what does this imply about DRL robotics, rather than animals? Will DRL NNs have to have brains as big as humans in order to run superhuman humanoid robots?

One possible implication is that Portia-like NNs are possible for robotics in general. Robotics may be quite ‘easy’ in that sense.

It is striking that when we look at NN parameter/FLOPS-counts, we generally do not see ‘large’ robotics, vision, or sound models, but LLMs; the largest pure-vision models like PaLI-X are <100b parameters, the largest robotics models are usually <10b, with Gato 1’s ~1b having been, if anything, unusually large because of all the other stuff it was doing. (I’m very behind on the robotics literature so maybe there are now much larger 100b-parameter models as they move into the ‘foundation model’ multi-modal/task scaling paradigm, but I’d bet that there still are none >1,000b.) Even sound/image/video generative models, which would be expected to be much larger than necessary for robotics tasks, are often small enough to run on a single consumer GPU, still. And these are usually trained with scaling laws now, so these are compute-optimal sizes and it is not just that they are wildly under-parameterized (the way almost all models were pre-2020).

So, if robotics is intrinsically easy, but animal brains do not show this because of their latency requirements, which force them into misleadingly expensive brains, the implication is that we can do robotics by lifting the limitations of biological brains, like being forced to learn in realtime, in the real world, one animal at a time, without any sharing.

We should be able to train deep but small NNs in silico: turning all animal problems into Portia problems, if you will, pausing the simulation to let the NNs think & act for as long as necessary to plan the right action, and only then letting time flow to see what happens, and reset it to try again.
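
A minimal sketch of the "pause the simulation, think, then let time flow" loop, using a toy 1-D point mass and a brute-force planner in place of a NN. Everything here (`State`, `step`, `cost`, `plan`, `run_episode`, the dynamics, the cost weights) is a hypothetical illustration and not anything from a real robotics stack:

```python
import itertools
from dataclasses import dataclass

@dataclass(frozen=True)
class State:
    pos: float
    vel: float

ACTIONS = (-1.0, 0.0, 1.0)  # assumed discrete accelerations
DT = 0.1                    # assumed simulator timestep

def step(s: State, a: float) -> State:
    """Advance the toy simulator by one tick under acceleration a."""
    vel = s.vel + a * DT
    return State(s.pos + vel * DT, vel)

def cost(s: State) -> float:
    """Toy objective: stay near the origin and move slowly."""
    return s.pos ** 2 + 0.1 * s.vel ** 2

def plan(s: State, horizon: int = 4) -> float:
    """Exhaustively roll out every action sequence from a frozen copy of
    the state and return the first action of the cheapest sequence.
    The 'real' simulation does not advance while this runs."""
    best_action, best_cost = 0.0, float("inf")
    for seq in itertools.product(ACTIONS, repeat=horizon):
        sim, total = s, 0.0
        for a in seq:
            sim = step(sim, a)
            total += cost(sim)
        if total < best_cost:
            best_action, best_cost = seq[0], total
    return best_action

def run_episode(start: State, steps: int = 100) -> State:
    """One attempt: think with time paused, act, let one tick pass, repeat.
    Calling this again from the same start is the 'reset and try again'."""
    s = start
    for _ in range(steps):
        a = plan(s)     # simulated time is paused while the planner thinks
        s = step(s, a)  # only now does time flow
    return s
```

The planner can be made arbitrarily slow (longer `horizon`, finer actions) without the controlled system ever noticing, which is the freedom a real-time animal brain does not have.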

We remove all burdens of wallclock time or caloric consumption or childhood development to train teacher-models which are powerful general robotic controllers, and only then use these teacher-models to optimize low-latency controllers. The wider low-latency student models will be easier to train when they simply must imitate the teacher in a supervised-learning setting instead of RL from scratch, and so their size should come out a lot better than it would from scratch. (If nothing else, the student models can’t ‘die’ if they make a mistake like breaking a latency constraint, so this learning setting is way easier than an animal’s task.)
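
As a rough illustration of the teacher/student step, here is a toy sketch in which a slow search-based "teacher" labels states with actions and a two-weight linear "student" is fit to those labels by ordinary least squares. The dynamics, the cost, and all names (`teacher_action`, `fit_student`, `student_action`, `mse`) are assumptions made up for this example; a real setup would use NNs and a real simulator:

```python
import random

def teacher_action(pos: float, vel: float, horizon: int = 5, dt: float = 0.1) -> float:
    """Slow teacher: grid-search a constant acceleration in [-1, 1] by
    rolling out a toy point-mass model and summing a quadratic cost."""
    best_a, best_cost = 0.0, float("inf")
    for i in range(41):
        a = -1.0 + i * 0.05
        p, v, total = pos, vel, 0.0
        for _ in range(horizon):
            v += a * dt
            p += v * dt
            total += p * p + 0.1 * v * v
        if total < best_cost:
            best_a, best_cost = a, total
    return best_a

def fit_student(states, actions):
    """Supervised imitation: least-squares fit of a = w_pos*pos + w_vel*vel,
    solved in closed form via the 2x2 normal equations."""
    spp = sum(p * p for p, _ in states)
    spv = sum(p * v for p, v in states)
    svv = sum(v * v for _, v in states)
    spa = sum(p * a for (p, _), a in zip(states, actions))
    sva = sum(v * a for (_, v), a in zip(states, actions))
    det = spp * svv - spv * spv
    return ((spa * svv - sva * spv) / det, (sva * spp - spa * spv) / det)

def student_action(w, pos: float, vel: float) -> float:
    """Fast student: one multiply-add per input, no search at all."""
    return w[0] * pos + w[1] * vel

def mse(w, states, actions) -> float:
    """Mean squared imitation error of the student against teacher labels."""
    errs = [(student_action(w, p, v) - a) ** 2 for (p, v), a in zip(states, actions)]
    return sum(errs) / len(errs)

rng = random.Random(0)
states = [(rng.uniform(-1, 1), rng.uniform(-1, 1)) for _ in range(500)]
actions = [teacher_action(p, v) for p, v in states]  # teacher labels, no RL
weights = fit_student(states, actions)
```

The student never interacts with the environment during fitting; it only regresses onto the teacher's outputs, so a bad student just scores a higher `mse` rather than failing an episode.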

On a related note, it is also striking how far down in size LLMs can be pushed. You can get good reasoning out of tiny billion-parameter LLMs trained hard enough on high-quality-enough data, and the ‘densifying experience curve’ is steady and rapid (halving period of ~4 months), so we can expect that at some point we may have superhuman reasoning LLMs in the billion or sub-billion parameter range… which are just very, very ignorant, perhaps even more ignorant than you or me, of all the real-world knowledge & text that a proper LLM has. We can’t train those from scratch, but we can train trillion-parameter LLMs to suck in all the text in the world, and then exhale training data for small fast cheap models.
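
To make the halving arithmetic concrete: the sketch below reads the ~4-month figure as "the parameter count needed for a fixed capability halves every 4 months" and simply extrapolates it. That reading, the assumption that the trend continues, and the function names are all mine; the numbers are illustrative, not a forecast:

```python
import math

def params_needed(initial_params: float, months: float, halving_months: float = 4.0) -> float:
    """Parameters needed for the same capability after `months`,
    assuming a constant halving period."""
    return initial_params * 0.5 ** (months / halving_months)

def months_until(initial_params: float, target_params: float, halving_months: float = 4.0) -> float:
    """Months until `initial_params` shrinks to `target_params`
    under the same constant-halving assumption."""
    return halving_months * math.log2(initial_params / target_params)

# Three halvings in 12 months: 8e9 parameters -> 1e9 parameters.
one_year = params_needed(8e9, 12)
# A 1000x reduction (e.g. 1e12 -> 1e9) takes log2(1000) halvings, a bit over 3 years.
trillion_to_billion = months_until(1e12, 1e9)
```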

So it seems that Moravec’s Paradox remains undefeated: as difficult as we find the abstract intellectual capabilities like the process of doing math or reasoning, so difficult we struggle to even write them down to train LLMs on, so difficult to train on we need giant gigawatt datacenters to just get started, they are not intrinsically difficult and in the long run, do not require big expensive NNs.
- jacob_cannell 14 Feb 2025 2:01 UTC
  12 points
  0
  The effectiveness of weight sharing (and parameter compression in general) diminishes as you move the domain from physics (simple rules/patterns tiled over all of space/time) up to language/knowledge (downstream facts/knowledge that are far too costly to rederive from simulation).
  
  BNNs can’t really take advantage of weight sharing so much, so ANNs that are closer to physics should be much smaller parameter-wise, for the same compute and capability. Which is what we observe for lower-level sensor/motor modalities.
- SoerenMind 6 Mar 2025 11:48 UTC
  2 points
  0
  Good points here.
  
  Btw I sometimes think back to how well your 3-year-old comments on this post have aged.
- Noosphere89 7 Feb 2025 21:47 UTC
  2 points
  0
  It might be at this point just an underinvestment in robotics, compared to other AI.
  
  Admittedly, Gato didn’t have positive transfer, unlike all the other robotic elements.