Jacob’s argument in the Density and Temperature section of his Brain Efficiency post basically just fails.
Jacob is using a temperature formula for blackbody radiators, which is basically irrelevant to temperature of realistic compute substrate—brains, chips, and probably future compute substrates are all cooled by conduction through direct contact with something cooler (blood for the brain, heatsink/air for a chip). The obvious law to use instead would just be the standard thermal conduction law: heat flow per unit area proportional to temperature gradient.
Jacob’s analysis in that section also fails to adjust for how, by his own model in the previous section, power consumption scales linearly with system size (and also scales linearly with temperature).
Put all that together, and a more sensible formula would be:
q/A = C_1 T_S R / R^2 = C_2 (T_S − T_E) / R
… where:
R is radius of the system
A is surface area of thermal contact
q is heat flow out of system
T_S is system temperature
T_E is environment temperature (e.g. blood or heat sink temperature)
C_1, C_2 are constants with respect to system size and temperature
(Of course a spherical approximation is not great, but we’re mostly interested in change as all the dimensions scale linearly, so the geometry shouldn’t matter for our purposes.)
First key observation: all the R’s cancel out. If we scale down by a factor of 2, the power consumption is halved (since every wire is half as long), the area is quartered (so power density over the surface is doubled), and the temperature gradient is doubled since the surface is half as thick. So, overall, equilibrium temperature stays the same as the system scales down.
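To make the cancellation concrete, here is a minimal numeric sketch (the constants C1, C2 and the temperatures are arbitrary stand-ins, not physical values) of the balance between power generated and heat conducted out:

```python
import math

# Minimal sketch of the model above (arbitrary constants, not physical values):
#   power in:  q_in  = C1 * T_S * R              (wire power linear in size and temp)
#   heat out:  q_out = A * C2 * (T_S - T_E) / R  (conduction; gradient ~ delta-T over R)
# with A = 4*pi*R^2. Setting q_in = q_out, every R cancels:
#   C1 * T_S = 4*pi*C2 * (T_S - T_E)
C1, C2 = 1.0, 3.0
T_E = 310.0  # environment (blood) temperature, K

def equilibrium_T_S(R):
    k = 4 * math.pi * C2
    return k * T_E / (k - C1)  # note: R does not appear -- it cancelled out

for R in (1.0, 0.5, 0.25):
    print(f"R={R}: T_S = {equilibrium_T_S(R):.1f} K")  # identical at every scale
```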
So in fact scaling down is plausibly free, for purposes of heat management. (Though I’m not highly confident that would work in practice. In particular, I’m least confident about the temperature gradient scaling with inverse system size, in practice.)
On top of that, we could of course just use a colder environment, i.e. pump liquid nitrogen or even liquid helium over the thing. According to this meta-analysis, the average temperature delta between e.g. brain and blood is at most ~2.5 C, so even liquid nitrogen would be enough to achieve ~100x larger temperature delta if the system were at the same temperature as the brain; we don’t even need to go to liquid helium for that.
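For concreteness, the rough arithmetic behind that ~100x figure (temperatures approximate, with the ~2.5 C delta taken from the cited meta-analysis):

```python
T_brain = 310.0  # K, ~37 C
T_blood = 307.5  # K, ~2.5 C cooler, per the cited meta-analysis
T_LN2 = 77.0     # K, boiling point of liquid nitrogen

delta_blood = T_brain - T_blood  # ~2.5 K
delta_LN2 = T_brain - T_LN2      # ~233 K
print(delta_LN2 / delta_blood)   # ~93, i.e. roughly 100x
```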
In terms of scaling, our above formula says that T_S will scale proportionally to T_E. Halve the environment temperature, halve the system temperature. And that result I do expect to be pretty robust (for systems near Jacob’s interconnect Landauer limit), since it just relies on temperature scaling of the Landauer limit plus heat flow being proportional to temperature delta.
I’m going to make this slightly more legible, but not contribute new information.
Note that downthread, Jacob says:
the temp/size scaling part is not one of the more core claims so any correction there probably doesn’t change the conclusion much.
So if your interest is in Jacob’s arguments as they pertain to AI safety, this chunk of Jacob’s writings is probably not key for your understanding and you may want to focus your attention on other aspects.
Both Jacob and John agree on the obvious fact that active cooling is necessary for both the brain and GPUs, and is a crucial aspect of their design.
Jacob:
Humans have evolved exceptional heat dissipation capability using the entire skin surface for evaporative cooling: a key adaptation that supports both our exceptional long distance running ability, and our oversized brains...
Current 2021 gpus have a power density approaching 10^6 W/m^2, which severely constrains the design to that of a thin 2D surface to allow for massive cooling through large heatsinks and fans...
John:
… brains, chips, and probably future compute substrates are all cooled by conduction through direct contact with something cooler (blood for the brain, heatsink/air for a chip)…
On top of that, we could of course just use a colder environment, i.e. pump liquid nitrogen or even liquid helium over the thing. According to this meta-analysis, the average temperature delta between e.g. brain and blood is at most ~2.5 C, so even liquid nitrogen would be enough to achieve ~100x larger temperature delta if the system were at the same temperature as the brain; we don’t even need to go to liquid helium for that.
Where they disagree is on two points:
Whether temperature of GPUs/brains scales with their surface area
Tractability of dealing with higher temperatures in scaled-down computers with active cooling
Jacob applies the Stefan-Boltzmann Law for black body radiators. In this model, temperature is determined by the power emitted per unit of surface area:
T = (M_e / σ)^(1/4)
where M_e is the power per unit surface area in W/m^2, and σ is the Stefan-Boltzmann constant.
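As a quick sanity check (not from the original post), here is that formula evaluated at the power densities quoted in this discussion:

```python
SIGMA = 5.670e-8  # Stefan-Boltzmann constant, W/(m^2 K^4)

def blackbody_temp(M_e):
    # T = (M_e / sigma)^(1/4): the uncooled radiative-equilibrium temperature
    return (M_e / SIGMA) ** 0.25

print(blackbody_temp(1e6))  # ~2050 K for a 2021 GPU's ~10^6 W/m^2
print(blackbody_temp(1e9))  # ~11500 K for 10^9 W/m^2 -- well above the Sun's ~5800 K surface
```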
In comments, he rationalizes this choice by saying:
SB law describes the relationship between the power density of a surface and the corresponding temperature; it just gives you an idea of the equivalent temperature sans active cooling… That section was admittedly cut a little short, if I had more time/length it would justify a deeper dive into the physics of cooling and how much of a constraint that could be on the brain. You’re right though that the surface power density already describes what matters for cooling.
And downthread, he says:
I (and the refs I linked) use that as a starting point to indicate the temp the computing element would achieve without convective cooling (ie in vacuum or outer space).
John advocates an alternative formula for heat flow:
Put all that together, and a more sensible formula would be:
q/A = C_1 T_S R / R^2 = C_2 (T_S − T_E) / R
… where:
R is radius of the system
A is surface area of thermal contact
q is heat flow out of system
T_S is system temperature
T_E is environment temperature (e.g. blood or heat sink temperature)
C_1, C_2 are constants with respect to system size and temperature
R cancels out. I’m also going to move A over to the other side, ignore the constants for our conceptual purposes, and cut out the middle part of the equation, leaving us with:
q = A (T_S − T_E)
In language, the heat flow out of the brain/GPU and into its cooling system (i.e. blood, a heatsink) is proportional to (area of contact) x (temperature difference).
At first glance, this would appear to also show that as you scale down, heat flow out of the system will decrease because there’ll be less available area for thermal contact. The key point is whether or not power consumption stays the same as you scale down.
Here is Jacob’s description of what happens to power consumption in GPUs as you scale down:
Current 2021 gpus have a power density approaching 10^6 W/m^2, which severely constrains the design to that of a thin 2D surface...
This in turn constrains off-chip memory bandwidth to scale poorly: shrinking feature sizes with Moore’s Law by a factor of D increases transistor density by a factor of D^2, but at best only increases 2d off-chip wire density by a factor of only D, and doesn’t directly help reduce wire energy cost at all.
And here is John’s model, where he clearly and crucially disagrees with Jacob on whether scaling down affects power consumption by shortening wires (relevant text is bolded in the quote above and below).
If we scale down by a factor of 2, the power consumption is halved (since every wire is half as long), the area is quartered (so power density over the surface is doubled), and the temperature gradient is doubled since the surface is half as thick.
So in fact scaling down is plausibly free, for purposes of heat management...
John also speaks to our ability to upgrade the cooling system:
On top of that, we could of course just use a colder environment, i.e. pump liquid nitrogen or even liquid helium over the thing.
Jacob doesn’t really talk about the limits of our ability to cool GPUs by upgrading the cooling system in this section, talking only of the thin 2D design of GPUs being motivated by a need to achieve “massive cooling through large heatsinks and fans.” Ctrl+F does not find the words “nitrogen” and “helium” in his post, and only the version of John’s comment in DaemonicSigil’s rebuttal to Jacob contains those terms. I am not sure if Jacob has expanded on his thoughts on the limits of higher-performance cooling elsewhere in his many comment replies.
So as far as I can tell, this is where the chain of claims and counter-claims is parked for now: a disagreement over power consumption changes as wires are shortened, and a disagreement on how practical it is for better cooling to allow further miniaturization even if scaling down does result in decreased heat flows and thus higher temperatures inside of the GPU. I expect there might be disagreement over whether scaling down will permit thinning of the surface (as John tentatively proposes).
Note that I am not an expert on these specific topics, although I have a biomedical engineering MS—my contribution here is gathering relevant quotes and attempting to show how they relate to each other in a way that’s more convenient than bouncing back and forth between posts. If I have made mistakes, please correct me and I will update this comment. If it’s fundamentally wrong, rather than having a couple local errors, I’ll probably just delete it as I don’t want to add noise to the discussion.
Strongly upvoted for taking the effort to sum up the debate between these two.
Just a brief comment from me, this part:
If we scale down by a factor of 2, the power consumption is halved (since every wire is half as long), the area is quartered (so power density over the surface is doubled), and the temperature gradient is doubled since the surface is half as thick.
Only makes sense in the context of a specified temperature range and wire material. I’m not sure if it was specified elsewhere or not.
A trivial example: a superconducting wire at 50 K will certainly not have its power consumption halved by scaling down a factor of 2, since its consumption is already practically zero (though not perfectly zero).
This is all assuming that the power consumption for a wire is at-or-near the Landauer-based limit Jacob argued in his post.
Thank you for this effort. I will probably end up allocating a share of the prize money for effortposts like these too.
Thank you for the effort in organizing this conversation. I want to clarify a few points.
Around the very beginning of the density & temperature section I wrote:
but wire volume requirements scale linearly with dimension. So if we ignore all the machinery required for cellular maintenance and cooling, this indicates the brain is at most about 100x larger than strictly necessary (in radius), and more likely only 10x larger.
However, even though the wiring energy scales linearly with radius, the surface area power density which crucially determines temperature scales with the inverse squared radius, and the minimal energy requirements for synaptic computation are radius invariant.
Radius there refers to brain radius, not wire radius. Unfortunately there are two meanings of wiring energy or wire energy. By ‘wiring energy’ above hopefully the context helps make clear that I meant the total energy used by brain wiring/interconnect, not the ‘wire energy’ in terms of energy per bit per nm, which is more of a fixed constant that depends on wire design tradeoffs.
So my model was/is that if we assume you could just take the brain and keep the same amount of compute (neurons/synapses/etc) but somehow shrink the entire radius by a factor of D, this would decrease total wiring energy by the same factor D by just shortening all the wires in the obvious way.
However, the surface power density scales with radius as 1/R^2, so the net effect is that surface power density from interconnect scales with 1/R, ie it increases by a factor of D as you shrink by a factor of D, which thereby increases your cooling requirement (in terms of net heat flow) by the same factor D. But since the energy use of synaptic computation does not change at all, its surface power density scales with 1/R^2 and thus D^2, which quickly comes to dominate.
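A minimal sketch of that scaling (baseline powers are arbitrary stand-ins, purely illustrative):

```python
P_wire_0 = 10.0  # total interconnect power at R = 1 (scales with R), arbitrary units
P_syn_0 = 10.0   # total synaptic power (invariant as the brain shrinks)

def surface_power_density(R):
    A = R ** 2                 # surface area, up to a constant factor
    wire = (P_wire_0 * R) / A  # ~ 1/R
    syn = P_syn_0 / A          # ~ 1/R^2
    return wire, syn

for R in (1.0, 0.5, 0.1):
    wire, syn = surface_power_density(R)
    print(f"R={R}: interconnect ~{wire:.0f}, synaptic ~{syn:.0f}")
# The synaptic term grows as 1/R^2 and quickly dominates as R shrinks.
```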
In the section you quoted where I say:
This in turn constrains off-chip memory bandwidth to scale poorly: shrinking feature sizes with Moore’s Law by a factor of D increases transistor density by a factor of D^2, but at best only increases 2d off-chip wire density by a factor of only D, and doesn’t directly help reduce wire energy cost at all.
Now I have moved to talking about 2D microchips, and “wire energy” here means the energy per bit per nm, which again doesn’t scale with device size. Also the D here is scaling in a somewhat different way—it is referring to reducing the size of all devices as in normal Moore’s Law shrinkage while holding the total chip size constant, increasing device density.
Looking back at that section I see numerous clarifications I would make now, and I would also perhaps focus more on the surface power density as a function of size, and perhaps analyze cooling requirements. However I think it is reasonably clear from the document that shrinking the brain radius by a factor of X increases the surface power density (and thus cooling requirements in terms of coolant flow at fixed coolant temp) from synaptic computation by X2 and from interconnect wiring by X.
In practice digital computers are approaching the limits of miniaturization and tend to be 2D for fast logic chips in part for cooling considerations as I describe. The Cerebras wafer for example represents a monumental engineering advance in terms of getting power in and pumping heat out to a small volume, but they still use a 2D chip design, not 3D, because 2D allows you dramatically more surface area for pumping in power and out heat than a 3D design, at the sacrifice of much worse interconnect geometry scaling in terms of latency and bandwidth.
We can make 3D chips today and do, but that tends to be most viable for memory rather than logic, because memory has far lower power density (and the brain being neuromorphic is more like a giant memory chip with logic sprinkled around right next to each memory unit).
(Note that this, in turn, also completely undermines the claims about optimality of speed in the next section. Those claims ultimately ground out in high temperatures making high clock speeds prohibitive, e.g. this line:
Scaling a brain to GHz speeds would increase energy and thermal output into the 10 MW range, and surface power density to 10^9 W/m^2, with temperatures well above the surface of the sun
)
For extra clarification, that should perhaps read “with (uncooled) temperatures well above …” (ie isolated in vacuum).
I think you may be misunderstanding why I used the blackbody temp—I (and the refs I linked) use that as a starting point to indicate the temp the computing element would achieve without convective cooling (ie in vacuum or outer space). So when I (or the refs I link) mention “temperatures greater than the surface of the sun” for the surface of some CMOS processor, it is not because we actually believe your GPU achieves that temperature (unless you have some critical cooling failure or short circuit, in which case it briefly achieves a very high temperature before melting somewhere).
So in fact scaling down is plausibly free, for purposes of heat management. (Though I’m not highly confident that would work in practice. In particular, I’m least confident about the temperature gradient scaling with inverse system size, in practice.)
I think this makes all the wrong predictions and so is likely wrong, but I will consider it more.
On top of that, we could of course just use a colder environment, i.e. pump liquid nitrogen or even liquid helium over the thing.
Of course—not really relevant for the brain, but that is an option for computers. Obviously you aren’t gaining thermodynamic efficiency by doing so—you pay extra energy to transport the heat.
All that being said, I’m going to look into this more and if I feel a correction to the article is justified I will link to your comment here with a note. But the temp/size scaling part is not one of the more core claims so any correction there probably doesn’t change the conclusion much.
I think you may be misunderstanding why I used the blackbody temp—I (and the refs I linked) use that as a starting point to indicate the temp the computing element would achieve without convective cooling (ie in vacuum or outer space).
There’s a pattern here which seems-to-me to be coming up repeatedly (though this is the most legible example I’ve seen so far). There’s a key qualifier which you did not actually include in your post, which would make the claims true. But once that qualifier is added, it’s much more obvious that the arguments are utterly insufficient to back up big-sounding claims like:
Thus even some hypothetical superintelligence, running on non-exotic hardware, will not be able to think much faster than an artificial brain running on equivalent hardware at the same clock rate.
Like, sure, our hypothetical superintelligence can’t build highly efficient compute which runs in space without any external cooling machinery. So, our hypothetical superintelligence will presumably build its compute with external cooling machinery, and then this vacuum limit just doesn’t matter.
You could add all those qualifiers to the strong claims about superintelligence, but then they will just not be very strong claims. (Also, as an aside, I think the wording of the quoted section is not the claim you intended to make, even ignoring qualifiers? The quote is from the speed section, but “equivalent hardware at the same clock rate” basically rules out any hardware speed difference by construction. I’m responding here to the claim which I think you intended to argue for in the speed section.)
Obviously you aren’t gaining thermodynamic efficiency by doing so—you pay extra energy to transport the heat.
Note that you also potentially save energy by running at a lower temperature, since the Landauer limit scales down with temperature. I think it comes out to roughly a wash: operate at 10x lower temperature, and power consumption can drop by 10x (at Landauer limit), but you have to pay 9x the (now reduced) power consumption in work to pump that heat back up to the original temperature. So, running at lower temperature ends up energy-neutral if we’re near thermodynamic limits for everything.
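The arithmetic behind that “roughly a wash” claim, assuming ideal (Carnot) heat pumping and Landauer-limited power that scales linearly with temperature:

```python
T_hot = 300.0   # K, environment the heat must ultimately be dumped into
T_cold = 30.0   # K, a 10x colder operating temperature

P_hot = 100.0   # W, hypothetical Landauer-limited power when operating at T_hot
P_cold = P_hot * (T_cold / T_hot)  # 10 W: Landauer limit scales with T

# Ideal work to pump heat Q from T_cold up to T_hot: W = Q * (T_hot - T_cold) / T_cold,
# i.e. 9x the (reduced) power consumption for a 10x temperature drop.
W_pump = P_cold * (T_hot - T_cold) / T_cold

print(P_cold + W_pump)  # 100.0 W -- the same total budget as before, a wash
```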
The ‘big-sounding’ claim you quoted makes more sense only with the preceding context you omitted:
Conclusion: The brain is a million times slower than digital computers, but its slow speed is probably efficient for its given energy budget, as it allows for a full utilization of an enormous memory capacity and memory bandwidth. As a consequence of being very slow, brains are enormously circuit cycle efficient. Thus even some hypothetical superintelligence, running on non-exotic hardware, will not be able to think much faster than an artificial brain running on equivalent hardware at the same clock rate.
Because of its slow speed, the brain is super-optimized for intelligence per clock cycle. So digital superintelligences can think much faster, but to the extent they do so they are constrained to be brain-like in design (ultra optimized for low circuit depth). I have a decade old post analyzing/predicting this here, and today we have things like GPT4 which imitate the brain but run 1000x to 10000x faster during training, and thus excel at writing.
I agree the blackbody formula doesn’t seem that relevant, but it’s also not clear what relevance Jacob is claiming it has. He does discuss that the brain is actively cooled. So let’s look at the conclusion of the section:
Conclusion: The brain is perhaps 1 to 2 OOM larger than the physical limits for a computer of equivalent power, but is constrained to its somewhat larger than minimal size due in part to thermodynamic cooling considerations.
If the temperature-gradient-scaling works and scaling down is free, this is definitely wrong. But you explicitly flag your low confidence in that scaling, and I’m pretty sure it wouldn’t work.* In which case, if the brain were smaller, you’d need either a hotter brain or a colder environment.
I think that makes the conclusion true (with the caveat that ‘considerations’ are not ‘fundamental limits’).
(My gloss of the section is ‘you could potentially make the brain smaller, but it’s the size it is because cooling is expensive in a biological context, not necessarily because blind-idiot-god evolution left gains on the table’).
* I can provide some hand-wavy arguments about this if anyone wants.
My gloss of the section is ‘you could potentially make the brain smaller, but it’s the size it is because cooling is expensive in a biological context, not necessarily because blind-idiot-god evolution left gains on the table’
I tentatively buy that, but then the argument says little-to-nothing about barriers to AI takeoff. Like, sure, the brain is efficient subject to some constraint which doesn’t apply to engineered compute hardware. More generally, the brain is probably efficient relative to lots of constraints which don’t apply to engineered compute hardware. A hypothetical AI designing hardware will have different constraints.
Either Jacob needs to argue that the same limiting constraints carry over (in which case hypothetical AI can’t readily outperform brains), or he does not have a substantive claim about AI being unable to outperform brains. If there’s even just one constraint which is very binding for brains, but totally tractable for engineered hardware, then that opens the door to AI dramatically outperforming brains.
I tentatively buy that, but then the argument says little-to-nothing about barriers to AI takeoff. Like, sure, the brain is efficient subject to some constraint which doesn’t apply to engineered compute hardware.
The main constraint at minimal device sizes is the thermodynamic limit for irreversible computers, so the wire energy constraint is dominant there.
However the power dissipation/cooling ability for a 3D computer only scales with the surface area d^2, whereas compute device density scales with d^3 and interconnect scales somewhere in between.
The point of the temperature/cooling section was just to show that shrinking the brain by a factor of X (if possible given space requirements of wire radius etc), would increase surface power density by a factor of X^2, but would only decrease wire length & energy by X and would not decrease synapse energy at all.
2D chips scale differently of course: the surface area and heat dissipation tend to both scale with d^2. Conventional chips are already approaching miniaturization limits and will dissipate too much power at full activity, but that’s a separate investigation. 3D computers like the brain can’t run that hot given any fixed tech ability to remove heat per unit surface area. 2D computers are also obviously worse in many respects, as long range interconnect bandwidth (to memory) only scales with d rather than the d^2 of compute, which is basically terrible compared to a 3D system where compute density and long-range interconnect scale as d^3 and d^2 respectively.
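Collecting the scaling exponents being compared in one place (a restatement of the claims above, not additional analysis):

```python
# Exponents of linear dimension d for each quantity (constants ignored)
scaling = {
    "3D (brain-like)": {"compute": 3, "long-range interconnect": 2, "heat dissipation": 2},
    "2D (chip-like)": {"compute": 2, "long-range interconnect": 1, "heat dissipation": 2},
}
for geometry, exponents in scaling.items():
    print(geometry + ": " + ", ".join(f"{name} ~ d^{e}" for name, e in exponents.items()))
```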
Had it turned out that the brain was big because blind-idiot-god left gains on the table, I’d have considered it evidence of more gains lying on other tables and updated towards faster takeoff.
I mean, sure, but I doubt that e.g. Eliezer thinks evolution is inefficient in that sense.
Basically, there are only a handful of specific ways we should expect to be able to beat evolution in terms of general capabilities, a priori:
Some things just haven’t had very much time to evolve, so they’re probably not near optimal. Broca’s area would be an obvious candidate, and more generally whatever things separate human brains from other apes.
There’s ways to nonlocally redesign the whole system to jump from one local optimum to somewhere else.
We’re optimizing against an environment different from the ancestral environment, or structural constraints different from those faced by biological systems, such that some constraints basically cease to be relevant. The relative abundance of energy is one standard example of a relaxed environmental constraint; the birth canal as a limiting factor on human brain size during development or the need to make everything out of cells are standard examples of relaxed structural constraints.
One particularly important sub-case of “different environment”: insofar as the ancestral environment mostly didn’t change very quickly, evolution didn’t necessarily select heavily for very generalizable capabilities. The sphex wasp behavior is a standard example. A hypothetical AI designer would presumably design/select for generalization directly.
(I expect that Eliezer would agree with roughly this characterization, by the way. It’s a very similar way-of-thinking to Inadequate Equilibria, just applied to bio rather than econ.) These kinds of loopholes leave ample space to dramatically improve on the human brain.
Interesting—I think I disagree most with 1. The neuroscience seems pretty clear that the human brain is just a scaled up standard primate brain, the secret sauce is just language (I discuss this now and again in some posts and in my recent part 2). In other words—nothing new about the human brain has had much time to evolve, all evolution did was tweak a few hyperparams mostly around size and neoteny (training time): very very much like GPT-N scaling (which my model predicted).
Basically human technology beats evolution because we are not constrained to use self replicating nanobots built out of common locally available materials for everything. A jet airplane design is not something you can easily build out of self replicating nanobots—it requires too many high energy construction processes and rare materials spread across the earth.
Microchip fabs and their outputs are the pinnacle of this difference—requiring rare elements across the periodic table, massively complex global supply chains and many steps of intricate high energy construction/refinement processes all throughout.
What this ends up buying you mostly is very high energy densities—useful for engines, but also for fast processors.
Yeah, the main changes I’d expect in category 1 are just pushing things further in the directions they’re already moving, and then adjusting whatever else needs to be adjusted to match the new hyperparameter values.
One example is brain size: we know brains have generally grown larger in recent evolutionary history, but they’re locally-limited by things like e.g. birth canal size. Circumvent the birth canal, and we can keep pushing in the “bigger brain” direction.
Or, another example is the genetic changes accounting for high IQ among the Ashkenazi. In order to go further in that direction, the various physiological problems those variants can cause need to be offset by other simultaneous changes, which is the sort of thing a designer can do a lot faster than evolution can. (And note that, given how much the Ashkenazi dominated the sciences in their heyday, that’s the sort of change which could by itself produce sufficiently superhuman performance to decisively outperform human science/engineering, if we can go just a few more standard deviations along the same directions.)
… but I do generally expect that the “different environmental/structural constraints” class is still where the most important action is by a wide margin. In particular, the “selection for generality” part is probably pretty big game, as well as selection pressures for group interaction stuff like language (note that AI potentially allows for FAR more efficient communication between instances), and the need for learning everything from scratch in every instance rather than copying, and generally the ability to integrate quantitatively much more information than was typically relevant or available to local problems in the ancestral environment.
Circumvent the birth canal, and we can keep pushing in the “bigger brain” direction.
Chinchilla scaling already suggests the human brain is too big for our lifetime data, and multiple distant lineages with very few natural size limits (whales, elephants) ended up plateauing at similar brain neuron and synapse counts, within the same OOM.
Or, another example is the genetic changes accounting for high IQ among the Ashkenazi. In order to go further in that direction,
Human intelligence in terms of brain arch priors also plateaus; the Ashkenazi just selected a bit stronger towards that plateau. Intelligence also has neoteny tradeoffs resulting in numerous ecological niches in tribes—faster to breed often wins.
Chinchilla scaling already suggests the human brain is too big for our lifetime data
So I haven’t followed any of the relevant discussion closely, apologies if I’m missing something, but:
IIUC Chinchilla here references a paper talking about tradeoffs between how many artificial neurons a network has and how much data you use to train it; adding either of those requires compute, so to get the best performance where do you spend marginal compute? And the paper comes up with a function for optimal neurons-versus-data for a given amount of compute, under the paradigm we’re currently using for LLMs. And you’re applying this function to humans.
If so, a priori this seems like a bizarre connection for a few reasons, any one of which seems sufficient to sink it entirely:
Is the paper general enough to apply to human neural architecture? By default I would have assumed not, even if it’s more general than just current LLMs.
Is the paper general enough to apply to human training? By default I would have assumed not. (We can perhaps consider translating the human visual field to a number of bits and taking a number of snapshots per second and considering those to be training runs, but… is there any principled reason not to instead translate to 2x or 0.5x the number of bits or snapshots per second? And that’s just the amount of data, to say nothing of how the training works.)
It seems you’re saying “at this amount of data, adding more neurons simply doesn’t help” rather than “at this amount of data and neurons, you’d prefer to add more data”. That’s different from my understanding of the paper but of course it might say that as well or instead of what I think it says.
To be clear, it seems to me that you don’t just need the paper to be giving you a scaling law that can apply to humans, with more human neurons corresponding to more artificial neurons and more human lifetime corresponding to more training data. You also need to know the conversion functions, to say “this (number of human neurons, amount of human lifetime) corresponds to this (number of artificial neurons, amount of training data)” and I’d be surprised if we can pin down the relevant values of either parameter to within an order of magnitude.
...but again, I acknowledge that you know what you’re talking about here much more than I do. And, I don’t really expect to understand if you explain, so you shouldn’t necessarily put much effort into this. But if you think I’m mistaken here, I’d appreciate a few words like “you’re wrong about the comparison I’m drawing” or “you’ve got the right idea but I think the comparison actually does work” or something, and maybe a search term I can use if I do feel like looking into it more.
Thanks for your contribution. I would also appreciate a response from Jake.
Why do you think this?
For my understanding: what is a brain arch?
The architectural design of a brain, which I think of as a prior on the weights (so I sometimes call it the architectural prior). It is encoded in the genome and is the equivalent of the high level PyTorch code for a deep learning model.
Jacob’s analysis in that section also fails to adjust for how, by his own model in the previous section, power consumption scales linearly with system size (and also scales linearly with temperature).
If we fix the neuron/synapse/etc count (and just spread them out evenly across the volume) then length and thus power consumption of interconnect scale linearly with radius R, but the power consumption of compute units (synapses) doesn’t scale at all. Surface power density from the compute units thus scales with 1/R^2.
First key observation: all the R’s cancel out. If we scale down by a factor of 2, the power consumption is halved (since every wire is half as long), the area is quartered (so power density over the surface is doubled), and the temperature gradient is doubled since the surface is half as thick
This seems rather obviously incorrect to me:
There is simply a maximum amount of heat/entropy any particle of coolant fluid can extract, based on the temperature difference between the coolant particle and the compute medium
The maximum flow of coolant particles scales with the surface area.
A fixed compute temperature limit, coolant temperature, and coolant pump rate thus result in a limit on the device radius
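A minimal sketch of that limit (numbers hypothetical; a water-like coolant and ideal heat exchange are assumed):

```python
import math

C_P = 4186.0  # J/(kg K), specific heat of a water-like coolant

def max_heat_removal(R, T_limit, T_coolant, flow_per_area):
    # Max extractable heat = mass flow * specific heat * allowed temperature rise,
    # with max coolant mass flow proportional to contact surface area (~ R^2).
    A = 4 * math.pi * R ** 2
    m_dot = flow_per_area * A  # kg/s
    return m_dot * C_P * (T_limit - T_coolant)  # W

# At fixed temperatures and pump rate, removable heat falls as R^2, while the
# required removal (per the scaling arguments above) falls only as R or slower,
# implying a minimum viable radius.
for R in (1.0, 0.5, 0.25):
    print(R, max_heat_removal(R, T_limit=310.0, T_coolant=300.0, flow_per_area=0.1))
```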
Obviously I do agree the brain is nowhere near the technological limits of active cooling in terms of entropy removed per unit surface area per unit time, but that’s also mostly irrelevant because you expend energy to move the heat and the brain has a small energy budget of 20 W. Its coolant budget is proportional to its compute budget.
Moreover as you scale the volume down the coolant travels a shorter distance and has less time to reach equilibrium temp with the compute volume and thus extract the max entropy (but not sure how relevant that is at brain size scales).