Gerald Monroe comments on Thoughts on hardware / compute requirements for AGI

• Please read a neuroscience book, even an introductory one, on how a synapse works. Just 1 chapter, even.

There’s a MAC in there. It’s because the incoming action potential hits the synapse, and sends a certain quantity of neurotransmitters across a gap. The sender cell can vary how much neurotransmitter it sends, and the receiving cell can vary how many active receptors it has. The type of neurotransmitter determines the gain and sign. (this is like the exponent and sign bit for 8 bit BFloat)

These 2 variables can be combined into a single coefficient; you can think of it as “voltage delta” (it can be + or -).

So it’s (1) * (voltage gain) = change in target cell voltage.

For ANN, it’s <activation output> * <weight> = change in target node activation input.

The brain also uses timing to get more information than just “1”: the exact time the pulse arrived matters, up to a certain resolution. That resolution is NOT infinite, for reasons I can explain if you want.

So the final equation is (1) * (synapse state) * (voltage gain) = change in target cell voltage.

Aka you have to multiply 2 numbers together and add, which is what “multiply-accumulate” units do.
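To make the equation above concrete, here is a minimal Python sketch of a synapse update written as a multiply-accumulate. The names (`synaptic_mac`, `spikes`, `timing_state`, `voltage_gain`) are hypothetical labels for the quantities described above, not terms from any library, and the numbers are made up for illustration.

```python
def synaptic_mac(spikes, timing_state, voltage_gain, resting_voltage=0.0):
    """Accumulate the voltage change contributed by each incoming synapse.

    spikes:        1 if an action potential arrived on that synapse, else 0
    timing_state:  hypothetical per-synapse factor standing in for arrival timing
    voltage_gain:  signed coefficient (transmitter release combined with receptor count)
    """
    voltage = resting_voltage
    for spike, state, gain in zip(spikes, timing_state, voltage_gain):
        voltage += spike * state * gain  # multiply, then accumulate
    return voltage


# Three synapses: two with positive gain, one with negative gain.
delta_v = synaptic_mac(
    spikes=[1, 0, 1],
    timing_state=[1.0, 1.0, 0.5],
    voltage_gain=[0.5, 0.25, -0.5],
)
print(delta_v)
```

This has the same shape as the ANN case: an activation times a weight, summed into the target's input.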

Because of all the horrible electrical noise in the brain, plus biological forms of noise, contaminants, and other factors, I made it only 8 bits (1 part in 256) of precision. Realistically that’s probably generous; it’s probably not even that good.
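For a sense of what “1 part in 256” means, here is a small sketch that rounds a coefficient to one of 256 levels. It assumes uniform spacing over a made-up range of [-1, 1], which is a simpler stand-in than an 8-bit floating-point format; `quantize_8bit` is a hypothetical helper.

```python
def quantize_8bit(x, lo=-1.0, hi=1.0):
    """Round x to the nearest of 256 evenly spaced levels spanning [lo, hi]."""
    levels = 256
    step = (hi - lo) / (levels - 1)
    clipped = min(max(x, lo), hi)          # values outside the range saturate
    index = round((clipped - lo) / step)   # integer level in 0..255
    return lo + index * step


gain = 0.3
approx = quantize_8bit(gain)
error = abs(approx - gain)
print(approx, error)
```

The rounding error is at most half a step, so anything smaller than that is lost; the claim above is that biological noise is at least that large.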

There are immense amounts of additional complexity in the brain, but almost none of it matters for determining inference outputs. The action potentials rush out of the synapse at kilometers per second, so many biological processes just don’t matter at all. It’s the same way a transistor’s detailed behavior is irrelevant: it’s a cartoon switch.

For training, sure, if we wanted a system to work like a brain we’d have to model some of this, but we don’t. We can train using whatever algorithm is measurably optimal.

Similarly we never have to bother with a “minicolumn”. We only care about what works best. Notice how human aerospace engineers never developed flapping wings for passenger aircraft, because they do not work all that well.

We probably will find something way better than a minicolumn. Some argue that’s what a transformer is.

• I’ve spent thousands of hours reading neuroscience papers, I know how synapses work, jeez :-P

Similarly we never have to bother with a “minicolumn”. We only care about what works best. Notice how human aerospace engineers never developed flapping wings for passenger aircraft, because they do not work all that well.

We probably will find something way better than a minicolumn. Some argue that’s what a transformer is.

I’m sorta confused that you wrote all these paragraphs with (as I understand it) the message that if we want future AGI algorithms to do the same things that a brain can do, then it needs to do MAC operations in the same way that (you claim) brain synapses do, and it needs to have 68 TB of weight storage just as (you claim) the brain does. …But then here at the end you seem to do a 180° flip and talk about flapping wings and transformers and “We probably will find something way better”. OK, if “we probably will find something way better”, do you think that the “way better” thing will also definitely need 68 TB of memory, and definitely not orders of magnitude less than 68 TB? If you think it definitely needs 68 TB of memory, no way around it, then what’s your basis for believing that? And how do you reconcile that belief with the fact that we can build deep learning models of various types that do all kinds of neat things like language modeling and motor control and speech synthesis and image recognition etc. but require ≈100-100,000× less than 68 TB of memory? How are you thinking about that? (Maybe you have a “scale-is-all-you-need” perspective, and you note that we don’t have AGI yet, and therefore the explanation must be “insufficient scale”? Or something else?)

There’s a MAC in there.

OK, imagine for the sake of argument that we live in the following world (a caricatured version of this model):

• Dendrites have lots of clusters of 10 nearby synapses

• Iff all 10 synapses within one cluster get triggered simultaneously, then it triggers a dendritic spike on the downstream neuron.

• Different clusters on the same dendritic tree can each be treated independently

• As background, the whole dendrite doesn’t have a single voltage (let alone the whole dendritic tree). Dendrites have different voltages in different places. If there are multiple synaptic firings that are very close in both time and space, then the voltages can add up and get past the spike threshold; but if multiple synapses that are very far apart from each other fire simultaneously, they don’t add up, they each affect the voltage in their own little area, and it doesn’t create a dendritic spike.

• The upstream neurons are all firing on a regular clock cycle, such that the synapse firing is either “simultaneous” or “so far apart in time that we can treat each timestep independently”.

In this imaginary world, you would use AND (within each cluster of 10 synapses) and OR (between clusters) to calculate whether dendritic spikes happen or not. Agree?

Using MACs in this imaginary world is both too complicated and too simple. It’s too complicated because it’s a very wasteful way to calculate AND. It’s too simple because it’s wrong to MAC together spatially-distant synapses, when in fact spatially-distant synapses can’t collaboratively create a spike.

If you’re with me so far, that’s what I mean when I say that this model has “no MAC operations”.
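The imaginary world above can be sketched in a few lines of Python. This is only an illustration of the caricature: synapses are integer ids, clusters are sets of ids, and `naive_mac_spike` is a hypothetical unit-weight sum-and-threshold included purely for contrast.

```python
def dendritic_spike(active_synapses, clusters):
    """AND within each cluster, OR between clusters.

    active_synapses: set of synapse ids that fired on this clock cycle
    clusters:        list of sets, each holding the synapse ids of one cluster
    """
    return any(cluster <= active_synapses for cluster in clusters)


def naive_mac_spike(active_synapses, clusters, threshold=10):
    """Sum every active synapse on the tree (unit weights) and threshold the total."""
    total = sum(1 for cluster in clusters for s in cluster if s in active_synapses)
    return total >= threshold


clusters = [set(range(0, 10)), set(range(10, 20))]
all_of_first = set(range(0, 10))                    # one full cluster
spread_out = set(range(0, 5)) | set(range(10, 15))  # 5 + 5 across two clusters

print(dendritic_spike(all_of_first, clusters))  # True
print(dendritic_spike(spread_out, clusters))    # False
print(naive_mac_spike(spread_out, clusters))    # True
```

The last line is the “too simple” failure: summing over spatially distant synapses reports a spike that the clustered model says cannot happen.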

And by the way, I think we could reformulate this same algorithm to have a very different low-level implementation (but the same input and output), by replacing “groups of neurons that form clusters together” with “serial numbers”. Then there would be no MACs and there would be no multi-synapse ANDs, but rather there would be various hash tables or something, I dunno. And the memory requirements would be different, as would the number of required operations, presumably.

At this point maybe you’re going to reply “OK but that’s an imaginary world, whereas I want to talk about the real world.” Certainly the bullet points above are erasing real-world complexities. But it’s very difficult to judge which real-world complexities are actually playing an important role in brain algorithms and which aren’t. For example, should we treat (certain classes of) cortical synapses as having binary strength rather than smoothly-varying strength? That’s a longstanding controversy! Do neurons really form discrete and completely-noninteracting clusters on dendrites? I doubt it…but maybe the brain would work better if they did!! What about all the other things going on in the cortex? That’s a hard question. There are definitely other things going on unrelated to this particular model, but it’s controversial exactly what they are.

• In this imaginary world, you would use AND (within each cluster of 10 synapses) and OR (between clusters) to calculate whether dendritic spikes happen or not. Agree?

Using MACs in this imaginary world is both too complicated and too simple. It’s too complicated because it’s a very wasteful way to calculate AND. It’s too simple because it’s wrong to MAC together spatially-distant synapses, when spatially-distant synapses can’t collaboratively create a spike.

If you’re with me so far, that’s what I mean when I say that this model has “no MAC operations”.

Sure. I was focused on what I thought was the minimum computationally relevant model. That is, we model every effect that matters to whether a synapse ultimately fires, to a level good enough that the error is within the noise threshold.

OK, if “we probably will find something way better”, do you think that the “way better” thing will also definitely need 68 TB of memory, and definitely not orders of magnitude less than 68 TB? If you think it definitely needs 68 TB of memory, no way around it, then what’s your basis for believing that? And how do you reconcile that belief with the fact that we can build deep learning models of various types that do all kinds of neat things like language modeling and motor control and speech synthesis and image recognition etc. but require ≈100-100,000× less than 68 TB of memory? How are you thinking about that?

I think I was just trying to fill in the rest of your cartoon model. No, we probably don’t need exactly that much memory, but I addressed your misconceptions about repeated algorithms applied across parallel input. You do need a copy in memory of every repeat. 4090 will not cut it.

If you wanted a tighter model, you might ask “ok how much of the brain is speech processing vs vision and robotics control”. Then you can estimate how much bigger GPT-3 has to be to also run a robot to human levels of dexterity and see.

Right now GPT-3 is 175 billion params, or 350 GB at 16 bits per param. So you need something like 4-8 A100s or H100s to run it. I think above I said 48 cards to hit brain-level compute, and 960 if we end up needing a lot of extra memory. That’s with cards that exist today.
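The GPT-3 arithmetic can be checked with a short sketch. It counts weight storage only (no activations or caches) and assumes 80 GB of memory per card, which is my assumption rather than a figure from this thread; `weight_memory_bytes` and `cards_needed` are hypothetical helpers.

```python
import math

GB = 10**9


def weight_memory_bytes(num_params, bits_per_param):
    """Bytes needed to hold the weights alone."""
    return num_params * bits_per_param // 8


def cards_needed(total_bytes, bytes_per_card):
    """Smallest number of cards whose combined memory holds total_bytes."""
    return math.ceil(total_bytes / bytes_per_card)


gpt3_bytes = weight_memory_bytes(175 * 10**9, 16)
print(gpt3_bytes / GB)                    # 350.0
print(cards_needed(gpt3_bytes, 80 * GB))  # 5
```

Under those assumptions the weights alone need 5 cards, which sits inside the 4-8 range quoted above.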

So you can optimize this down a lot, but probably not to a 4090. One way to optimize is to have a cognitive architecture made of many specialized networks, and only load the ones relevant for the current task. So the AGI needs time to “context switch” by loading the set of networks it needs from storage to memory.
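A minimal sketch of that context-switching idea, assuming each specialized network is just a name with a size and that switching tasks evicts whatever the new task does not use. `NetworkPool` and the sizes below are hypothetical; a real system would move actual weights between storage and accelerator memory.

```python
class NetworkPool:
    """Keep only the networks needed for the current task resident in memory."""

    def __init__(self, storage, capacity):
        self.storage = storage    # name -> size (stand-in for weights on disk)
        self.capacity = capacity  # total memory available, same units as sizes
        self.resident = {}        # name -> size of networks currently loaded

    def switch_to(self, needed):
        """Make exactly the networks in `needed` resident; return the amount loaded."""
        required = sum(self.storage[name] for name in needed)
        if required > self.capacity:
            raise MemoryError("task needs more memory than is available")
        # Evict networks the new task does not use.
        self.resident = {n: s for n, s in self.resident.items() if n in needed}
        loaded = 0
        for name in needed:
            if name not in self.resident:
                self.resident[name] = self.storage[name]
                loaded += self.storage[name]
        return loaded


pool = NetworkPool({"language": 40, "vision": 30, "motor": 20}, capacity=80)
first = pool.switch_to({"language", "vision"})   # loads both networks
second = pool.switch_to({"language", "motor"})   # language stays, motor is loaded
print(first, second)
```

The cost of a context switch is then roughly the amount loaded divided by the storage-to-memory bandwidth, which is why networks shared between tasks are cheap to keep around.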

Instant switching can also be done at a larger, datacenter level with a fairly obvious algorithm.