I’ve spent thousands of hours reading neuroscience papers, I know how synapses work, jeez :-P
Similarly, we never have to bother with a “minicolumn”. We only care about what works best. Notice how human aerospace engineers never developed flapping wings for passenger aircraft, because flapping wings do not work all that well.
We probably will find something way better than a minicolumn. Some argue that’s what a transformer is.
I’m sorta confused that you wrote all these paragraphs with (as I understand it) the message that if we want future AGI algorithms to do the same things that a brain can do, then they need to do MAC operations in the same way that (you claim) brain synapses do, and they need to have 68 TB of weight storage just as (you claim) the brain does. …But then here at the end you seem to do a 180° flip and talk about flapping wings and transformers and “We probably will find something way better”. OK, if “we probably will find something way better”, do you think that the “way better” thing will also definitely need 68 TB of memory, and definitely not orders of magnitude less than 68 TB? If you think it definitely needs 68 TB of memory, no way around it, then what’s your basis for believing that? And how do you reconcile that belief with the fact that we can build deep learning models of various types that do all kinds of neat things like language modeling and motor control and speech synthesis and image recognition etc., but require ≈100-100,000× less than 68 TB of memory? How are you thinking about that? (Maybe you have a “scale-is-all-you-need” perspective, and you note that we don’t have AGI yet, and therefore the explanation must be “insufficient scale”? Or something else?)
There’s a MAC in there.
OK, imagine for the sake of argument that we live in the following world (a caricatured version of this model):
- Dendrites have lots of clusters of 10 nearby synapses.
- Iff all 10 synapses within one cluster get triggered simultaneously, a dendritic spike is triggered on the downstream neuron.
- Different clusters on the same dendritic tree can each be treated independently. (As background, the whole dendrite doesn’t have a single voltage, let alone the whole dendritic tree; dendrites have different voltages in different places. If multiple synaptic firings are very close in both time and space, then the voltages can add up and get past the spike threshold; but if synapses that are very far apart from each other fire simultaneously, they don’t add up: each affects the voltage in its own little area, and no dendritic spike is created.)
- The upstream neurons are all firing on a regular clock cycle, such that any two synapse firings are either “simultaneous” or so far apart in time that each timestep can be treated independently.
In this imaginary world, you would use AND (within each cluster of 10 synapses) and OR (between clusters) to calculate whether dendritic spikes happen or not. Agree?
Using MACs in this imaginary world is both too complicated and too simple. It’s too complicated because it’s a very wasteful way to calculate AND. It’s too simple because it’s wrong to MAC together spatially-distant synapses, when in fact spatially-distant synapses can’t collaboratively create a spike.
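To make the AND/OR point concrete, here’s a minimal sketch of both ways of computing spikes in this imaginary world (all names are mine, purely illustrative):

```python
# Caricatured model: each cluster of 10 synapses acts as an AND gate,
# and the clusters are OR'd together. Synapse states are booleans.

def dendritic_spike(clusters: list[list[bool]]) -> bool:
    """True iff every synapse in at least one cluster fired this timestep."""
    return any(all(cluster) for cluster in clusters)

# The MAC formulation: weight each synapse 1.0, sum everything, threshold at 10.
def mac_spike(synapses: list[bool]) -> bool:
    return sum(1.0 * s for s in synapses) >= 10

# Two clusters, each only half active: the AND/OR model says "no spike"...
clusters = [[True] * 5 + [False] * 5, [False] * 5 + [True] * 5]
print(dendritic_spike(clusters))  # False

# ...but a MAC over the flattened synapses wrongly sums the two spatially
# distant half-active clusters and reports a spike.
flat = [s for c in clusters for s in c]
print(mac_spike(flat))  # True
```

The disagreement on the last example is exactly the “too simple” failure mode: the MAC lets distant synapses collaborate when, in this world, they can’t.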
If you’re with me so far, that’s what I mean when I say that this model has “no MAC operations”.
And by the way, I think we could reformulate this same algorithm to have a very different low-level implementation (but the same input and output), by replacing “groups of neurons that form clusters together” with “serial numbers”. Then there would be no MACs and there would be no multi-synapse ANDs, but rather there would be various hash tables or something, I dunno. And the memory requirements would be different, as would the number of required operations, presumably.
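One possible reading of that “serial numbers” reformulation, as a hedged sketch (`CLUSTER_SIZE`, `spike_from_serials`, and the id scheme are all my inventions, not from the model):

```python
from collections import defaultdict

# Each synapse carries the serial number of its cluster. A spike check is
# then a hash-table count per serial number: no MACs, and no explicit
# multi-synapse AND gates.

CLUSTER_SIZE = 10

def spike_from_serials(fired_synapses: list[int], cluster_of: dict[int, int]) -> bool:
    """fired_synapses: ids of synapses that fired this timestep.
    cluster_of: maps synapse id -> cluster serial number."""
    counts = defaultdict(int)
    for syn in fired_synapses:
        counts[cluster_of[syn]] += 1
    # A spike happens iff some cluster saw all of its synapses fire.
    return any(c == CLUSTER_SIZE for c in counts.values())

# Example: synapses 0-9 belong to cluster 7, synapses 10-19 to cluster 8.
cluster_of = {i: 7 for i in range(10)} | {i: 8 for i in range(10, 20)}
print(spike_from_serials(list(range(10)), cluster_of))     # True
print(spike_from_serials(list(range(5, 15)), cluster_of))  # False
```

Same input-output behavior as the AND/OR version, but the memory layout and operation count are quite different, which is the point about low-level implementations being interchangeable.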
At this point maybe you’re going to reply “OK but that’s an imaginary world, whereas I want to talk about the real world.” Certainly the bullet points above are erasing real-world complexities. But it’s very difficult to judge which real-world complexities are actually playing an important role in brain algorithms and which aren’t. For example, should we treat (certain classes of) cortical synapses as having binary strength rather than smoothly-varying strength? That’s a longstanding controversy! Do neurons really form discrete and completely-noninteracting clusters on dendrites? I doubt it…but maybe the brain would work better if they did!! What about all the other things going on in the cortex? That’s a hard question. There are definitely other things going on unrelated to this particular model, but it’s controversial exactly what they are.
In this imaginary world, you would use AND (within each cluster of 10 synapses) and OR (between clusters) to calculate whether dendritic spikes happen or not. Agree?
Using MACs in this imaginary world is both too complicated and too simple. It’s too complicated because it’s a very wasteful way to calculate AND. It’s too simple because it’s wrong to MAC together spatially-distant synapses, when spatially-distant synapses can’t collaboratively create a spike.
If you’re with me so far, that’s what I mean when I say that this model has “no MAC operations”.
Sure. I was focused on what I thought was the minimum computationally relevant model: we model every effect that matters to whether a synapse ultimately fires or not, accurately enough to be within the noise threshold.
OK, if “we probably will find something way better”, do you think that the “way better” thing will also definitely need 68 TB of memory, and definitely not orders of magnitude less than 68 TB? If you think it definitely needs 68 TB of memory, no way around it, then what’s your basis for believing that? And how do you reconcile that belief with the fact that we can build deep learning models of various types that do all kinds of neat things like language modeling and motor control and speech synthesis and image recognition etc. but require ≈100-100,000× less than 68 TB of memory? How are you thinking about that?
I think I was just trying to fill in the rest of your cartoon model. No, we probably don’t need exactly that much memory, but I addressed your misconceptions about repeated algorithms applied across parallel inputs: you do need a copy in memory of every repeat. A 4090 will not cut it.
If you wanted a tighter model, you might ask “OK, how much of the brain is speech processing vs. vision and robotics control?” Then you can estimate how much bigger GPT-3 has to be to also run a robot with human-level dexterity, and see.
Right now GPT-3 is 175 billion parameters, or 350 GB in 16-bit. So you need something like 4-8 A100s or H100s to run it. I think above I said 48 cards to hit brain-level compute, and 960 if we end up needing a lot of extra memory.
That’s with cards that exist today.
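Those card counts can be sanity-checked with quick arithmetic. This sketch assumes 80 GB per A100/H100 (an assumption on my part), and counts only what’s needed to hold the weights, not activations or other overhead:

```python
import math

# Back-of-envelope check of the memory numbers above.
params = 175e9                      # GPT-3 parameter count
bytes_per_param = 2                 # 16-bit weights
weights_gb = params * bytes_per_param / 1e9
print(weights_gb)                   # 350.0 GB of weights

gb_per_card = 80                    # assumed A100/H100 memory per card
cards_for_weights = math.ceil(weights_gb / gb_per_card)
print(cards_for_weights)            # 5 cards just to hold GPT-3's weights

# 68 TB of weights at the same 80 GB per card (activations, caches, and
# other overhead would push the real count higher):
print(math.ceil(68_000 / gb_per_card))  # 850 cards
```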
So you can optimize this down a lot, but probably not to a 4090. One way to optimize is to have a cognitive architecture made of many specialized networks, and only load the ones relevant for the current task. So the AGI needs time to “context switch” by loading the set of networks it needs from storage to memory.
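A minimal sketch of that load-on-demand idea, assuming a simple least-recently-used eviction policy (`NetworkCache` and its semantics are hypothetical, not any real framework’s API):

```python
from collections import OrderedDict

# Keep only the networks needed for the current task resident in (GPU)
# memory, evicting the least recently used ones when space runs out.

class NetworkCache:
    def __init__(self, capacity_gb: float):
        self.capacity_gb = capacity_gb
        self.resident = OrderedDict()   # name -> size_gb, in LRU order

    def request(self, name: str, size_gb: float) -> str:
        """Ensure `name` is in memory; return 'hit' or 'miss' (load needed)."""
        if name in self.resident:
            self.resident.move_to_end(name)
            return "hit"
        # Evict least-recently-used networks until the new one fits.
        while self.resident and sum(self.resident.values()) + size_gb > self.capacity_gb:
            self.resident.popitem(last=False)
        self.resident[name] = size_gb   # the slow load from storage goes here
        return "miss"

cache = NetworkCache(capacity_gb=24)    # e.g. one 4090's worth of VRAM
print(cache.request("speech", 10))      # miss (cold start: load from storage)
print(cache.request("vision", 10))      # miss
print(cache.request("speech", 10))      # hit (already resident)
print(cache.request("motor", 10))       # miss (evicts "vision", the LRU entry)
```

Each “miss” is where the context-switch latency would show up: the network has to come in from storage before the task can run.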
Instant switching can also be done at a larger, datacenter level with a fairly obvious algorithm.