EDIT: I updated the circuits section of the article with an improved model of Serial vs Parallel vs Neuromorphic (PIM) scalability, which better illustrates how serial computation doesn’t scale.
Yes, you bring up a good point, and one I should have discussed in more detail (but the article is already pretty long). However, the article does provide part of the framework to answer this question.
There definitely are serial/parallel tradeoffs where the parallel version of an algorithm tends to use marginally more compute asymptotically. However, these simple big-O asymptotic models do not consider the fundamental costs of wire energy transit for remote memory accesses, which actually scale as M^(1/2) for 2D memory. So in that sense the simple big-O models are asymptotically wrong. If you use the correct, more detailed models which account for the actual wire energy costs, everything changes, and the parallel versions leveraging distributed local memory, and thus avoiding wire energy transit, are generally more energy efficient, albeit through a more memory-heavy algorithmic approach.
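To make that concrete, here is a toy sketch of the two cost models (all constants and access counts are made up for illustration, not measured values): the serial machine pays wire transit energy growing as sqrt(M) per remote access to a 2D memory of M cells, while the parallel machine with distributed local memory pays a small constant per access, even if it does a constant factor more total work.

```python
import math

def serial_energy(M, accesses, e_wire=1.0):
    # One serial processor beside a 2D memory of M cells: average wire
    # distance to a cell (and hence transit energy per access) grows
    # as sqrt(M).
    return accesses * e_wire * math.sqrt(M)

def parallel_energy(M, accesses, overhead=2.0, e_local=0.01):
    # Parallel version: assume it does `overhead` times more total
    # accesses (the big-O penalty), but each access is local and cheap.
    return accesses * overhead * e_local

# As memory grows, the sqrt(M) wire cost dominates any constant-factor
# compute overhead of the parallel version.
for M in (1e4, 1e8, 1e12):
    s = serial_energy(M, accesses=1e6)
    p = parallel_energy(M, accesses=1e6)
    print(f"M={M:.0e}: serial/parallel energy ratio = {s / p:.1e}")
```

The point is only the asymptotics: the serial/parallel ratio keeps growing with M no matter what constants you pick.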
Another way of looking at it is to compare serial-optimized VN processors (CPUs) vs parallel-optimized VN processors (GPUs), vs parallel processor-in-memory (brains, neuromorphic).
Pure serial CPUs (ignoring parallel/vector instructions) with tens of billions of transistors have only on the order of a few dozen cores, yet not much higher clock rates than GPUs, despite using all that die space for a marginal serial speed increase: serial speed scales extremely poorly with transistor density (end of Dennard scaling, etc.). A GPU with tens of billions of transistors instead has tens of thousands of ALU cores, but is still ultimately limited by the very poor scaling of off-chip RAM bandwidth, proportional to N^0.5 (where N is device area), and wire energy that doesn’t scale at all. The neuromorphic/PIM machine has perfect memory bandwidth scaling at a 1:1 ratio: it can access all of its RAM per clock cycle, pays near zero energy to access RAM (as memory and compute are unified), and everything scales linearly with N.
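A minimal sketch of that scaling comparison (purely illustrative units, not real hardware numbers): compute scales roughly linearly with device area N for both designs, but the GPU's off-chip bandwidth only scales as N^0.5, so its bytes-per-flop ratio collapses as N grows, while PIM stays at 1:1.

```python
# Toy scaling model: N = device area in arbitrary units.

def compute_throughput(N):
    return N            # ALU count scales with area for both designs

def gpu_mem_bandwidth(N):
    return N ** 0.5     # off-chip I/O limited by chip edge/perimeter

def pim_mem_bandwidth(N):
    return N            # memory co-located with compute: 1:1 with ALUs

for N in (1e2, 1e6, 1e10):
    gpu_ratio = gpu_mem_bandwidth(N) / compute_throughput(N)
    pim_ratio = pim_mem_bandwidth(N) / compute_throughput(N)
    print(f"N={N:.0e}: GPU bytes/flop ~ {gpu_ratio:.1e}, PIM bytes/flop ~ {pim_ratio:.0f}")
```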
Physics is fundamentally parallel, not serial, so the latter just doesn’t scale.
But of course on top of all that there is latency/delay. For example, the brain is also strongly optimized for minimal depth to minimize delay, and to some extent that may compete with optimizing for energy. Ironically, delay is also a problem in GPU ANNs (a huge problem for Tesla’s self-driving cars, for example), because GPUs need to operate on huge batches to amortize their very limited/expensive memory bandwidth.
Yeah, latency / depth is the main thing I was thinking of.
If my boss says “You must calculate sin(x) in 2 clock cycles”, I would have no choice but to waste a ton of memory on a giant lookup table. (Maybe “2” is the wrong number of clock cycles here, but you get the idea.) If I’m allowed 10 clock cycles, maybe I can reduce x mod 2π first, and thus use a much smaller lookup table, thus wasting a lot less memory. If I’m allowed 200 clock cycles to calculate sin(x), I can use C code that has no lookup table at all, and thus roughly zero memory and communications. (EDIT: Oops, LOL, the C code I linked uses a lookup table. I could have linked this one instead.)
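The three points on that memory/depth tradeoff can be sketched in Python (table resolution and input range are arbitrary choices for illustration; a real 2-cycle hardware path obviously wouldn't be Python):

```python
import math

RESOLUTION = 1e-3          # table step; ~1e-3 worst-case error

# Strategy 1: direct lookup over a wide input domain [0, 100).
# Fewest steps, most memory (~100,000 entries at this resolution).
WIDE_RANGE = 100.0
wide_lut = [math.sin(i * RESOLUTION) for i in range(int(WIDE_RANGE / RESOLUTION))]

def sin_wide_lut(x):
    return wide_lut[int(x / RESOLUTION)]

# Strategy 2: spend a few cycles on range reduction to [0, 2*pi),
# then use a much smaller table (~6,300 entries).
small_lut = [math.sin(i * RESOLUTION) for i in range(int(2 * math.pi / RESOLUTION) + 1)]

def sin_reduced_lut(x):
    x = math.fmod(x, 2 * math.pi)
    return small_lut[int(x / RESOLUTION)]

# Strategy 3: many serial steps, essentially zero memory: range-reduce,
# then sum the Taylor series term by term.
def sin_series(x, terms=15):
    x = math.fmod(x, 2 * math.pi)
    total, term = 0.0, x
    for n in range(terms):
        total += term
        term *= -x * x / ((2 * n + 2) * (2 * n + 3))
    return total
```

Same function, three implementations: as the allowed serial depth grows, the memory footprint shrinks from ~10^5 table entries to ~10^3 to roughly zero.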
So I still feel like I don’t want to take it for granted that there’s a certain amount of “algorithmic work” that needs to be done for “intelligence”, and that amount of “work” is similar to what the human brain uses. I feel like there might be potential algorithmic strategies out there that are just out of the question for the human brain, because of serial depth. (Among other reasons.)
Also, it’s not all-or-nothing: I can imagine an AGI that involves a big parallel processor, and a small fast serial coprocessor. Maybe there are little pieces of the algorithm that would massively benefit from serialization, and the brain is bottlenecked in capability (or wastes memory / resources) by the need to find workarounds for those pieces. Or maybe not, who knows.