Here’s something that just came to mind: simulating a human brain is probably very parallelizable, since it has a huge number of neurons, and each neuron fires at most a couple hundred times per second. So if you have a problem which is difficult but can still be solved by one person, it’s probably more efficient to give it to one person running at 1000x speedup than to 1000 people at 1x speed, who have to pay fixed costs to understand the problem and communication costs to split it up. And as computers get faster, the arithmetic keeps working: one em at 1,000,000x is better than a thousand ems at 1000x each. So it seems possible that the most efficient population of ems will be quite small, one or a handful of people per data center. It’s true that as knowledge grows, more ems are needed to understand it all; but von Neumann was a living example that one person can understand quite a lot of things, and knowledge aids like Wikipedia will certainly be much cheaper to run than ems.
In the slightly longer term, I expect our handful of ems to come up with enough self-improvement tech, like bigger working memory or just adding more neurons, that a small population can continue to be optimal. There’s no point paying the “fixed costs of being human” (ancestral circuitry) for billions or trillions of less efficient ems if a smaller number of improved ems gives a better benefit-to-cost ratio.
So in short, it seems to me that the world with lots and lots of ems will simply never arrive, and the whole “duplicator” concept is a bit of a red herring. Instead we should imagine a world with a much smaller number of “Wise Ones”: human-derived entities with godlike speed and understanding. They will probably be quite happy with their lot in life, not miserable and exploited. And since they’ll have an obvious incentive to improve coordination among themselves as well, that likely leads to the usual singleton scenario.
I don’t know if this argument is new; I welcome being shown wrong.
This seems to ignore all of the inefficiencies in parallelization.
Processors run less efficiently the faster you run them (this is the entire reason for ‘underclocking’), so running 1 em’s worth of hardware 1000x faster will cost you >>1000x. (IIRC, Hanson has a lot of discussion of this point in Age of Em: the cost of speed will result in tiers, with some ems running at the fastest possible frequencies but only for a tiny handful of tasks which justify the cost, somewhat analogous to high-frequency trading vs. most computing tasks today—HFT firms may need 1 millisecond less latency and will pay for a continent-wide system of microwave towers and commission custom FPGAs/ASICs, but you sure don’t!)
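A rough way to see why speed costs superlinearly: dynamic power in CMOS scales roughly as capacitance × voltage² × frequency, and hitting higher frequencies requires higher voltage, giving the common P ∝ f³ rule of thumb. The exponent here is an assumption for illustration; real curves vary by process node.

```python
# Toy illustration of superlinear speed costs, assuming the common
# CMOS rule of thumb P ~ f^3 (dynamic power ~ C * V^2 * f, with the
# required voltage V growing roughly linearly in frequency f).
# The exponent 3 is an assumption; real hardware curves differ.

def relative_power(speedup, exponent=3.0):
    """Power draw of one chip at `speedup`x clock, relative to 1x."""
    return speedup ** exponent

def energy_per_task(speedup, exponent=3.0):
    """Energy for a fixed task: power * time, where time shrinks as 1/speedup."""
    return relative_power(speedup, exponent) / speedup

for s in [1, 2, 10, 1000]:
    print(f"{s:>5}x speed: {relative_power(s):.0e}x power, "
          f"{energy_per_task(s):.0e}x energy per task")
```

Under these made-up constants, a 1000x-faster serial em costs on the order of a million times more energy per unit of work than a 1x em, which is the “>>1000x” point above.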
There’s also Amdahl’s law: anything you do in parallel with n processors can be done serially in n× the time with zero penalty, but the reverse is not at all true—many tasks just can’t be parallelized, or have only a few parts which can be, and parallelization usually comes with at least some overhead (and this is in addition to the penalty you pay for running processors faster).
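Amdahl’s bound fits in a few lines; here p, the parallelizable fraction of the work, is an assumed parameter:

```python
def amdahl_speedup(p, n):
    """Maximum speedup with n processors when a fraction p of the
    work can be parallelized and (1 - p) must stay serial (Amdahl's law)."""
    return 1.0 / ((1.0 - p) + p / n)

# Even with 95% of the work parallelizable, 1000 processors give
# less than a 20x speedup; the serial 5% dominates.
print(amdahl_speedup(0.95, 1000))
```

This is the asymmetry in question: serial hardware can always emulate parallel work, but adding processors only helps up to the limit set by the serial fraction.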
If there are fixed costs, it would make more sense to do something like run 1 em on a premium processor, and then fork it as soon as possible onto a bunch of slow efficient processors to amortize the fixed cost; you wouldn’t fork out for a crazy super-exotic (Cray?) 1000x-faster processor to do it all in one place.
To Amdahl’s law—I think simulating a brain won’t have any big serial bottlenecks. Split up by physical locality, each machine simulates a little cube of neurons and talks to machines simulating the six adjacent cubes. You can probably split one em into a million machines and get like a 500K times speedup or something. Heck, maybe even more than a million times, because each machine has better memory locality. If your intuition is different, can you explain?
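The cube scheme described above is essentially domain decomposition with boundary (“halo”) exchange, as used in physics simulations. A minimal sketch of one update step for one cube; the state layout, the six-neighbor halos, and the averaging “physics” are all invented placeholders, not a real neural model:

```python
# Minimal sketch of domain decomposition with halo exchange.
# Each machine owns one cube of tissue; per step it only needs its
# own interior state plus the boundary state of its six neighbors.

def step_cube(interior, halos):
    """Advance one cube by one small time step.

    `interior` is this cube's own state; `halos` is the list of
    boundary states received from the six adjacent cubes. The update
    rule here is just an averaging placeholder.
    """
    total = sum(sum(h) for h in halos)
    count = max(1, sum(len(h) for h in halos))
    neighbor_input = total / count
    return [0.9 * x + 0.1 * neighbor_input for x in interior]

# One cube with six neighbors; in a real run each cube lives on its
# own machine and the halos arrive over the network.
interior = [1.0, 2.0, 3.0]
halos = [[0.5], [0.5], [0.5], [0.5], [0.5], [0.5]]
interior = step_cube(interior, halos)
```

The point of the design is that each step touches only local data plus a thin boundary layer, which is why the intuition says communication shouldn’t serialize the whole simulation.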
To overclocking—it seems you’re saying parallelization depends on it somehow? I didn’t really understand this part.
A brain has serial bottlenecks in the form of all the communication between neurons, in the same way you can’t simply shard GPT-3-175b onto 175 billion processors to make it run 175 billion times faster. Each compute element is going to be stuck waiting on communication with the adjacent neurons. At some point you have roughly 1 compute node per neuron (this is the sort of hardware you’d expect ems to run on: brain-sized neuromorphic hardware, efficiently implementing something like spiking neurons), and almost all the time is spent idle waiting for inputs/outputs. At that point, you have saturated your available parallelism and Amdahl’s law rules. Then there’s no easy way to apply more parallelism: if you have some big chunks of brain which don’t need to communicate much and so can be parallelized for performance gains… then you just have multiple brains.
To overclocking—it seems you’re saying parallelization depends on it somehow? I didn’t really understand this part.
Increasing clock speed has superlinear costs.
At that point, you have saturated your available parallelism and Amdahl’s law rules. [...] Then you just have multiple brains.
I think the point (or in any case my takeaway) is that this might be the Giant Cheesecake Fallacy. Initially, there isn’t enough hardware for running just a single em on the whole cluster to be wasteful, so that’s what happens instead of running more ems slower, since serial work is more valuable. By the time you run into the limits of how far one em can be parallelized, the sped-up ems have long since invented a process for making their brains bigger, making use of more nodes and preserving the regime where only a few ems run on most of the hardware. This is more about the personal identity of the ems than about computing architecture: a way of “making brains bigger” may well look like “multiple brains”, but they are the brains of a single em, not multiple ems or multiple instances of an em.
My point is, the whole “age of em” might well come and go in the following regime: many neurons per processor, many processors per em, few ems per data center. In this regime, adding more processors to an em speeds up its subjective time almost linearly. You may ask, how can “few ems per data center” stay true? First, today’s data centers have something like 100K processors, while one em has 100B neurons and far more synapses, so adding processors will make sense for quite a while. Second, it won’t take much subjective time for a handful of von Neumann-smart ems to figure out how to scale themselves to more neurons per em, allowing “few, smarter ems per data center” to go on longer, which then leads smoothly into the post-em regime.
Also, your mentions of clock speed are still puzzling to me. My whole argument still works if there’s only ever one type of processor with one clock speed fixed in stone.
First, today’s data centers have something like 100K processors, while one em has 100B neurons and far more synapses, so adding processors will make sense for quite a while.
Today’s data centers are completely incapable of running whole brains. We’re discussing extremely hypothetical hardware here, so what today’s data centers do is at best a loose analogy. The closest things we have today are GPUs and neuromorphic hardware designed to implement neurons at the hardware level. GPUs are already a big pain to run efficiently in clusters, because imperfect parallelization means that communication between nodes is a major bottleneck, and communication within GPUs between layers is a bottleneck as well. And neuromorphic hardware (or something like Cerebras) shows that you can create a lot of neurons at the hardware level; it’s not an area I follow in any particular detail, but for example, Intel’s Loihi chip implements 1,024 individual “spiking neural units” per core, with 128 cores per chip, and they combine them in racks maxing out at 768 chips for a total of ~100 million hardware neurons—so we are already far beyond any ’100k processors’ in terms of total compute elements. I suppose we could wind up having relatively few but very powerful serial compute elements for the first em, but given how strong the pressures have been to go as parallel as possible as soon as possible, I don’t see much reason to expect a ‘serial overhang’.
Okay, yeah, I had no idea that this much parallelism already existed. There could still be a reason for a serial overhang (serial algorithms have more clever optimizations open to them, and neuron firing could be quite sparse at any given moment), but I’m no longer sure things will play out this way.
You seem to be talking about a compute-dominated process, with almost perfect data locality. I suspect that brain emulation may be almost entirely communication-dominated with poor locality and (comparatively) very little compute. Most neurons in the brain have a great many synapses, and the graph of connections has relatively small diameter.
So emulating any substantial part of a human brain may well need data from most of the brain every “tick”. Suppose emulating a brain in real time takes 10 units per second of compute, and 1 unit per second of data bandwidth (in convenient units where a compute node has 10 units per second of each). So a single node is bottlenecked on compute and can only run at real time.
To achieve 2x speed you can run on two nodes to get 20 units per second of compute capability, but your data bandwidth requirement is now 4 units/second: both nodes need full access to the data, and they need to get it in half the time. After 3x speed-up, there is no more benefit to adding nodes: they all hit their I/O capacity, and adding more will just slow them all down, since every node needs to access every other node’s data every tick.
This is even making the generous assumption that links between nodes have the same capacity and no more latency or coordination issues than a single node accessing its own local data.
I’ve obviously just made up numbers to demonstrate scaling problems in an easy way here. The real numbers will depend upon things we still don’t know about brain architecture, and on future technology. The principle remains the same, though: different resource requirements scale in different ways, which yields a “most efficient” speed for given resource constraints, and it likely won’t be at all cost-effective to vary from that by an order of magnitude in either direction.
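The made-up numbers above can be put in a few lines. The key assumption (mine, reading the comment’s arithmetic) is that total bandwidth demand grows as speedup², while the interconnect supplies a fixed 10 units/second:

```python
# Toy model from the discussion above: real-time emulation needs
# 10 units/s of compute and 1 unit/s of bandwidth; each node supplies
# 10 units/s of compute, and the shared interconnect supplies
# 10 units/s of bandwidth. Every node needs the full brain state each
# tick, so total bandwidth demand grows as speedup^2 (s nodes, each
# pulling the full data s times faster). All numbers are illustrative.

BANDWIDTH_CAPACITY = 10

def feasible(speedup):
    # Compute needs 10 * speedup units/s, met by `speedup` nodes
    # of 10 units/s each; only bandwidth can become the binding limit.
    bandwidth_demand = speedup ** 2
    return bandwidth_demand <= BANDWIDTH_CAPACITY

max_speed = max(s for s in range(1, 100) if feasible(s))
print(max_speed)  # -> 3
```

Under these constants the 2x configuration demands 4 units/second, matching the text, and anything past 3x exceeds the interconnect, which is where adding nodes stops paying.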
Yeah, maybe my intuition was pointing a different way: the brain is a physical object, physics is local, and the particular physics governing the brain seems to be very local (signals travel at tens of meters per second). Signals from one part of the brain to another have to cross the intervening space. So if we divide the brain into thousands of little cubes, each one only needs to be connected to its six neighbors, while having plenty of interesting stuff going on inside—rewiring and so on.
Edit: maybe another aspect of my intuition is that “tick” isn’t really a thing. Each little cube gets a constant stream of incoming activations, at time resolution much higher than typical firing time of one neuron, and generates a corresponding outgoing stream. Generating the outgoing stream requires simulating everything in the cube (at similar high time resolution), and doesn’t need any other information from the rest of the brain, except the incoming stream.
Thanks, making use of the relatively low propagation speed hadn’t occurred to me.
That would indeed reduce the scaling of data bandwidth significantly. It would still exist, just not quite as severely. Area-versus-volume scaling still means that bandwidth comes to dominate compute as speeds increase (with the volume emulated per node decreasing), just not quite as rapidly.
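The area-versus-volume point can be made concrete: halving the side of each emulated cube cuts its compute (volume) by 8x but its boundary traffic (surface area) by only 4x, so the bandwidth-to-compute ratio doubles. A sketch under the simplest assumptions (uniform neuron density, boundary traffic proportional to cube surface, constants of proportionality omitted):

```python
def bandwidth_to_compute_ratio(side):
    """For a cube of emulated tissue with the given side length,
    compute scales with volume (side^3) and boundary traffic with
    surface area (6 * side^2), so the ratio grows as 1/side as
    cubes shrink."""
    volume = side ** 3
    surface = 6 * side ** 2
    return surface / volume

# Shrinking cubes (i.e. spreading one brain over more nodes for more
# speed) makes each node progressively more bandwidth-bound.
for side in [8.0, 4.0, 2.0, 1.0]:
    print(side, bandwidth_to_compute_ratio(side))
```

This is why locality softens the bandwidth wall but doesn’t remove it: each doubling of speed still shifts the per-node balance further toward communication.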
I didn’t mean “tick” as a literal physical thing that happens in brains, just a term for whatever time scale governs the emulation updates.