I’m not sure the complexity of a human brain is necessarily bounded by the size of the human genome. Instead of interpreting DNA as containing the full description, I think treating it as the seed of a procedurally generated organism may be more accurate. You can’t reconstruct an organism from DNA without an algorithm for interpreting it. Such an algorithm contains more complexity than the DNA itself; the protein folding problem is just one piece of it.
“Procedural generation” can’t create useful design information from thin air. For example, Minecraft worlds are procedurally generated with a seed. If I have in mind some useful configuration of Minecraft stuff that takes 100 bits to specify, then I probably need to search through 2^100 different seeds on average, or thereabouts, before I find one with that specific configuration at a particular pre-specified coordinate.
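A toy sketch of that counting argument, with a hash standing in for Minecraft’s world generator (the generate_chunk function and the 10-bit target are made up purely for illustration):

```python
import hashlib

def generate_chunk(seed: int, coord: tuple, k: int) -> int:
    """Hypothetical toy world generator: deterministically maps (seed, coord) to k bits."""
    h = hashlib.sha256(f"{seed}:{coord}".encode()).digest()
    return int.from_bytes(h, "big") >> (256 - k)

K = 10                  # a 10-bit "design spec", small enough that the search finishes quickly
TARGET = 0b1011011101   # the configuration we want at coordinate (0, 0)

seed = 0
while generate_chunk(seed, (0, 0), K) != TARGET:
    seed += 1

print(f"found seed {seed} after {seed + 1} tries (expected about 2^{K} = {2 ** K})")
```

On average the loop runs about 2^10 ≈ 1,000 times; bumping K to 100 would make it astronomically long, no matter how elaborate the generator itself is.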
The thing is: the map from seeds to outputs (Minecraft worlds) might be complicated, but it’s not complicated in a way that generates useful design information from thin air.
By the same token, the map from DNA to folded proteins is rather complicated to simulate on a computer, but it’s not complicated in a way that generates useful design information from thin air. Random DNA creates random proteins. These random proteins fold in a hard-to-simulate way, as always, but the end-result configuration is useless. Thus, the design information all has to be in the DNA. The more specific you are about what such-and-such protein ought to do, the more possible DNA configurations you need to search through before you find one that encodes a protein with that property. The complexity of protein folding doesn’t change that—it just makes it so that the “right” DNA in the search space is obfuscated. You still need a big search space commensurate with the design specificity.
By contrast, here’s a kernel of truth adjacent to your comment: It is certainly possible for DNA to build a within-lifetime learning algorithm, and then for that within-lifetime learning algorithm to wind up (after months or years or decades) containing much more useful information than was in the DNA. By analogy, it’s very common for an ML source code repository to contain much less information in its code than will eventually be stored in the weights of the trained model built by that code. (The latter can be in the terabytes.) Same idea.
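A minimal sketch of that asymmetry at toy scale, with numpy least-squares standing in for “training” (the file name and sizes are illustrative only):

```python
import os
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(10_000, 1_000))                      # stand-in for a training set
y = X @ rng.normal(size=1_000) + 0.1 * rng.normal(size=10_000)

# "Training": distill the dataset into a weight vector.
w, *_ = np.linalg.lstsq(X, y, rcond=None)
np.save("weights.npy", w)

# Run as a script: the saved weights are already several times larger than the
# code that produced them, and the gap grows with model size. The extra
# information in the weights came from the data, not from the code.
print("script bytes: ", os.path.getsize(__file__))
print("weights bytes:", os.path.getsize("weights.npy"))
```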
Unlike protein folding, running a within-lifetime learning algorithm does generate new useful information. That’s the whole point of such algorithms.
Yes. But a specific Minecraft world (if we ignore the fact that it’s pseudorandom) can be more complicated than the Minecraft program itself.
A fixed genome can develop into many different potential people, depending on both life experiences and neuro-developmental RNG.
And in some sense “useful complexity” is a self-contradictory concept. If the goal is simple, then a brute-force search set to maximize the goal is a simple program. Sure, the result may look very “complicated”, but it has low Kolmogorov complexity.
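For instance, here is a sketch of that point: the short program below fully determines its output, so the output’s Kolmogorov complexity is bounded by roughly the program’s length plus a constant, however “random” the winning hash looks.

```python
import hashlib
from itertools import count

def leading_zero_bits(n: int) -> int:
    """Simple goal: number of leading zero bits in SHA-256 of str(n)."""
    digest = hashlib.sha256(str(n).encode()).digest()
    return 256 - int.from_bytes(digest, "big").bit_length()

# Brute-force search: first input whose hash starts with at least 20 zero bits
# (roughly a million hashes, a few seconds).
winner = next(n for n in count() if leading_zero_bits(n) >= 20)
print(winner, hashlib.sha256(str(winner).encode()).hexdigest())
```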
The OP mentioned non-DNA sources of information briefly, but I still feel like they’re not being given enough weight.
In order to fully define e.g. a human, you need to specify:
The DNA
A full specification of the egg where the DNA will start its life
A full specification of the womb in which the egg will grow into a human
If you gave a piece of DNA to an alien and didn’t tell them how to interpret it, then they’d have no way of building a human. You’d need to give them a whole lot of other information too.
Even comparing the DNA of different organisms, each organism’s DNA expects to be interpreted differently (as opposed to source code, which is mostly meant to be interpreted by the same OS/hardware as other source code). If you put a lizard’s DNA into a human’s egg and womb, I’m guessing that would not successfully build a lizard.
So I guess my question is: to what extent should the complexity of the interpreter be included in the complexity of the thing-being-interpreted? In one sense I feel like Word’s code does fully specify Word amongst all other possible software, but in another sense (one that counts the interpreter) I feel like it does not.
If a spy slips a piece of paper to his handler, and then the counter-espionage officer arrests them and gets the piece of paper, and the piece of paper just says “85”, then I don’t know wtf that means, but I do learn something like “the spy is not communicating all that much information that his superiors don’t already know”.
By the same token, if you say that humans have 25,000 genes (or whatever), that says something important about how many specific things the genome designed in the human brain and body. For example, there’s something in the brain that says “if I’m malnourished, then reduce the rate of the (highly-energy-consuming) nonshivering thermogenesis process”. It’s a specific innate (not learned) connection between two specific neuron groups in different parts of the brain, I think one in the arcuate nucleus of the hypothalamus, the other in the periaqueductal gray of the brainstem (two of many hundreds or low-thousands of little idiosyncratic cell groups in the hypothalamus and brainstem). There’s nothing in the central dogma of molecular biology, and there’s nothing in the chemical nature of proteins, that makes this particular connection especially prone to occurring, compared to the huge number of superficially-similar connections that would be maladaptive (“if I’m malnourished, then get goosebumps” or whatever). So this connection must be occupying some number of bits of DNA—perhaps not a whole dedicated protein, but perhaps some part of some protein, or whatever. And there can only be so many of that type of thing, given a mere 25,000 genes for the whole body and everything in it.
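As a back-of-envelope version of that budget argument (the genome length is the standard ~3.1 billion base pairs; the functional fraction and the bits-per-rule figure are rough guesses purely for illustration):

```python
genome_bp = 3.1e9            # human genome, base pairs
raw_bits = 2 * genome_bp     # 2 bits per base, ignoring compressibility
functional_fraction = 0.1    # assumption: ~10% of the genome is functional
bits_per_innate_rule = 100   # assumption: ~100 bits to pin down one innate wiring rule

budget_bits = raw_bits * functional_fraction
print(f"functional budget ≈ {budget_bits:.1e} bits")
print(f"innate rules, max ≈ {budget_bits / bits_per_innate_rule:.0e}")
# ≈ 6.2e8 bits and ≈ 6e6 rules at most; the true number of brain-wiring rules
# must be far smaller, since most of that budget covers everything else in the body.
```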
That’s an important thing that you can learn from the size of the genome. We can learn it without expecting aliens to be able to decode DNA or anything like that. And Archimedes’s comment above doesn’t undermine it—it’s a conclusion that’s robust to the “procedural generation” complexities of how the embryonic development process unfolds.
This. Though I don’t think the interpretation algorithm is the source of most of the specification bits here.
To make an analogy with artificial neural networks, the human genome needs to contain a specification of the architecture, the training signal and update algorithm, and some basic circuitry that has to work from the start, like breathing. Everything else can be learned.
I think the point maybe holds up slightly better for non-brain animal parts, but there’s still a difference between storing a blueprint for what proteins cells are supposed to make and when, and storing the complete body plan of the resulting adult organism. The latter seems like a closer match to a Microsoft Word file.
If you took the adult body plans of lots of butterflies and separated the information in an adult butterfly body plan into the bits common to all of the butterflies and the bits specifying the exact way things happened to grow in this particular butterfly, then the former is more or less[1] what would need to fit into the butterfly genome, not the former plus the latter.
EDIT: Actually, maybe that’d be overcounting what the genome needs to store as well. How individual butterfly bodies grow might be determined by the environment, meaning some of their complexity would actually be specified by the environment, just as in the case of adult butterfly brains. Since this could be highly systematic (the relevant parts of the environment are nigh-identical for all butterflies), those bits would not be captured in our sample of butterfly variation.
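A loose illustration of that shared-bits vs. individual-bits decomposition, with zlib standing in for an idealized compressor and made-up numbers (a 10 KB “species template” plus 1% per-individual noise):

```python
import random
import zlib

random.seed(0)
TEMPLATE = bytes(random.randrange(256) for _ in range(10_000))   # "species-level" bits

def grow_butterfly() -> bytes:
    """One individual: the shared template plus idiosyncratic developmental noise."""
    return bytes(b ^ random.randrange(256) if random.random() < 0.01 else b
                 for b in TEMPLATE)

individuals = [grow_butterfly() for _ in range(20)]
separate = sum(len(zlib.compress(ind, 9)) for ind in individuals)   # store each in full
joint = len(zlib.compress(b"".join(individuals), 9))                # shared bits stored once
print(f"separate: {separate} bytes, joint: {joint} bytes")
```

The joint archive comes out several times smaller, because the template only has to be stored once; analogously, the genome only needs the shared part, not each individual’s developmental noise.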
Up to the bits of genome description length that vary between individual butterflies, which I’d guess would be small compared to both the bits specifying the butterfly species and the bits specifying details of the procedural generation outcome in individual butterflies?
Yeah. I think the part of the DNA specifying the brain is comparable to something like the training algorithm + initial weights of an LLM. I don’t know how much space those would take if compressed, but probably very little, with the resulting model being much bigger than that. (And the model is in turn much smaller than the set of training data that went into creating it.)
Pages 79-80 of the Whole Brain Emulation roadmap give estimated storage requirements for uploading a human brain. The estimate depends on the scale at which the brain needs to be emulated. Workshop consensus at the time was that the most likely scale would be levels 4-6 (see pp. 13-14), which would put the storage requirements somewhere between 8,000 and 1 million terabytes.
The laws of physics are quite compact, and presumably most of the complexity in a zygote is in the DNA.