(I’m going to include text from this other comment of yours so that I can respond to thematically similar things in one spot.)
I think we have different perspectives on what counts as “training” in the case of human evolution. I think of human within-lifetime experiences as the training data, and I don’t include the evolutionary history in the training data.
Agreed that our perspectives differ. According to me, there are two different things going on (the self-organization of the brain during a lifetime and the natural selection over genes across lifetimes), both of which can be thought of as ‘training’.
Evolution then marginally updates the data labeling functions for the next generation. It is a fundamentally different type of thing than an individual deep learning training run.
It seems to me like if we’re looking at a particular deployed LLM, there are two main timescales, the long foundational training and then the contextual deployment. I think this looks a lot like having an evolutionary history of lots of ‘lifetimes’ which are relevant solely due to how they impact your process-of-interpreting-input, followed by your current lifetime in which you interpret some input.
That is, the ‘lifetime of learning’ that humans have corresponds to whatever’s going on inside the transformer as it reads thru a prompt context window. It probably includes some stuff like gradient descent, but not cleanly, and certainly not done by the outside operators.
What the outside operators can do is more like “run more generations”, often with deliberately chosen environments. [Of course SGD is different from genetic algorithms, and so the analogy isn’t precise; it’s much more like a single individual deciding how to evolve than a population selectively avoiding relatively bad approaches.]
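To make that bracketed contrast concrete, here's a toy sketch of the two update rules side by side (Python, with a made-up quadratic objective; none of this is anyone's actual training setup). SGD is one point nudged directly along the gradient; the genetic algorithm is a population under selection over random variation, with no gradient information at all.

```python
import numpy as np

rng = np.random.default_rng(0)
loss = lambda x: np.sum((x - 3.0) ** 2)   # invented toy objective, minimum at (3, 3)
grad = lambda x: 2.0 * (x - 3.0)

# SGD: a single "individual" deciding how to change, using the gradient.
x = rng.normal(size=2)
for _ in range(200):
    x -= 0.05 * grad(x)                   # tiny tweak, directly downhill

# Genetic algorithm: a population selectively avoiding relatively bad approaches.
pop = rng.normal(size=(50, 2))
for _ in range(200):
    fitness = np.array([loss(p) for p in pop])
    parents = pop[np.argsort(fitness)[:10]]            # keep the 10 best
    children = parents[rng.integers(0, 10, size=50)]   # reproduce them
    pop = children + 0.1 * rng.normal(size=(50, 2))    # mutate at random

print("SGD individual:", x)
print("GA best individual:", pop[np.argmin([loss(p) for p in pop])])
```

Both end up near the optimum here, but only the GA ever compares individuals against each other; SGD never needs a population.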
Why even try to make inferences from evolution at all? Why try to learn from the failures of a process that was:

- much stupider than us
- far more limited in the cognition-shaping tools available to it
- using a fundamentally different sort of approach (bi-level optimization over reward circuitry)
- compensating for these limitations using resources completely unavailable to us (running millions of generations and applying tiny tweaks over a long period of time)
- and not even dealing with an actual example of the phenomenon we want to understand!
For me, it’s this Bismarck quote:
Only a fool learns from his own mistakes. The wise man learns from the mistakes of others.
If it looks to me like evolution made the mistake that I am trying to avoid, I would like to figure out which features of what it did caused that mistake, and whether or not my approach has the same features.
And, like, I think basically all five of your bullet points apply to LLM training?
“much stupider than us” → check? The LLM training system is using a lot of intelligence, to be sure, but there are relevant subtasks within it that are being run at subhuman levels.
“far more limited in the cognition-shaping tools available to it” → I am tempted to check this one, in that the cognition-shaping tools we can deploy at our leisure are much less limited than the ones we can deploy while training on a corpus, but I do basically agree that evolution probably used fewer tools than we can, and certainly had less foresight than our tools do.
“using a fundamentally different sort of approach (bi-level optimization over reward circuitry)” → I think the operative bit of this is probably the bi-level optimization, which I think is still happening, and not the bit where it’s for reward circuitry. If you think the reward circuitry is the operative bit, that might be an interesting discussion to try to ground out on.
One note here is that I think we’re going to get a weird reversal of human evolution, where the ‘unsupervised learning via predictions’ moves from the lifetime to the history. But this probably makes things harder, because now you have facts-about-the-world mixed up with your rewards and computational pathway design and all that.
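For concreteness about what I mean by the bi-level structure still happening, here's a toy sketch (all the numbers and names, like `true_fitness` and `phi`, are invented for illustration): an outer selection loop tunes the parameters of a reward function, and the inner learner only ever sees that reward, never the outer objective.

```python
import numpy as np

rng = np.random.default_rng(0)

def true_fitness(theta):
    # The outer objective (think: reproductive fitness). The inner learner never sees it.
    return -(theta - 5.0) ** 2

def inner_learn(phi, steps=50, lr=0.1):
    # Inner loop, the 'lifetime': gradient ascent on a proxy reward
    # r(theta) = -(theta - phi)**2, whose peak is set by the reward circuitry phi.
    theta = 0.0
    for _ in range(steps):
        theta -= lr * 2.0 * (theta - phi)
    return theta

# Outer loop, the 'generations': selection over reward circuitry, scored by
# the true fitness of the agent that results from a lifetime of learning.
phis = rng.normal(size=20)
for _ in range(100):
    scores = np.array([true_fitness(inner_learn(p)) for p in phis])
    survivors = phis[np.argsort(scores)[-5:]]               # keep the 5 best
    phis = np.repeat(survivors, 4) + 0.1 * rng.normal(size=20)

print("evolved reward target:", phis.mean())  # drifts toward 5.0, the true-fitness peak
```

The point is just the structure: nothing in the inner loop mentions `true_fitness`, and nothing in the outer loop touches the proxy reward directly, which is the split the bullet is pointing at.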
“compensating for these limitations using resources completely unavailable to us (running millions of generations and applying tiny tweaks over a long period of time)” → check? Like, this seems like the primary reason why people are interested in deep learning over figuring out how things work and programming them the hard way. If you want to create something that can generate plausible text, training a randomly initialized net on <many> examples of next-token prediction is way easier than writing it all down yourself. Like, applying tiny tweaks over millions of generations is how gradient descent works!
“and not even dealing with an actual example of the phenomenon we want to understand!” → check? Like, prediction is the objective people used because it was easy to do in an unsupervised fashion, but they mostly don’t want to know how likely text strings are. They want to have conversations, they want to get answers, and so on.
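Both of those last two points show up even in the smallest version of the setup: a character-level bigram model trained by gradient descent on next-token prediction. (A toy sketch with a made-up corpus, not anyone's actual training code.) Every update is a tiny tweak, and the thing being optimized is string likelihood, not conversation quality.

```python
import numpy as np

rng = np.random.default_rng(0)

# Made-up corpus and vocabulary, standing in for a real training set.
text = "the cat sat on the mat and the cat ran "
vocab = sorted(set(text))
ix = {ch: i for i, ch in enumerate(vocab)}
data = np.array([ix[ch] for ch in text])
V = len(vocab)

# A bigram "language model": one logit matrix, trained purely to predict the next token.
W = np.zeros((V, V))
lr = 0.1
for _ in range(2000):                # many generations of tiny tweaks
    t = rng.integers(0, len(data) - 1)
    prev, nxt = data[t], data[t + 1]
    logits = W[prev]
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()
    grad = probs.copy()
    grad[nxt] -= 1.0                 # cross-entropy gradient for next-token prediction
    W[prev] -= lr * grad             # each step is a tiny tweak toward the data

# What we get is a machine for scoring how likely strings are...
print("most likely character after 't':", repr(vocab[int(np.argmax(W[ix["t"]]))]))
# ...which was never the thing anyone actually wanted from it.
```

Run long enough, `W` just converges to the corpus's empirical next-character frequencies, i.e. to "how likely is this string", and nothing more.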
I also think those five criticisms apply to human learning, tho it’s less obvious / might have to be limited to some parts of the human learning process. (For example, the control system that's determining which of my nerves to prune because of disuse seems much stupider than I am, but is only one component of learning.)
there’s also the issue that the failure mode behind “humans liking ice cream” is just fundamentally not worth much thought from an alignment perspective.
I agree with this section (given the premise); it does seem right that “the agent does the thing it’s rewarded to do” is not an inner alignment problem.
It does seem like an outer alignment problem (yes I get that you probably don’t like that framing). Like, I by default expect people to directly train the system to do things that they don’t want the system to do, because they went for simplicity instead of correctness when setting up training, or to try to build systems with general capabilities (which means the ability to do any particular X) while not realizing that they need to simultaneously suppress the undesired capabilities.