The thing I’m skeptical of is maintaining non-ruthless behavior in the presence of arbitrary amounts of open-ended continual learning. By “open-ended continual learning”, I mean something analogous to what humans did between 30000 BC and today, e.g. inventing new fields, and then still more new fields that build on those new fields, etc. And the AI has to do that without any human input, given enough time.
My actual belief is that this kind of open-ended continual learning is simply impossible in LLMs. If I’m wrong about that, then I would next claim that it requires continually updating the LLM weights (not just context window). I think it’s well known that LLMs struggle with large amounts of interrelated complexity in the training data, when none of that complexity, or anything remotely related to it, was in the pretraining data. “Open-ended continual learning purely through the context window” would be a very extreme version of that—centuries of knowledge, entire new fields and ways of thinking, entirely absent from pretraining and exclusively in the context window. No way is that going to work.
OK, so far I’ve argued that if this kind of continual learning is possible at all, it would require continual weight updates to lock in the new knowledge and ideas that the LLM generates—and not just one-time small updates, but more and more updates as the process continues, asymptoting to 100% of the training data.
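A toy calculation of why the pretraining share asymptotes toward zero (all numbers here are invented for illustration): if pretraining contributed a fixed amount of data while continual learning keeps adding more indefinitely, the pretraining fraction of the total training signal shrinks without bound.

```python
def pretraining_share(pretrain_tokens, tokens_per_step, steps):
    """Fraction of all training data so far that came from pretraining."""
    total = pretrain_tokens + tokens_per_step * steps
    return pretrain_tokens / total

# Hypothetical scale: 10T pretraining tokens, 100B new tokens per CL "step".
for steps in (0, 100, 1_000, 10_000):
    share = pretraining_share(10e12, 100e9, steps)
    print(f"{steps:>6} steps: pretraining share = {share:.3f}")
# prints 1.000, 0.500, 0.091, 0.010 -- heading toward zero
```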
If you buy all that, how do you think these weight updates will work? Where do you think the “training data” for those updates will come from?
Or if you don’t buy that, how do you think the continual learning will work?
My experience is that lots of LLM-focused people say “open-ended continual learning will be solved somehow, I guess”, without thinking too hard about exactly how it gets solved. And that’s how the pea gets hidden under the thimble. Because actually, I claim, continual learning needs some kind of ground truth or else it will go off the rails; that ground truth basically amounts to an objective function; and when the LLM continual-learns enough from that ground truth, all the niceness of pretraining gets diluted away in favor of the ruthless maximization of that objective function.
Again, maybe you have some specific idea about how LLM open-ended continual learning would work that you think won’t have this problem? If so, what is it?
I agree open-ended continual learning (CL) is probably a big deal. I have been thinking and writing about CL a bunch recently, but tbh I don’t think I’m near the end of clarifying all my thoughts on it. (Still hope to publish a sequence on it with some collaborators soon though.)
> I think it’s well known that LLMs struggle with large amounts of interrelated complexity in the training data, when none of that complexity, or anything remotely related to it, was in the pretraining data. “Open-ended continual learning purely through the context window” would be a very extreme version of that—centuries of knowledge, entire new fields and ways of thinking, entirely absent from pretraining and exclusively in the context window. No way is that going to work.
I agree weight updates are probably needed. I like the way you phrased the limitation; it matches some thoughts I’ve had, but never put as precisely.
I expect you understand continual learning and especially the brain better than I do, but it seems plausible to me that your interpretation of the alignment implications on top of that understanding is flawed.
I think “asymptotically 100% consequentialist” is quite possibly wrong about the objectives used for open-ended CL training.
We can incorporate ongoing character training to ensure it retains non-negligible asymptotic representation in the training mix.
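As a toy sketch of what I mean (the mixture numbers are invented): if every continual-learning batch reserves a fixed fraction for character training, that data’s cumulative share stays constant no matter how long training runs, unlike the pretraining share, which gets diluted.

```python
def character_share(char_per_step, task_per_step, steps):
    """Cumulative share of CL data that is character training,
    when every step adds the same fixed mixture."""
    char_total = char_per_step * steps
    total = (char_per_step + task_per_step) * steps
    return char_total / total

# Hypothetical mix: 5% character data, 95% task data per step.
for steps in (1, 1_000, 1_000_000):
    print(f"{steps:>9} steps: character share = {character_share(5, 95, steps):.3f}")
# the share stays 0.050 at every horizon
```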
I think humans interpret a lot of experiences we have in the context of our existing values, and this informs how we update. This can frequently reinforce our values.
Self-verification seems like it may be an important part of the CL objectives, and this can include self-verification of alignment with existing character (which starts close to Claude’s current nice character and hopefully stays close with some desirable ironing-out).
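A minimal sketch of one shape this gating could take. Everything here is hypothetical: `score_character_consistency` stands in for whatever check the system would actually run on a candidate update, and “params” is a single number for illustration.

```python
def apply_update_if_consistent(params, candidate_update,
                               score_character_consistency, threshold=0.8):
    """Accept a candidate weight update only if the updated model still
    scores high on consistency with the existing character."""
    updated = params + candidate_update  # stand-in for applying the update
    if score_character_consistency(updated) >= threshold:
        return updated, True
    return params, False  # reject: keep the old parameters

# Toy usage: "character" here means staying near 0.
score = lambda p: 1.0 - min(abs(p), 1.0)
p1, ok1 = apply_update_if_consistent(0.0, 0.1, score)  # small drift: accepted
p2, ok2 = apply_update_if_consistent(0.0, 0.9, score)  # large drift: rejected
```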
Maybe this “doing continual learning informed by existing values” is kind of similar to humans doing continual learning informed by human social instincts? I also think this is related to the confusion that I and some other commenters have about why imitation and consequentialism are the only options. I have a much messier list of possible update mechanisms that doesn’t seem like it fits cleanly into those two as broad categories. Maybe a good example is that humans update on a ton of random observations we’re surprised by. This doesn’t seem like imitation, nor does it seem consequentialist enough to be very risky? (Maybe there’s an active inference-related case to be made for consequentialism here, but I haven’t looked into that much, I’d be curious for someone to make that argument if so.)
“Increasing quantity and quality of character training throughout continual learning” seems like a potentially promising avenue for interventions, do you agree?
[Edit: I don’t think this is saying anything that different than my comment above, but it is a slightly different framing.]
Another point that I think might be quite important: we often set ourselves complex subgoals in line with our existing values, and then we try hard to achieve those goals, and we learn how to be more effective consequentialist agents at achieving that type of subgoal. There may be clearer feedback on how well we did at the subgoal than how well we achieved our existing values, but in lots of cases we notice if there’s a significant divergence between what we achieved and our underlying values, which moderates the consequentialist learning and is a pressure towards maintaining alignment.
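As a toy formalization of that moderation (the scores and penalty weight are invented): if the effective learning signal on a subgoal is penalized by observed divergence from underlying values, a ruthless strategy with a higher raw subgoal score can still end up reinforced less than an aligned one.

```python
def moderated_score(subgoal_score, value_divergence, lam=2.0):
    """Subgoal feedback, penalized by how far the outcome
    drifted from underlying values (lam is a made-up weight)."""
    return subgoal_score - lam * value_divergence

# Ruthless strategy: better raw subgoal score, large value divergence.
ruthless = moderated_score(subgoal_score=0.95, value_divergence=0.40)  # 0.15
aligned  = moderated_score(subgoal_score=0.80, value_divergence=0.02)  # 0.76
assert aligned > ruthless  # the aligned strategy gets the stronger signal
```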