Steven Byrnes comments on Why we should expect ruthless sociopath ASI

Steven Byrnes 22 Feb 2026 0:48 UTC
LW: 2 AF: 2
Right, I think LLMs are good at imitating “Joe today”, and good at imitating “Joe + 1 month of learning Introductory Category Theory”, but not good at imitating the delta between those, i.e. the way that Joe grows and changes over that 1 month of learning—or at least, not “good at imitating” in a way that would generalize to imitating a person learning a completely different topic that’s not in the training data.
In other words, after watching 1000 people try to learn category theory over the course of a month (while keeping diaries), I claim that an LLM would learn category theory itself, and it would learn all the common misconceptions about category theory that people have as they start learning, but it wouldn’t learn “the general process of learning and sense-making itself” in a way that allows it to then autonomously invent some field that has not been invented yet.
I had a long comment-thread argument with Cole Wyeth on this general topic last year: link. We didn’t resolve our disagreement and I eventually bowed out of the conversation, but you might find it helpful anyway. See especially my analogy to trying to imitation-learn AlphaZero improving itself through self-play.
Or maybe not even long-horizon: it just generalizes from short horizons + external memory. It’s unclear to me whether, if you put smart and competent adult humans without the capability to remember more than 1h in a box (and they already know how to write), they wouldn’t be able to manage to invent arbitrary things with a lot of extra effort, [obsessive] note-taking, and inventing better ways of using notes.
My answer is “obviously not”. Here’s an example:
Imagine that the “competent adult humans” were all from 100 years before linear algebra existed, and we are hoping that they will invent linear algebra. Now, linear algebra involves a giant pile of interlinked concepts: matrices, bases, rank, nullity, spans, determinants, trace, eigenvectors, dual space, unitarity, and on and on. Now take a parade of these “competent human adults” with no prior exposure to any of this, and give them an hour each before they get fired, but they are allowed to send notes to each other. The goal is for them to collectively invent the entire edifice of linear algebra from scratch. I think it’s doomed. If you take a person who has never seen linear algebra before, then it will just take them a lot of time (much more than an hour) to internalize all these concepts and get sufficiently familiar with them to start building on them. It doesn’t matter how good the notes are, it just takes time to develop strong and deep intuitions about a new concept. It doesn’t matter how many people there are, because zero of those people will be able to push forward the frontier in the one hour before they get fired, because it takes longer than that to internalize a new conceptual space.
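To make the shape of that argument explicit, here is a minimal toy model in Python. It is only a sketch: the function name, the parameters, and the numbers are all hypothetical, and the model simply encodes the assumption that each person must spend a fixed "onboarding" time internalizing the existing notes before they can add anything new.

```python
def concepts_invented(num_workers: int, shift_hours: float,
                      onboarding_hours: float, hours_per_concept: float) -> float:
    """Toy model of a relay of workers passing notes (all parameters hypothetical).

    Each worker gets `shift_hours` before being replaced. They must first spend
    `onboarding_hours` internalizing the existing notes; only the time left over
    goes toward inventing new concepts, at `hours_per_concept` each.
    """
    productive_hours = max(0.0, shift_hours - onboarding_hours)
    return num_workers * productive_hours / hours_per_concept


# A million one-hour shifts, when internalizing takes 40 hours: no progress.
print(concepts_invented(1_000_000, 1.0, 40.0, 10.0))  # 0.0

# One person with 2000 hours clears the onboarding cost and pushes the frontier.
print(concepts_invented(1, 2000.0, 40.0, 10.0))  # 196.0
```

In this model the number of workers only multiplies whatever productive time each has left, so if the shift is shorter than the onboarding cost, adding people changes nothing. Whether better notes could shrink the onboarding cost below an hour is exactly the point in dispute, and the model does not settle it.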
(I don’t think it’s too relevant for this thread, but fun fact: there were some experiments by Ought in like 2018-2020 vaguely related to this, see e.g. this post on “relay programming”.)
What links here?
- You can’t imitation-learn how to continual-learn by Steven Byrnes (16 Mar 2026 21:20 UTC; 182 points)
- RohanS 22 Feb 2026 20:05 UTC
  LW: 5 AF: 4
  I’d be curious for you to say a bit more in response to this point from above:
  It seems like you are implying LLMs have something like the human social instincts via imitation already at inference time, but you can’t use them in any way to bootstrap to some continuous learning thing that’s grounded in human-like consequentialism, and that seems like the direction where the interesting discussion lies?
  I’m moderately optimistic about our ability to get roughly human-like consequentialism from LLM-based AGI, with character training instilling a non-ruthless non-sociopath character that is still compatible with lots of consequentialist agency, like very good scientists/inventors/entrepreneurs/etc. who never do anything that could be very dangerous (because a bunch of useful things aren’t that risky, and because they have good enough moral principles to deliberately avoid or be careful around things that are risky).
  I think long-horizon RL or reflection or other components of the continual learning process could break the instilled character, but it seems >50% likely to me that between the preliminary character training and ongoing training and prompting to maintain good character, those things won’t dominate, and we’ll just have nice AIs. (I get a bit more nervous about this argument for ASI, but I think it may well hold up even there.)
  - Steven Byrnes 22 Feb 2026 21:21 UTC
    LW: 7 AF: 5
    The thing I’m skeptical of is maintaining non-ruthless behavior in the presence of arbitrary amounts of open-ended continual learning. By “open-ended continual learning”, I mean something analogous to what humans did between 30000 BC and today, e.g. inventing new fields, and then still more new fields that build on those new fields, etc. And the AI has to do that without any human input, given enough time.
    My actual belief is that this kind of open-ended continual learning is simply impossible in LLMs. If I’m wrong about that, then I would next claim that it requires continually updating the LLM weights (not just context window). I think it’s well known that LLMs struggle with large amounts of interrelated complexity in the training data, when none of that complexity, or anything remotely related to it, was in the pretraining data. “Open-ended continual learning purely through the context window” would be a very extreme version of that—centuries of knowledge, entire new fields and ways of thinking, entirely absent from pretraining and exclusively in the context window. No way is that going to work.
    OK, so far I’ve argued that if this kind of continual learning is possible at all, it would require continual weight updates to lock in the new knowledge and ideas that the LLM generates—and not just one-time small updates, but more and more updates as the process continues, asymptoting to 100% of the training data.
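    As a rough illustration of the “asymptoting to 100%” arithmetic, under a deliberately simple assumption: suppose pretraining contributes a fixed $P$ tokens, and each round of continual learning adds $\Delta$ newly generated tokens that are weighted the same as pretraining tokens. The symbols $P$, $\Delta$, and $n$ are illustrative assumptions, not a description of any real training setup. After $n$ rounds, the fraction of training data that came from continual learning is:

    ```latex
    f_n = \frac{n\,\Delta}{P + n\,\Delta},
    \qquad
    \lim_{n \to \infty} f_n = 1 .
    ```

    So for any fixed $P$, the pretraining share $1 - f_n = P / (P + n\Delta)$ shrinks toward zero as the process continues.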
    If you buy all that, how do you think these weight updates will work? Where do you think the “training data” for those updates will come from?
    Or if you don’t buy that, how do you think the continual learning will work?
    My experience is that lots of LLM-focused people say “open-ended continual learning will be solved somehow, I guess”, and don’t think too hard about exactly how it gets solved. And then that’s how the pea gets hidden under the thimble. Because actually, I claim, continual learning needs some kind of ground truth or else it will go off the rails, and that ground truth basically amounts to an objective function, and when the LLM continual-learns enough from that ground truth, all the niceness of pretraining gets diluted away in favor of the ruthless maximization of that objective function.
    Again, maybe you have some specific idea about how LLM open-ended continual learning would work that you think won’t have this problem? If so, what is it?
    - RohanS 24 Feb 2026 6:47 UTC
      LW: 8 AF: 5
      (Slightly rambly comment, sorry)
      I agree open-ended continual learning (CL) is probably big. I have been thinking and writing about CL a bunch recently, but tbh I don’t think I’m near the end of clarifying all my thoughts on it. (Still hope to publish a sequence on it with some collaborators soon though.)
      I think it’s well known that LLMs struggle with large amounts of interrelated complexity in the training data, when none of that complexity, or anything remotely related to it, was in the pretraining data. “Open-ended continual learning purely through the context window” would be a very extreme version of that—centuries of knowledge, entire new fields and ways of thinking, entirely absent from pretraining and exclusively in the context window. No way is that going to work.
      I agree weight updates are probably needed. I like the way you phrased the limitation; it matches some thoughts I’ve had but never put as precisely.
      I expect you understand continual learning and especially the brain better than I do, but it seems plausible to me that your interpretation of the alignment implications on top of that understanding is flawed.
      I think “asymptotically 100% consequentialist” is quite possibly wrong about the objectives used for open-ended CL training.
      We can incorporate ongoing character training to ensure it has non-negligible asymptotic representation.
      I think humans interpret a lot of experiences we have in the context of our existing values, and this informs how we update. This can frequently reinforce our values.
      Self-verification seems like it may be an important part of the CL objectives, and this can include self-verification of alignment with existing character (which starts close to Claude’s current nice character and hopefully stays close with some desirable ironing-out).
      Maybe this “doing continual learning informed by existing values” is kind of similar to humans doing continual learning informed by human social instincts? I also think this is related to the confusion that I and some other commenters have about why imitation and consequentialism are the only options. I have a much messier list of possible update mechanisms that doesn’t seem like it fits cleanly into those two as broad categories. Maybe a good example is that humans update on a ton of random observations we’re surprised by. This doesn’t seem like imitation, nor does it seem consequentialist enough to be very risky? (Maybe there’s an active inference-related case to be made for consequentialism here, but I haven’t looked into that much, I’d be curious for someone to make that argument if so.)
      “Increasing quantity and quality of character training throughout continual learning” seems like a potentially promising avenue for interventions, do you agree?
      - Steven Byrnes 26 Feb 2026 18:31 UTC
        LW: 4 AF: 3
        Maybe a good example is that humans update on a ton of random observations we’re surprised by. This doesn’t seem like imitation, nor does it seem consequentialist enough to be very risky?
        Right, I think humans have a distinction between beliefs and desires (“is versus ought”) that’s pretty disanalogous to how LLMs work (see discussion here), and our beliefs / “is”s get updated by predictive learning from sensory inputs. My dichotomy of consequentialism vs imitative learning in the OP was about the “ought” part, which predictive learning doesn’t help with. I.e., when you’re choosing your own actions in a novel domain, predictive learning doesn’t constrain your options.
        (And I think “actions” are important even for disembodied situations like “figuring things out by thinking about them”, see §1.1 here.)
        I think “asymptotically 100% consequentialist” is quite possibly wrong about the objectives used for open-ended CL training.
        As a side-note, this whole conversation is pretty tricky because we’re talking about this vague hypothetical system (that allows one or more LLMs to autonomously invent and develop a rich new field of science via some form of continual learning), and I don’t think such a system is even possible, and you seem to think it might be possible but you haven’t spelled out all the details of how it would work. E.g. one of the problems is: there’s no training data for continual learning, because the new field of knowledge doesn’t exist yet. Relatedly, what’s the “objective”?
        Anyway, we can keep trying, but this might be a tricky conversation to make progress with.
        Back to the object level:
        “Interspersing character training” is an interesting idea (thanks), but after thinking about it a bit, here’s why I think it won’t work in this context. BTW I’m interpreting “character training” per the four-bullet-point “pipeline” here, lmk if you meant something different.
        Character training (as defined in that link) seems to rely on the idea that the tokens “I will be helpful, and honest, and harmless, blah blah…” are more likely to be followed by tokens that are in fact helpful, and honest, and harmless, blah blah, than would be the case without that constitution as a prefix. That’s a good assumption for LLMs of today, but why? I claim: it’s because LLMs are generalizing from the human-created text of the pretraining data.
        As a thought experiment: If, everywhere on the internet and in every book etc., whenever a human said “I’m gonna be honest”, they then immediately lied, then character-training with a constitution that said “I will be honest” would lead to lying rather than honesty. Right? Indeed, it would be equivalent to flipping the definition of the word “honest” in the English language. So again, this illustrates how the constitution-based character training is relying on the model basically staying close to the statistical properties of the pretraining data.
        …But that means: the more that the weights drift away from their pretraining state, the less reason we have to expect this type of character training to work well, or at all.
        You might respond: “OK, we’ll instead do RLAIF with a fixed ‘judge’, i.e. one that does not have its weights continually updated.” That indeed avoids the problem above, but introduces different problems instead. If the optimization is powerful, then we’re optimizing against a fixed judge, and we should expect the system to jailbreak the judge or similar. Alternatively, if the optimization is weak (i.e. only slightly changing the model, as in the traditional KL-divergence penalty of RLHF), then I think it will eventually stop working as the model gradually drifts so far away from niceness that slight tweaks can’t pull it back. Or something like that.
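        For reference, one common way of writing the KL-penalized objective mentioned above is sketched below. The notation is the conventional one rather than anything specified in this thread: $\pi_\theta$ is the policy being trained, $\pi_{\mathrm{ref}}$ is a frozen reference policy, $r(x, y)$ is the score from the reward model (here, the fixed judge), and $\beta$ sets the strength of the penalty.

        ```latex
        \max_{\theta}\;
        \mathbb{E}_{x \sim \mathcal{D},\; y \sim \pi_\theta(\cdot \mid x)}
        \big[\, r(x, y) \,\big]
        \;-\;
        \beta\, \mathbb{E}_{x \sim \mathcal{D}}
        \Big[ D_{\mathrm{KL}}\big( \pi_\theta(\cdot \mid x) \,\big\|\, \pi_{\mathrm{ref}}(\cdot \mid x) \big) \Big]
        ```

        In these terms, the “powerful optimization” case roughly corresponds to small $\beta$ (the policy is free to chase the judge’s score), and the “weak optimization” case to large $\beta$ (the policy stays close to $\pi_{\mathrm{ref}}$).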
      - RohanS 24 Feb 2026 17:59 UTC
        LW: 4 AF: 4
        [Edit: I don’t think this is saying anything that different than my comment above, but it is a slightly different framing.]
        
        Another point that I think might be quite important: we often set ourselves complex subgoals in line with our existing values, and then we try hard to achieve those goals, and we learn how to be more effective consequentialist agents at achieving that type of subgoal. There may be clearer feedback on how well we did at the subgoal than how well we achieved our existing values, but in lots of cases we notice if there’s a significant divergence between what we achieved and our underlying values, which moderates the consequentialist learning and is a pressure towards maintaining alignment.