Steven Byrnes comments on Why we should expect ruthless sociopath ASI

Steven Byrnes 19 Feb 2026 16:07 UTC
LW: 7 AF: 4
If you compare a human in 30000 BC to a human today, our brains are full of new information that wasn’t in the training data of 30000 BC. I want to talk about: what would it look like to be in a world where you can put millions of LLMs in a sealed box containing a VR environment, for (the equivalent of) thousands of years, and then we open up the box and find that the LLMs have made an analogous kind of scientific and technological progress? (See §1 of “Sharp Left Turn” discourse: An opinionated review.)
Spoiler: I think this is fundamentally impossible with LLMs as we know them today. Anyway, let’s explore the options.
One option is: the LLMs have super-long context windows that store textbooks for all the new fields of science and technology that were invented after we closed the box. I don’t think this would work, because LLMs (at least as we know them today) struggle with large amounts of interrelated complexity in the context window that is far outside the distribution of anything in the weights. A more likely option is: there’s a mechanism to update the LLM’s weights. Either way, there has to be some selection mechanism that decides what new content is worth keeping or not. If you come across a proto-idea, do you update the weights with it or not? If you just update with “whatever seems right” or something, then I claim that errors will compound over time and the whole system goes off the rails.
Basically, my claim is that this selection mechanism (for new knowledge, plans, strategies, etc.), whatever it is, has to be grounded in consequentialism, to work perpetually inside this closed box.
And as this process proceeds, (subjective) century after (subjective) century, the influence of pretraining would get diluted away, until everything is ultimately coming from the consequentialism-grounded selection mechanism.
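A toy numerical sketch of the "errors compound" worry may help make the claim concrete. It is not from the original comment, and everything in it is an assumption chosen for illustration: a single number stands in for the system's beliefs, `truth` stands in for reality, and the two acceptance rules are crude stand-ins for "whatever seems right" versus a check against real consequences. It illustrates the shape of the argument and is not evidence for it.

```python
import random

def run(grounded, steps=1000, start_error=1.0, seed=0):
    """Return the final |belief - truth| after `steps` proposed updates."""
    rng = random.Random(seed)
    truth = 0.0
    belief = truth + start_error
    for _ in range(steps):
        # A new "proto-idea": a random perturbation of the current belief.
        proposal = belief + rng.gauss(0.0, 1.0)
        if grounded:
            # Keep the idea only if it measurably lands closer to reality.
            accept = abs(proposal - truth) < abs(belief - truth)
        else:
            # "Whatever seems right": judged only against current beliefs.
            accept = abs(proposal - belief) < 1.0
        if accept:
            belief = proposal
    return abs(belief - truth)

def mean_error(grounded, trials=200):
    """Average final error over `trials` independent seeded runs."""
    return sum(run(grounded, seed=s) for s in range(trials)) / trials

if __name__ == "__main__":
    print("self-judged updates:", mean_error(False))
    print("grounded updates:   ", mean_error(True))
```

In this toy, the self-judged rule behaves like a random walk, so its error grows with the number of updates, while the grounded rule can never move further from `truth` than where it started. Whether real continual-learning systems behave like either rule is exactly what the thread is debating.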
Anyway, we can argue about the details, but I don’t think LLM people are thinking about how to get to an end-state of this box, in which (again) you close the box, then open it much later, and find that huge amounts of open-ended intellectual progress have occurred while it was closed, analogous to what global human civilization has created over the centuries. I think that if people tried to work out what such a box might look like in detail, they would find that it either needs ruthless consequentialist agency going into it, or else creates ruthless consequentialist agency while it runs. Or perhaps they’d just agree with me that LLMs are not cut out for populating this box and never will be.
Sorry if I’m missing your point.
You seem to be describing a shallow kind of imitation
I don’t think so… I think I’m making a narrower claim: the manner in which humans (alone and collectively) do open-ended continual learning, especially over extended periods of time, does not have an analogue in LLMs. This is different from the question of whether LLMs are imitating humans “deeply” vs “shallowly” at inference time. I’m certainly not one of the people who call LLMs “stochastic parrots” etc. The thing they’re doing at inference time is IMO clearly capturing a deep (+ also wide) level of knowledge / understanding.
- Victor Levoso 20 Feb 2026 4:09 UTC
  LW: 5 AF: 3
  Oh okay, then I think some of my objections are wrong, but then your post seems like it fails to explain the narrower claim well? You are describing a failure of LLMs to imitate humans as if it were a problem with imitation learning. If you put LLMs in a box and you get different results than if you put humans in a box, you are describing LLMs that are bad at human imitation. Namely, they lack open-ended continual learning. As opposed to saying the problem is that you think you cannot do continual learning on LLMs without some form of consequentialism.
  In the case of very-long-context LLMs, you are even claiming LLMs wouldn’t be able to imitate human behaviour in their context.
  I like your box example better (we could also call it a country of geniuses in a closed datacenter). I feel like there’s a lot of interesting debate to be had about what kind of improvements to LLMs get us to them making lots of inventions in the box.
  And this seems important to me, because the obvious question to me here is “can you imitation-learn whatever process humans use to invent things without being ruthless consequentialists?”
  Or in other words, can your whole research program of how to imitate the things that make social instincts in the brain be bitter-lesson-ed via imitation learning on long-horizon tasks/data?
  Or maybe not even long horizon; maybe it just generalizes from short horizons + external memory. It’s unclear to me whether, if you put smart and competent adult humans without the capability to remember more than 1h in a box (and they already know how to write), they wouldn’t be able to manage to invent arbitrary things with a lot of extra effort, obsessively taking notes and inventing better ways of using notes.
  Humans doing this, if it works, would work because it IS grounded in the consequentialist behaviour of humans. But it wouldn’t be ruthless consequentialism, because humans have social instincts.
  It seems like you are implying LLMs already have something like the human social instincts via imitation at inference time, but you can’t use them in any way to bootstrap to some continual-learning thing that’s grounded in human-like consequentialism, and that seems like the direction where the interesting discussion lies?
  Also, to be clear, my own position is more on the side of thinking you can probably get something that could populate the box from LLM + RL + maybe some memory-related change, but in practice you likely do it by accidentally making them ruthless consequentialists, unless you really know what you are doing or get extremely lucky.
  But I want to take the side of the AI optimists here, because I feel like you haven’t addressed smarter versions of their position very well?
  Even if the typical AI optimist hasn’t thought that far. Though I dunno; I don’t know what Anthropic’s comparatively less pessimistic people think (and I expect there’s actually a wide range of views in there), but they have to be thinking about continual learning or how LLMs will do long-horizon tasks, and if they’re still skeptical of ruthless consequentialists being a thing, they’ll have some reason why they expect whatever solution to not lead to that.
  - Steven Byrnes 22 Feb 2026 0:48 UTC
    LW: 2 AF: 2
    Right, I think LLMs are good at imitating “Joe today”, and good at imitating “Joe + 1 month of learning Introductory Category Theory”, but not good at imitating the delta between those, i.e. the way that Joe grows and changes over that 1 month of learning—or at least, not “good at imitating” in a way that would generalize to imitating a person learning a completely different topic that’s not in the training data.
    In other words, after watching 1000 people try to learn category theory over the course of a month (while keeping diaries), I claim that an LLM would learn category theory itself, and it would learn all the common misconceptions about category theory that people make as they start learning, but it wouldn’t learn “the general process of learning and sense-making itself” in a way that allows it to then autonomously invent some field that has not been invented yet.
    I had a long comment-thread argument with Cole Wyeth on this general topic last year: link. We didn’t resolve our disagreement and I eventually bowed out of the conversation, but you might find it helpful anyway. See especially my analogy to trying to imitation-learn AlphaZero improving itself through self-play.
    Or maybe not even long horizon; maybe it just generalizes from short horizons + external memory. It’s unclear to me whether, if you put smart and competent adult humans without the capability to remember more than 1h in a box (and they already know how to write), they wouldn’t be able to manage to invent arbitrary things with a lot of extra effort, obsessively taking notes and inventing better ways of using notes.
    My answer is “obviously not”. Here’s an example:
    Imagine that the “competent adult humans” were all from 100 years before linear algebra existed, and we are hoping that they will invent linear algebra. Now, linear algebra involves a giant pile of interlinked concepts: matrices, bases, rank, nullity, spans, determinants, trace, eigenvectors, dual space, unitarity, and on and on. Now take a parade of these “competent human adults” with no prior exposure to any of this, and give them an hour each before they get fired, but they are allowed to send notes to each other. The goal is for them to collectively invent the entire edifice of linear algebra from scratch. I think it’s doomed. If you take a person who has never seen linear algebra before, then it will just take them a lot of time (much more than an hour) to internalize all these concepts and get sufficiently familiar with them to start building on them. It doesn’t matter how good the notes are, it just takes time to develop strong and deep intuitions about a new concept. It doesn’t matter how many people there are, because zero of those people will be able to push forward the frontier in the one hour before they get fired, because it takes longer than that to internalize a new conceptual space.
    (I don’t think it’s too relevant for this thread, but fun fact: there were some experiments by Ought in like 2018-2020 vaguely related to this, see e.g. this post on “relay programming”.)
    What links here?
    You can’t imitation-learn how to continual-learn by Steven Byrnes (16 Mar 2026 21:20 UTC; 186 points)
    - RohanS 22 Feb 2026 20:05 UTC
      LW: 5 AF: 4
      I’d be curious for you to say a bit more in response to this point from above:
      It seems like you are implying LLMs already have something like the human social instincts via imitation at inference time, but you can’t use them in any way to bootstrap to some continual-learning thing that’s grounded in human-like consequentialism, and that seems like the direction where the interesting discussion lies?
      I’m moderately optimistic about our ability to get roughly human-like consequentialism from LLM-based AGI, with character training instilling a non-ruthless non-sociopath character that is still compatible with lots of consequentialist agency, like very good scientists/inventors/entrepreneurs/etc. who never do anything that could be very dangerous (because a bunch of useful things aren’t that risky, and because they have good enough moral principles to deliberately avoid or be careful around things that are risky).
      I think long-horizon RL or reflection or other components of the continual learning process could break the instilled character, but it seems >50% likely to me that between the preliminary character training and ongoing training and prompting to maintain good character, those things won’t dominate, and we’ll just have nice AIs. (I get a bit more nervous about this argument for ASI, but I think it may well hold up even there.)
      - Steven Byrnes 22 Feb 2026 21:21 UTC
        LW: 7 AF: 5
        The thing I’m skeptical of is maintaining non-ruthless behavior in the presence of arbitrary amounts of open-ended continual learning. By “open-ended continual learning”, I mean something analogous to what humans did between 30000 BC and today, e.g. inventing new fields, and then still more new fields that build on those new fields, etc. And the AI has to do that without any human input, given enough time.
        My actual belief is that this kind of open-ended continual learning is simply impossible in LLMs. If I’m wrong about that, then I would next claim that it requires continually updating the LLM weights (not just context window). I think it’s well known that LLMs struggle with large amounts of interrelated complexity in the training data, when none of that complexity, or anything remotely related to it, was in the pretraining data. “Open-ended continual learning purely through the context window” would be a very extreme version of that—centuries of knowledge, entire new fields and ways of thinking, entirely absent from pretraining and exclusively in the context window. No way is that going to work.
        OK, so far I’ve argued that if this kind of continual learning is possible at all, it would require continual weight updates to lock in the new knowledge and ideas that the LLM generates—and not just one-time small updates, but more and more updates as the process continues, asymptoting to 100% of the training data.
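        The "asymptoting to 100%" step can be written out as simple arithmetic. This is an illustrative sketch, not the author's own formalization: the symbols $D_0$ (amount of pretraining data), $d$ (amount of self-generated data added per round of continual learning), and $n$ (number of rounds) are assumptions, as is the idea that every token counts equally.

        ```latex
        f_n \;=\; \frac{D_0}{D_0 + n\,d}
        \qquad\Longrightarrow\qquad
        \lim_{n \to \infty} f_n = 0,
        \qquad
        \lim_{n \to \infty} \bigl(1 - f_n\bigr) = 1 .
        ```

        Here $f_n$ is the share of all training data that came from pretraining after $n$ rounds. Under these assumptions it shrinks toward zero for any fixed $d > 0$, so the self-generated share $1 - f_n$ approaches 100%.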
        If you buy all that, how do you think these weight updates will work? Where do you think the “training data” for those updates will come from?
        Or if you don’t buy that, how do you think the continual learning will work?
        My experience is that lots of LLM-focused people say “open-ended continual learning will be solved somehow, I guess”, and don’t think too hard about exactly how it gets solved. And then that’s how the pea gets hidden under the thimble. Because actually, I claim, continual learning needs some kind of ground truth or else it will go off the rails, and that ground truth basically amounts to an objective function, and when the LLM continual-learns enough from that ground truth, all the niceness of pretraining gets diluted away in favor of the ruthless maximization of that objective function.
        Again, maybe you have some specific idea about how LLM open-ended continual learning would work that you think won’t have this problem? If so, what is it?
        RohanS 24 Feb 2026 6:47 UTC
        LW: 8 AF: 5
        (Slightly rambly comment, sorry)
        I agree open-ended continual learning (CL) is probably big, I have been thinking and writing about CL a bunch recently but tbh I don’t think I’m near the end of clarifying all my thoughts on it. (Still hope to publish a sequence on it with some collaborators soon though.)
        I think it’s well known that LLMs struggle with large amounts of interrelated complexity in the training data, when none of that complexity, or anything remotely related to it, was in the pretraining data. “Open-ended continual learning purely through the context window” would be a very extreme version of that—centuries of knowledge, entire new fields and ways of thinking, entirely absent from pretraining and exclusively in the context window. No way is that going to work.
        I agree weight updates are probably needed, I like the way you phrased the limitation, it matches some thoughts I’ve had but never as precisely.
        I expect you understand continual learning and especially the brain better than I do, but it seems plausible to me that your interpretation of the alignment implications on top of that understanding is flawed.
        I think “asymptotically 100% consequentialist” is quite possibly wrong about the objectives used for open-ended CL training.
        We can incorporate ongoing character training to ensure it has non-negligible asymptotic representation.
        I think humans interpret a lot of experiences we have in the context of our existing values, and this informs how we update. This can frequently reinforce our values.
        Self-verification seems like it may be an important part of the CL objectives, and this can include self-verification of alignment with existing character (which starts close to Claude’s current nice character and hopefully stays close with some desirable ironing-out).
        Maybe this “doing continual learning informed by existing values” is kind of similar to humans doing continual learning informed by human social instincts? I also think this is related to the confusion that I and some other commenters have about why imitation and consequentialism are the only options. I have a much messier list of possible update mechanisms that doesn’t seem like it fits cleanly into those two as broad categories. Maybe a good example is that humans update on a ton of random observations we’re surprised by. This doesn’t seem like imitation, nor does it seem consequentialist enough to be very risky? (Maybe there’s an active inference-related case to be made for consequentialism here, but I haven’t looked into that much, I’d be curious for someone to make that argument if so.)
        “Increasing quantity and quality of character training throughout continual learning” seems like a potentially promising avenue for interventions, do you agree?
        What links here?
        Should We Train Against (CoT) Monitors? by RohanS (23 Apr 2026 19:19 UTC; 41 points)
        Steven Byrnes 26 Feb 2026 18:31 UTC
        LW: 4 AF: 3
        Maybe a good example is that humans update on a ton of random observations we’re surprised by. This doesn’t seem like imitation, nor does it seem consequentialist enough to be very risky?
        Right, I think humans have a distinction between beliefs and desires (“is versus ought”) that’s pretty disanalogous to how LLMs work (see discussion here), and our beliefs / “is”s get updated by predictive learning from sensory inputs. My dichotomy of consequentialism vs imitative learning in the OP was about the “ought” part, which predictive learning doesn’t help with. I.e., when you’re choosing your own actions in a novel domain, predictive learning doesn’t constrain your options.
        (And I think “actions” are important even for disembodied situations like “figuring things out by thinking about them”, see §1.1 here.)
        I think “asymptotically 100% consequentialist” is quite possibly wrong about the objectives used for open-ended CL training.
        As a side-note, this whole conversation is pretty tricky because we’re talking about this vague hypothetical system (that allows one or more LLMs to autonomously invent and develop a rich new field of science via some form of continual learning), and I don’t think such a system is even possible, and you seem to think it might be possible but you haven’t spelled out all the details of how it would work. E.g. one of the problems is: there’s no training data for continual learning, because the new field of knowledge doesn’t exist yet. Relatedly, what’s the “objective”?
        Anyway, we can keep trying, but this might be a tricky conversation to make progress with.
        Back to the object level:
        “Interspersing character training” is an interesting idea (thanks), but after thinking about it a bit, here’s why I think it won’t work in this context. BTW I’m interpreting “character training” per the four-bullet-point “pipeline” here, lmk if you meant something different.
        Character training (as defined in that link) seems to rely on the idea that the tokens “I will be helpful, and honest, and harmless, blah blah…” are more likely to be followed by tokens that are in fact helpful, and honest, and harmless, blah blah, than tokens that are not prefixed by that constitution. That’s a good assumption for LLMs of today, but why? I claim: it’s because LLMs are generalizing from the human-created text of the pretraining data.
        As a thought experiment: If, everywhere on the internet and in every book etc., whenever a human said “I’m gonna be honest”, they then immediately lied, then character-training with a constitution that said “I will be honest” would lead to lying rather than honesty. Right? Indeed, it would be equivalent to flipping the definition of the word “honest” in the English language. So again, this illustrates how the constitution-based character training is relying on the model basically staying close to the statistical properties of the pretraining data.
        …But that means: the more that the weights drift away from their pretraining state, the less reason we have to expect this type of character training to work well, or at all.
        You might respond: “OK, we’ll instead do RLAIF with a fixed “judge”, i.e. one that does not have its weights continually updated.” That indeed avoids the problem above, but introduces different problems instead. If the optimization is powerful, then we’re optimizing against a fixed judge, and we should expect the system to jailbreak the judge or similar. Alternatively, if the optimization is weak (i.e. only slightly changing the model, as in the traditional KL-divergence penalty of RLHF), then I think it will eventually stop working as the model gradually drifts so far away from niceness that slight tweaks can’t pull it back. Or something like that.
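        For readers who want the "strong versus weak optimization" distinction in symbols, the standard KL-penalized objective used in RLHF-style training can be written as below. The notation is an assumption added for illustration: $\pi_\theta$ is the model being trained, $\pi_{\mathrm{ref}}$ is the reference model, $r$ is the score given by the fixed judge, and $\beta$ is the penalty coefficient. The comment does not say which model would serve as $\pi_{\mathrm{ref}}$ in a continual-learning setup, so that choice is left open here.

        ```latex
        \max_{\theta}\;
        \mathbb{E}_{x \sim \mathcal{D}}
        \Bigl[
          \mathbb{E}_{y \sim \pi_\theta(\cdot \mid x)}\bigl[\, r(x, y) \,\bigr]
          \;-\;
          \beta \,\mathrm{KL}\bigl(\pi_\theta(\cdot \mid x)\,\big\|\,\pi_{\mathrm{ref}}(\cdot \mid x)\bigr)
        \Bigr]
        ```

        A small $\beta$ corresponds to the "powerful optimization" case, where the policy is free to move far from $\pi_{\mathrm{ref}}$ in pursuit of the judge's score. A large $\beta$ corresponds to the "weak optimization" case, where each update can only nudge the policy slightly away from the reference.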
        What links here?
        Should We Train Against (CoT) Monitors? by RohanS (23 Apr 2026 19:19 UTC; 41 points)
        RohanS 24 Feb 2026 17:59 UTC
        LW: 4 AF: 4
        [Edit: I don’t think this is saying anything that different than my comment above, but it is a slightly different framing.]
        
        Another point that I think might be quite important: we often set ourselves complex subgoals in line with our existing values, and then we try hard to achieve those goals, and we learn how to be more effective consequentialist agents at achieving that type of subgoal. There may be clearer feedback on how well we did at the subgoal than how well we achieved our existing values, but in lots of cases we notice if there’s a significant divergence between what we achieved and our underlying values, which moderates the consequentialist learning and is a pressure towards maintaining alignment.