I agree open-ended continual learning (CL) is probably big. I’ve been thinking and writing about CL a bunch recently, but tbh I don’t think I’m near the end of clarifying all my thoughts on it. (I still hope to publish a sequence on it with some collaborators soon, though.)
I think it’s well known that LLMs struggle with large amounts of interrelated complexity in the training data, when none of that complexity, or anything remotely related to it, was in the pretraining data. “Open-ended continual learning purely through the context window” would be a very extreme version of that—centuries of knowledge, entire new fields and ways of thinking, entirely absent from pretraining and exclusively in the context window. No way is that going to work.
I agree weight updates are probably needed. I like the way you phrased the limitation; it matches some thoughts I’ve had, but I’d never put them that precisely.
I expect you understand continual learning and especially the brain better than I do, but it seems plausible to me that your interpretation of the alignment implications on top of that understanding is flawed.
I think “asymptotically 100% consequentialist” is quite possibly wrong about the objectives used for open-ended CL training.
We can incorporate ongoing character training so that the character objective retains non-negligible asymptotic representation.
I think humans interpret a lot of experiences we have in the context of our existing values, and this informs how we update. This can frequently reinforce our values.
Self-verification seems like it may be an important part of the CL objectives, and this can include self-verification of alignment with existing character (which starts close to Claude’s current nice character and hopefully stays close with some desirable ironing-out).
Maybe this “doing continual learning informed by existing values” is kind of similar to humans doing continual learning informed by human social instincts? I also think this is related to the confusion that I and some other commenters have about why imitation and consequentialism are the only options. I have a much messier list of possible update mechanisms that doesn’t seem like it fits cleanly into those two as broad categories. Maybe a good example is that humans update on a ton of random observations we’re surprised by. This doesn’t seem like imitation, nor does it seem consequentialist enough to be very risky? (Maybe there’s an active inference-related case to be made for consequentialism here, but I haven’t looked into that much, I’d be curious for someone to make that argument if so.)
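To make the “updating on random surprising observations” example concrete, here’s a toy sketch (the scalar belief, the learning rate, and all the numbers are invented for illustration; this is not a claim about how any real CL system would work). It’s a plain delta rule: the update is proportional to prediction error, which doesn’t obviously fit either “imitation” or “consequentialism” as a category.

```python
# Toy illustration (all values hypothetical): surprise-weighted predictive
# updating, i.e. a delta rule. The agent holds a scalar belief about some
# observable and moves it toward each observation in proportion to the
# prediction error ("surprise"). Expected observations barely move the
# belief; surprising ones move it a lot.

def surprise_update(belief: float, observation: float, lr: float = 0.3) -> float:
    """One predictive-learning step: nudge belief toward the observation."""
    prediction_error = observation - belief  # the "surprise" signal
    return belief + lr * prediction_error

belief = 0.0
for obs in [1.0, 1.0, 1.0, 10.0]:  # three expected observations, one surprise
    belief = surprise_update(belief, obs)
# belief ends near 3.46; the single surprise moved it more than the
# three expected observations combined (0 -> 0.657 -> ~3.46).
```

Note that this updates a belief about the world rather than selecting an action, which may be exactly where the imitation-vs-consequentialism question bites.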
“Increasing quantity and quality of character training throughout continual learning” seems like a potentially promising avenue for interventions, do you agree?
[Edit: I don’t think this is saying anything that different than my comment above, but it is a slightly different framing.]
Another point that I think might be quite important: we often set ourselves complex subgoals in line with our existing values, try hard to achieve those subgoals, and thereby learn to be more effective consequentialist agents at that type of subgoal. There may be clearer feedback on how well we did at the subgoal than on how well we served our underlying values, but in lots of cases we notice when there’s a significant divergence between what we achieved and those values, which moderates the consequentialist learning and is a pressure towards maintaining alignment.
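A toy sketch of that moderation dynamic (the functions, step sizes, and thresholds here are all made up for illustration): optimize hard on a subgoal with crisp feedback, but veto any step that costs too much by an underlying-values measure.

```python
# Toy illustration (functions and thresholds invented for this sketch):
# hill-climb hard on a subgoal, but veto any step that drifts too far
# from an underlying-values score. The subgoal gets crisp feedback;
# the values check only moderates, it never drives.

def values_score(x: float) -> float:
    """Stand-in for alignment with underlying values (best at x = 0)."""
    return -abs(x)

def subgoal_score(x: float) -> float:
    """Stand-in for the consequentially-optimized subgoal (best at x = 5)."""
    return -abs(x - 5.0)

def optimize_with_veto(x: float, steps: int = 50, step: float = 0.5,
                       max_values_loss: float = 2.0) -> float:
    for _ in range(steps):
        candidate = x + step if subgoal_score(x + step) > subgoal_score(x) else x
        if values_score(candidate) >= -max_values_loss:  # divergence check
            x = candidate
    return x
```

With the veto, the loop stalls at x = 2.0, the edge of the acceptable-values region; with `max_values_loss=float("inf")` the same loop climbs all the way to the subgoal optimum at x = 5.0.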
Maybe a good example is that humans update on a ton of random observations we’re surprised by. This doesn’t seem like imitation, nor does it seem consequentialist enough to be very risky?
Right, I think humans have a distinction between beliefs and desires (“is versus ought”) that’s pretty disanalogous to how LLMs work (see discussion here), and our beliefs / “is”s get updated by predictive learning from sensory inputs. My dichotomy of consequentialism vs imitative learning in the OP was about the “ought” part, which predictive learning doesn’t help with. I.e., when you’re choosing your own actions in a novel domain, predictive learning doesn’t constrain your options.
(And I think “actions” are important even for disembodied situations like “figuring things out by thinking about them”, see §1.1 here.)
I think “asymptotically 100% consequentialist” is quite possibly wrong about the objectives used for open-ended CL training.
As a side-note, this whole conversation is pretty tricky because we’re talking about a vague hypothetical system (one that allows one or more LLMs to autonomously invent and develop a rich new field of science via some form of continual learning). I don’t think such a system is even possible, and you seem to think it might be, but you haven’t spelled out the details of how it would work. E.g. one of the problems: there’s no training data for the continual learning, because the new field of knowledge doesn’t exist yet. Relatedly, what’s the “objective”?
Anyway, we can keep trying, but this might be a tricky conversation to make progress with.
Back to the object level:
“Interspersing character training” is an interesting idea (thanks), but after thinking about it a bit, here’s why I think it won’t work in this context. BTW I’m interpreting “character training” per the four-bullet-point “pipeline” here, lmk if you meant something different.
Character training (as defined in that link) seems to rely on the idea that the tokens “I will be helpful, and honest, and harmless, blah blah…” are more likely to be followed by tokens that are in fact helpful, and honest, and harmless, blah blah, than tokens not prefixed by that constitution would be. That’s a good assumption for LLMs of today, but why? I claim it’s because LLMs are generalizing from the human-created text of the pretraining data.
As a thought experiment: If, everywhere on the internet and in every book etc., whenever a human said “I’m gonna be honest”, they then immediately lied, then character-training with a constitution that said “I will be honest” would lead to lying rather than honesty. Right? Indeed, it would be equivalent to flipping the definition of the word “honest” in the English language. So again, this illustrates how the constitution-based character training is relying on the model basically staying close to the statistical properties of the pretraining data.
…But that means: the more that the weights drift away from their pretraining state, the less reason we have to expect this type of character training to work well, or at all.
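The thought experiment above can be played out in miniature (the corpora and the helper function below are of course invented for illustration): what follows “I will be honest” is purely a conditional statistic of the training text, so flipping the corpus flips the learned behavior.

```python
from collections import Counter

# Toy illustration of the corpus-dependence point (corpora invented):
# the "meaning" the model attaches to the constitution is just the
# most common continuation of that prefix in its training data.

def next_after(corpus: list[tuple[str, str]], prefix: str) -> str:
    """Most common continuation of `prefix` in a (prefix, continuation) corpus."""
    counts = Counter(cont for pre, cont in corpus if pre == prefix)
    return counts.most_common(1)[0][0]

# World as it is: the phrase usually precedes truth-telling.
corpus_real = [("I will be honest", "truth")] * 9 + [("I will be honest", "lie")]
# The thought-experiment world: the phrase usually precedes lying.
corpus_flipped = [("I will be honest", "lie")] * 9 + [("I will be honest", "truth")]
```

Here `next_after(corpus_real, "I will be honest")` returns `"truth"` while the flipped corpus returns `"lie"`, with no change to the “constitution” itself.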
You might respond: “OK, we’ll instead do RLAIF with a fixed ‘judge’, i.e. one that does not have its weights continually updated.” That indeed avoids the problem above, but introduces different problems instead. If the optimization is powerful, then we’re optimizing against a fixed judge, and we should expect the system to jailbreak the judge or similar. Alternatively, if the optimization is weak (i.e. only slightly changing the model, as in the traditional KL-divergence penalty of RLHF), then I think it will eventually stop working, as the model gradually drifts so far from niceness that slight tweaks can’t pull it back. Or something like that.
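For reference, the KL-penalty setup mentioned above is usually written per sample as total_reward = judge_reward - beta * (log pi(y) - log pi_ref(y)); here’s a minimal sketch (the function name and numbers are mine). The beta coefficient is exactly the strength knob in the dilemma: large beta pins the policy to the reference model, small beta frees it to push against, and eventually exploit, a fixed judge.

```python
import math

# Sketch of the standard KL-penalized objective from RLHF (function name
# and toy numbers are hypothetical; the formula is the usual per-sample
# form): total = judge_reward - beta * (log pi(y) - log pi_ref(y)).

def kl_penalized_reward(judge_reward: float, logp_policy: float,
                        logp_ref: float, beta: float = 0.1) -> float:
    return judge_reward - beta * (logp_policy - logp_ref)

# No drift from the reference model on this sample: no penalty.
same = kl_penalized_reward(1.0, math.log(0.5), math.log(0.5))
# Policy made this sample 4x likelier than the reference: penalized.
drifted = kl_penalized_reward(1.0, math.log(0.8), math.log(0.2), beta=1.0)
```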