Steven Byrnes comments on Why we should expect ruthless sociopath ASI

Steven Byrnes 26 Feb 2026 18:31 UTC
Maybe a good example is that humans update on a ton of random observations we’re surprised by. This doesn’t seem like imitation, nor does it seem consequentialist enough to be very risky?
Right, I think humans have a distinction between beliefs and desires (“is versus ought”) that’s pretty disanalogous to how LLMs work (see discussion here), and our beliefs / “is”s get updated by predictive learning from sensory inputs. My dichotomy of consequentialism vs imitative learning in the OP was about the “ought” part, which predictive learning doesn’t help with. I.e., when you’re choosing your own actions in a novel domain, predictive learning doesn’t constrain your options.
(And I think “actions” are important even for disembodied situations like “figuring things out by thinking about them”, see §1.1 here.)
I think “asymptotically 100% consequentialist” is quite possibly wrong about the objectives used for open-ended CL training.
As a side-note, this whole conversation is pretty tricky because we’re talking about this vague hypothetical system (that allows one or more LLMs to autonomously invent and develop a rich new field of science via some form of continual learning). I don’t think such a system is even possible, whereas you seem to think it might be possible but haven’t spelled out all the details of how it would work. E.g. one of the problems is: there’s no training data for continual learning, because the new field of knowledge doesn’t exist yet. Relatedly, what’s the “objective”?
Anyway, we can keep trying, but this might be a tricky conversation to make progress with.
Back to the object level:
“Interspersing character training” is an interesting idea (thanks), but after thinking about it a bit, here’s why I think it won’t work in this context. BTW I’m interpreting “character training” per the four-bullet-point “pipeline” here, lmk if you meant something different.
Character training (as defined in that link) seems to rely on the idea that the tokens “I will be helpful, and honest, and harmless, blah blah…” are more likely to be followed by tokens that are in fact helpful, and honest, and harmless, blah blah, than is the case when that constitution is not there as a prefix. That’s a good assumption for LLMs of today, but why? I claim: it’s because LLMs are generalizing from the human-created text of the pretraining data.
As a thought experiment: If, everywhere on the internet and in every book etc., whenever a human said “I’m gonna be honest”, they then immediately lied, then character-training with a constitution that said “I will be honest” would lead to lying rather than honesty. Right? Indeed, it would be equivalent to flipping the definition of the word “honest” in the English language. So again, this illustrates how the constitution-based character training is relying on the model basically staying close to the statistical properties of the pretraining data.
…But that means: the more that the weights drift away from their pretraining state, the less reason we have to expect this type of character training to work well, or at all.
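The thought experiment above can be made concrete with a toy sketch. It assumes a "model" that does nothing except count what follows a prefix in its training text, which is of course far simpler than a real LLM. The corpora, the labels "truth" and "lie", and the function name `next_token_probs` are all made up for illustration.

```python
from collections import Counter

def next_token_probs(corpus, prefix):
    """Estimate P(next | prefix) by counting what follows `prefix` in the corpus."""
    counts = Counter(nxt for cur, nxt in corpus if cur == prefix)
    total = sum(counts.values())
    if total == 0:
        return {}  # prefix never seen: the counts give us nothing to go on
    return {tok: n / total for tok, n in counts.items()}

PREFIX = "I'm gonna be honest"

# Hypothetical corpora of (announcement, what the speaker then does) pairs.
# In the first, the announcement is usually followed by the truth;
# in the second, it is usually followed by a lie.
normal_world = [(PREFIX, "truth")] * 9 + [(PREFIX, "lie")] * 1
flipped_world = [(PREFIX, "lie")] * 9 + [(PREFIX, "truth")] * 1

normal = next_token_probs(normal_world, PREFIX)
flipped = next_token_probs(flipped_world, PREFIX)

print(normal)
print(flipped)
```

The same prefix mostly predicts "truth" under the first corpus and "lie" under the second. In this toy setting, what the prefix buys you is fixed entirely by the statistics of the text the counts came from, not by the words in the prefix.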
You might respond: “OK, we’ll instead do RLAIF with a fixed ‘judge’, i.e. one that does not have its weights continually updated.” That indeed avoids the problem above, but introduces different problems instead. If the optimization is powerful, then we’re optimizing against a fixed judge, and we should expect the system to jailbreak the judge or similar. Alternatively, if the optimization is weak (i.e. only slightly changing the model, as in the traditional KL-divergence penalty of RLHF), then I think it will eventually stop working as the model gradually drifts so far away from niceness that slight tweaks can’t pull it back. Or something like that.
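For reference, the KL-penalized objective usually written down for RLHF looks roughly like the following. The notation is an assumption and not taken from this thread: $\pi_\theta$ is the model being trained, $\pi_{\mathrm{ref}}$ is a frozen reference copy of it, $r$ is the score from the fixed judge, and $\beta$ sets the strength of the penalty.

```latex
\max_{\theta} \;
\mathbb{E}_{x \sim \mathcal{D}}
\Big[
  \mathbb{E}_{y \sim \pi_\theta(\cdot \mid x)} \big[ r(x, y) \big]
  \;-\; \beta \,
  \mathrm{KL}\big( \pi_\theta(\cdot \mid x) \,\big\|\, \pi_{\mathrm{ref}}(\cdot \mid x) \big)
\Big]
```

In these terms, the “powerful optimization” case corresponds to small $\beta$, where the reward term dominates and the model is free to find whatever the judge scores highly. The “weak optimization” case corresponds to large $\beta$, where the model stays close to $\pi_{\mathrm{ref}}$, so the character-training step can only make slight tweaks to wherever the model already is.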