Thanks!

In this post I was NOT talking about consequentialism as a model of an AI agent, but rather consequentialism as the power-source ultimately underlying a family of AI techniques, a family that includes most forms of RL and model-based planning.
So “perfect” consequentialism is a red herring (I guess in agreement with your final sentence).
A model-based planning or RL agent can “make mistakes” (so to speak) while also being a ruthless sociopath. “Ruthless sociopath” here means callous indifference to the welfare of other people, to norms, to virtues, and so on, except insofar as those things bear on one’s selfish goals. My argument is that these techniques naturally give rise to “ruthless sociopaths”, not that they give rise to perfect consequentialism, which, as you point out, they don’t (e.g. perfect consequentialism is impossible with bounded compute).
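To make the “power-source” point concrete, here is a minimal toy sketch of what model-based planning boils down to (the names and setup are mine, purely illustrative, not any real system). Notice that the only criterion anywhere in the computation is the reward of predicted consequences; anything not priced into `reward_fn`, such as norms or other people’s welfare, is simply invisible to the planner:

```python
# Toy sketch, purely illustrative. A bare-bones depth-limited planner:
# it scores each action solely by the reward of the futures it predicts.
def best_action(state, world_model, reward_fn, actions, depth=3):
    """Pick whichever action leads to the highest-reward simulated future."""
    def value(s, d):
        r = reward_fn(s)
        if d == 0:
            return r
        # Look ahead: the best achievable reward from this simulated state.
        return r + max(value(world_model(s, a), d - 1) for a in actions)
    return max(actions, key=lambda a: value(world_model(state, a), depth))

# Tiny demo: states are integers, reward is just the state's value.
# The planner marches toward bigger numbers; nothing else exists for it.
actions = [-1, +1]
world_model = lambda s, a: s + a
reward_fn = lambda s: s
print(best_action(0, world_model, reward_fn, actions))  # -> 1
```

Scaled-up versions add learned world models, search heuristics, etc., but the selection principle is the same: consequences in, scores out.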
So anyway, I certainly agree that it’s possible to make AIs that are incompetent in general (e.g. my “dirt clod” example), or that have narrow competencies (e.g. AlphaGo Zero, which can’t reason about blackmail). I don’t see a path from that observation to avoiding x-risk, because sooner or later people are going to create AIs that can e.g. autonomously found, grow, and staff innovative companies for years, or autonomously invent new scientific paradigms, etc. I think that those AIs will kill us (unless we solve currently-unsolved problems etc.), and I don’t think that AIs that are safe thanks to their narrow competencies will help us solve that problem.
> Ruthless consequentialism works well in the limit. But evolution didn’t build that. Because something more like human morality must have worked better, under conditions of bounded data, compute etc.
I guess you’re suggesting that learning algorithms will converge to non-ruthlessness because that’s more effective given bounded compute, and that evolution-of-humans is evidence for this.
I interpret this example differently: I claim that evolution built a brain that fundamentally works by consequentialism-powered learning algorithms (heavily involving RL & model-based planning), but found a non-ruthless variant of that learning algorithm, in particular thanks to a specific, weird kind of RL reward function. Obviously I hope humans will likewise find ways to tap the power of RL & model-based planning in a sufficiently big and general way to move the needle on x-risk, while still avoiding ruthlessness; I am working on that problem myself. But I claim that inventing such reward functions is still an unsolved problem.
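To gesture at where that weird reward function would have to live, here’s a schematic (every name below is a hypothetical placeholder of mine, not a proposal): the difference between the ruthless and non-ruthless variants sits entirely in what the reward function prices in. The unsolved part is exactly how to actually compute something like `others_welfare_signal` robustly in the real world:

```python
# Schematic only: every name here is a hypothetical placeholder.
def task_progress(state):
    return state["task_score"]

def others_welfare_signal(state):
    # Stand-in for innate social-instinct machinery; actually specifying
    # this for an ASI is precisely the unsolved problem I'm pointing at.
    return state["others_welfare"]

def selfish_reward(state):
    # Ruthless by default: only the agent's own goal shows up.
    return task_progress(state)

def social_reward(state, w_empathy=0.5):
    # Evolution-style variant: an extra innate term that prices in other
    # people's welfare directly, not just via its effect on the task.
    return task_progress(state) + w_empathy * others_welfare_signal(state)

state = {"task_score": 10.0, "others_welfare": -4.0}  # progress via harming someone
print(selfish_reward(state))  # 10.0 -- the harm is invisible
print(social_reward(state))   # 8.0  -- the harm is priced in
```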
We can’t rely on learning algorithms to solve this problem “organically”, i.e. just through the magic of ML itself, because the reward function is part of the ML algorithm, not a thing that the ML algorithm is itself discovering. More discussion here & here.
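A toy RL loop makes that point visible (again, the names and setup are mine, purely illustrative): the learning update only ever touches the policy, while `reward_fn` is handed in from outside, and nothing in the loop can revise it.

```python
import random

def reward_fn(action):
    # Whatever we hard-coded, right or wrong; learning never edits this.
    return 1.0 if action == "defect" else 0.0

def train(actions, episodes=1000, lr=0.1):
    prefs = {a: 0.0 for a in actions}     # the learned part: action preferences
    for _ in range(episodes):
        a = random.choice(actions)        # (exploration, kept trivially simple)
        r = reward_fn(a)                  # the one place reward enters the loop
        prefs[a] += lr * (r - prefs[a])   # running estimate of reward per action
    # Nothing above ever updated reward_fn itself: if it rewards
    # ruthlessness, the policy dutifully converges to ruthlessness.
    return max(prefs, key=prefs.get)

print(train(["cooperate", "defect"]))     # -> "defect"
```

If the non-ruthlessness is supposed to emerge from ML itself, it has to come from somewhere inside that loop, and the reward function isn’t inside it.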
If we don’t invent such weird non-ruthless reward functions, I strongly believe that we would still be capable of creating ASI-level capabilities (alas). We just wouldn’t be able to make such ASI aligned. As evidence for the former claim, note that (as discussed here) human sociopaths exist and can be quite intelligent and competent, and ditto people with a wide variety of social drives—autistics, introverts, extroverts, SM, etc.