I am sorry, I am not sure I quite understand what you are getting at with the Bezos and Stalin examples. If you agree that having a ruthless, sociopathic AHI (Stalin?) is a big deal, why start with the more distant, uncertain, and harder-to-reason-about ASI scenario?
> Can you walk through a concrete example of what someone can do with such a system? Ideally something that’s very impactful, e.g. so impactful that it could plausibly cause or prevent human extinction.
I can’t give an example that goes much beyond self-driving. However, self-driving (and other autonomous robotics applications) is quite a big deal (Tesla is currently worth more than “capture the light cone of all future value in the universe”-OpenAI). All vehicles suddenly going rogue (caused by a sociopathic AHI) would probably end human civilization (though not quite cause human extinction).
I wanted to suggest that once you’ve mapped such a world model, and the world model is everything the agent can use for planning, you can do virtual look-ahead, i.e. roll-outs in the world model conditioned on your policy, and check whether the policy ends up killing 200,000,000 people, or significantly more than one would expect given what humans statistically do in their day-to-day, or producing strange effects at a distance that were not part of the goal state / seed state, etc. - i.e., screen for scheming.
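For concreteness, here is a minimal toy sketch of the kind of screening I have in mind. Everything here (the world-model interface, the harm metric, the thresholds, the toy policies) is an illustrative assumption, not a claim about how a real brain-like world model would be queried:

```python
import copy

# Toy "virtual look-ahead": roll a candidate policy forward inside the
# world model and discard it if simulated outcomes fall far outside a
# human behavioural baseline.  The world model here is a trivial stand-in.

def rollout(world_step, policy, state, horizon):
    """Simulate `horizon` steps of the policy inside the world model."""
    for _ in range(horizon):
        state = world_step(state, policy(state))
    return state

def screen_policy(world_step, policy, seed_state, harm, baseline,
                  n_rollouts=20, horizon=100, tolerance=10.0):
    """Accept the policy only if no roll-out's estimated harm exceeds
    `tolerance` times what humans statistically cause (`baseline`)."""
    for _ in range(n_rollouts):
        final = rollout(world_step, policy, copy.deepcopy(seed_state), horizon)
        if harm(final) > tolerance * baseline:
            return False  # suspicious: throw the policy away
    return True

# Demo world: the state just tracks casualties caused so far.
def world_step(state, action):
    state["casualties"] += action
    return state

harm = lambda s: s["casualties"]
benign = lambda s: 0        # never harms anyone
ruthless = lambda s: 1000   # racks up casualties every simulated step

print(screen_policy(world_step, benign, {"casualties": 0}, harm, baseline=1.0))    # -> True
print(screen_policy(world_step, ruthless, {"casualties": 0}, harm, baseline=1.0))  # -> False
```

The honeypot idea below fits the same frame: seed the roll-out with a tempting state and see whether the policy takes the bait.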
> I don’t think you can do anything nontrivially useful in the world if you have a cortex-like algorithm but don’t attach it to any RL, as I explained in my last comment.
I agree (if RL overlaps with the things I listed under GOFAI), but if you can separate the world model from the RL part at any given point, you can uproot most sociopathic tendencies: screen for sociopathic behaviour / intents via virtual roll-outs (see above), with honeypots if you like, and throw away the policy if it looks suspicious.
> “Not an open book” is an understatement. It’s a massive unlabeled data structure. It would be a huge research project to understand it, at best. Then if you want to do anything useful with it, you presumably need to do continual learning, so the data structure keeps changing, so you need to keep pausing the system and doing huge research projects.
I think reverse-engineering the human brain’s learning algorithms and scaling them to ASI is much harder than doing mechanistic interpretability on a world model that is more compact than current frontier LLMs and that is aligned with the human mind.
> But that requires the RL part, and having the AI go out into the world, and do things autonomously, with open-ended continual learning, and no human in the loop. These big-big-big-deal things do not seem compatible with what you seem to be describing.
On the contrary, I believe it is outright impossible to go into the real world without having the brain-like world model down first (or without being very tolerant of collateral damage, even on one’s own side; that doesn’t seem like a valid business strategy to me).*
> Meanwhile you get outcompeted by the next firm down the street that’s running 100,000 of these things at full speed.
This implies that the only way to escape your scenario is to solve the alignment problem and be faster than everyone else, because only then will they copy your recipe, no?
How about we instead go the political route: convince everybody that having AHI gives us all the nice things we need (except for immortality [how is this a good thing on a societal scale again?] and the incomprehensible works of super-human intellects), and collectively put an embargo on ASI research?
___
*Could you point me in some direction for a description of how ASI could be achieved in principle? Specifically, what are the requirements on training data? I understand it would involve model-based RL under your view. But would it bootstrap itself from a random init, or use human experiential priors (e.g., artifacts of human cultural evolution / things that cannot be encoded in DNA, like spider detectors)? Can it be trained in simulations, or will it need real-world interactions?
I think part of why we’re talking past each other is this: I claim that if you know how to make an AI that can autonomously do anything that a smart human adult can autonomously do, including what they can do over weeks, months, and years, then you already have radical superintelligence, or at most you’re 1-2 years away from it. See Foom & Doom §1.7 and Response to Blake Richards: AGI, generality, alignment, & loss functions §3.2.
For example, if you can make one AI that can do everything that John von Neumann can do, then you can almost definitely make 1,000,000 such AIs that are cooperating, thinking at superhuman speed, and telepathically sharing their knowledge. We can, if we like, call this collective “one AI”, albeit an AI that takes 1,000,000× more chips to run. And now it’s an ASI, right?
Likewise, if you know how to make an AI with the charisma and strategizing skills of an average person, and you also know how to make an AI with the charisma and strategizing skills of Hitler, what’s stopping you from making an AI with dramatically more charisma and strategizing skills than Hitler?
Anyway, I’m happy to focus on John von Neumann-level brain-like AGI (or AHI). I claim that we don’t know how to make such a thing that’s able to have large impacts on the world, and is not a ruthless sociopath. By “large impacts”, I mean e.g. as discussed in The Duplicator: Instant Cloning Would Make the World Economy Explode.
> I wanted to suggest that once you’ve mapped such a world model, and the world model is everything the agent can use for planning, you can do virtual look-ahead, i.e. roll-outs in the world model conditioned on your policy, and check whether the policy ends up killing 200,000,000 people, or significantly more than one would expect given what humans statistically do in their day-to-day, or producing strange effects at a distance that were not part of the goal state / seed state, etc. - i.e., screen for scheming.
Let’s say (for concreteness) that the world-model is some fancy cousin of a Bayes net, with 10M unlabeled nodes and a giant list of 1B connections of the form: “NODE 1984357 implies NODE 9238572 with strength 0.16209” (which happens to correspond to something vaguely like: “a certain brand of tire often has a certain style of hubcap”, but we don’t know that, it’s unlabeled).
And then a “plan” would be some list of, let’s say, a few thousand nodes: “CURRENT THOUGHT / PLAN = NODES 6405951, 4505739, 3901796, 3394766, …”
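To make the structure concrete, it could be encoded roughly like this (sizes scaled way down from 10M nodes / 1B edges; the node ids are the ones from the example, the extra edges and the `implied_by` function are hypothetical illustrations, not a proposed procedure). Note that even a naive inference step only ever returns more opaque integers:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Edge:
    src: int        # e.g. NODE 1984357 -- meaning unknown to us
    dst: int        # e.g. NODE 9238572 -- meaning unknown to us
    strength: float # e.g. 0.16209

# Real case: ~10M unlabeled nodes, ~1B edges like these.
world_model = [
    Edge(1984357, 9238572, 0.16209),
    Edge(6405951, 1984357, 0.83),     # hypothetical extra edges
    Edge(4505739, 7777777, 0.05),
]

plan = [6405951, 4505739, 3901796]    # "CURRENT THOUGHT / PLAN = NODES ..."

def implied_by(plan, edges, threshold=0.1):
    """Naive one-step look-ahead: which nodes does the plan strongly imply?
    The output is just more unlabeled node ids -- nothing here says whether
    any of them means 'mass casualties'."""
    in_plan = set(plan)
    return {e.dst for e in edges if e.src in in_plan and e.strength > threshold}

print(implied_by(plan, world_model))  # -> {1984357}
```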
First of all, what’s the procedure to figure out whether this particular plan will kill 200M people or not?
Second of all, when a human is doing something in the world, he’s querying this world-model maybe 5 times a second, and also editing it 5 times a second. Presumably, a human-speed AHI would be similar. If so, do you imagine that a human will be inspecting each of these plans and each of these edits? If so, aren’t you cutting the speed down by many orders of magnitude? Or if not, i.e. if it’s an automated system, then how would that work?
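A back-of-envelope version of that slowdown, where every number is an assumption (5 world-model operations per second for the AHI; a generous one hour for a human to vet a plan consisting of thousands of unlabeled node ids):

```python
# Hypothetical numbers for the human-in-the-loop bottleneck.
ahi_ops_per_second = 5            # world-model queries/edits per second
seconds_per_human_review = 3600   # one hour to inspect a single plan/edit

# Each second of AHI thought generates 5 items to review, each costing
# an hour of human time:
slowdown = ahi_ops_per_second * seconds_per_human_review
print(slowdown)  # -> 18000, i.e. roughly 4 orders of magnitude
```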