I think the human brain’s superpower does not lie in advanced RL algorithms for figuring out our life policy.
Bob is a mathematician. One day, he’s facing a certain type of problem, and he tries a certain proof strategy, and it doesn’t work. I claim that, next time Bob finds himself facing a similar type of problem, he’s less likely to try that particular proof strategy.
I claim that this change involved RL in Bob’s brain. And this process (repeated countless times) is essential to how Bob succeeds at math.
And likewise, gaining expertise in anything—piano, sports, real estate, legal defenses, whatever—involves RL in a person’s brain.
And not just acquiring the expertise, but also using it. The pianist needs RL to learn new songs. Bob the mathematician needs RL to gain proficiency with a new math concept that he just invented yesterday. A legal expert needs RL because each case is a bit different than anything she’s seen before. Etc.
See also: §1 of this post on why RL is core to “figuring things out”, which is core to human competence.
brain-like perception + GOFAI
I don’t understand this. The world-model will still involve millions of inscrutable unlabeled concepts, right? (See here, here.) What does GOFAI even mean in that context?
you can throw away the agentic part after you are done training and just keep the perception module that gives you the world model which you then query with GOFAI
As I mentioned in the OP, there’s a quadrillion-dollar market for AIs that can go out in the world and figure things out, 100% autonomously and way outside the distribution of things that humans already know. Someone will make such AIs sooner or later, unless they’re eternally prevented from doing so by force. I’m confused about how you think about that problem.
If you want to talk about making much weaker AIs safe, and “AHI”, isn’t that already kinda a solved problem, via the LLMs that exist right now? They already exist, they can already do lots of things, and they seem basically safe, in the grand scheme of things—i.e., negligible extinction risk, and some problems but only the usual kinds of problems associated with new technologies. I’m focused on the more powerful AIs of the future, which (I claim) might cause human extinction etc. If you’re focused on some other problem besides that, then what is it? I.e., what is motivating you to think about “brain-like perception + GOFAI”? What problem do you imagine that it might solve?
Thanks for taking the time to respond.
I am not saying humans don’t use RL. I am saying that RL is not what makes us special compared to current SotA (LLM or RL) models; our perception is. AlphaZero blows us away in closed, non-fuzzy domains. Our ability for abstraction, which I claim is mostly an extension of perception, is what makes us special: finding a robust hierarchy of coarse-grainings in perceptual chaos through self-supervised learning, where RL is mostly there to maximize for interestingness. Some call it understanding.
By GOFAI I mean things like MCTS (with good pruning heuristics), TD learning (over a hierarchy of abstract states), production systems, model predictive control, etc. I claim that we don’t need to train a fancy policy network to build very useful stuff. For example, having a reasonable predictive model of car and environment dynamics, sampled at, say, 50 Hz, that does not break under adversarial perturbations over a 30-second time horizon is extremely useful.
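As a minimal sketch of what I mean by planning without a trained policy network: a toy model-predictive-control loop using random shooting over an assumed learned dynamics model. All function names and the toy dynamics here are illustrative, not a real system:

```python
import numpy as np

def mpc_random_shooting(dynamics, cost, state, horizon=30, n_candidates=256,
                        action_dim=1, rng=None):
    """Return the first action of the lowest-cost sampled action sequence.

    `dynamics(state, action) -> next_state` is the learned predictive model;
    `cost(state) -> float` scores how bad a state is. No policy network is
    trained: planning happens at decision time by rolling the model forward.
    """
    if rng is None:
        rng = np.random.default_rng(0)
    best_cost, best_first_action = np.inf, None
    for _ in range(n_candidates):
        actions = rng.uniform(-1.0, 1.0, size=(horizon, action_dim))
        s, total = state, 0.0
        for a in actions:
            s = dynamics(s, a)
            total += cost(s)
        if total < best_cost:
            best_cost, best_first_action = total, actions[0]
    return best_first_action

# Toy stand-in: 1-D point mass that should steer toward the origin,
# integrated at a 50 Hz step size (0.02 s).
dynamics = lambda s, a: s + 0.02 * a
cost = lambda s: float(abs(s[0]))
action = mpc_random_shooting(dynamics, cost, np.array([1.0]))
```

The point of the sketch is only that the "intelligence" lives in the predictive model plus a dumb search, exactly the GOFAI-style recipe described above.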
Yes, a brain-like world model is not an open book. But it should be easier to understand this thing than some world model + some policy network which might be in superposition. When you have a brain-like world model you can combine it for instance with a (potentially malicious) policy and sample future trajectories to evaluate the expected outcomes of the policy.
Yes, ASI will happen, somehow, sometime. I think your concerns are valid in principle and also apply to AHI involving unaligned end-to-end RL, which I think is also much more likely to occur in the next decade. LLMs as of today are not AHI, and scaling them won’t get us to AHI. ASI scenarios, however, are extremely uncertain, and there are many more known and unknown dangers involved, which dilutes the expected utility of solving the problem you pose.
Do you think there is no grave danger in consequentialist RL if ASI (something superhuman, incomprehensible by default) is not involved?
RL is not what makes us special compared to current SotA (LLM or RL) models
I don’t think the human RL system is especially fancy, and I certainly don’t think it’s the secret sauce that will unlock ASI. I think the secret sauce of human intelligence is the cortex (and thalamus etc.), which I think agrees with you. But “the RL system is not some fancy secret sauce” is different from “the RL system is optional, and we can just leave it out entirely without sacrificing anything”. I don’t think you can do anything nontrivially useful in the world if you have a cortex-like algorithm but don’t attach it to any RL, as I explained in my last comment.
There also seems to be a clash where I’m using the term “RL” more broadly than you are. For example, you categorize TD learning as part of GOFAI, whereas I would categorize TD learning as part of RL. This is just terminology, so whatever, but it does lead to us talking past each other. For example, I probably agree with “we don’t need to train a fancy policy network”. It doesn’t need to be fancy.
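For concreteness, here is the textbook tabular TD(0) update I have in mind when I file TD learning under RL: the update is driven by a reward signal. The two-state chain environment is just an illustrative toy:

```python
def td0_value_estimates(episodes, alpha=0.1, gamma=0.9):
    """Tabular TD(0): V(s) <- V(s) + alpha * (r + gamma * V(s') - V(s)).

    `episodes` is a list of trajectories of (state, reward, next_state)
    tuples. The update is driven by reward, which is why TD learning is
    standard RL rather than pure model-based search.
    """
    V = {}
    for trajectory in episodes:
        for s, r, s_next in trajectory:
            v, v_next = V.get(s, 0.0), V.get(s_next, 0.0)
            V[s] = v + alpha * (r + gamma * v_next - v)
    return V

# Two-state chain: A -> B -> terminal, with reward 1 on the final step.
episodes = [[("A", 0.0, "B"), ("B", 1.0, "end")]] * 100
V = td0_value_estimates(episodes)
# V("B") approaches 1.0, and V("A") approaches gamma * V("B") = 0.9.
```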
Yes, a brain-like world model is not an open book. But it should be easier to understand this thing than some world model + some policy network which might be in superposition.
It’s not “easier” unless you actually have a viable plan. Is it “easier” to survive a 100 km asteroid strike than a 150 km asteroid strike? Well, if you give me a choice, I’ll choose the 100 km one. But it doesn’t actually matter, because both of those options will definitely kill everyone.
“Not an open book” is an understatement. It’s a massive unlabeled data structure. It would be a huge research project to understand it, at best. Then if you want to do anything useful with it, you presumably need to do continual learning, so the data structure keeps changing, so you need to keep pausing the system and doing huge research projects. Meanwhile you get outcompeted by the next firm down the street that’s running 100,000 of these things at full speed.
When you have a brain-like world model you can combine it for instance with a (potentially malicious) policy and sample future trajectories to evaluate the expected outcomes of the policy.
Can you walk through a concrete example of what someone can do with such a system? Ideally something that’s very impactful, e.g. so impactful that it could plausibly cause or prevent human extinction.
As an example of what I tend to think about: Jeff Bezos alone earned $250B. So just run 100 Bezos-level AHIs and ask them to start companies, and you can in principle earn $25T over the next 20 years.
…And you can do a lot more than that! Stalin was a single human-level intelligence, and he maneuvered his way into dictatorial control over 200,000,000 people. Right?
See what I mean? Actual human-level intelligence is a big, big, big deal.
But that requires the RL part, and having the AI go out into the world, and do things autonomously, with open-ended continual learning, and no human in the loop. These big-big-big-deal things do not seem compatible with what you seem to be describing. The fact that you brought up self-driving cars seems to indicate that you’re not appreciating the stakes, I think? Or sorry if I’m misunderstanding.
I am sorry, I am not sure I quite understand what you are getting at with the Bezos and Stalin examples. If you agree that having a ruthless sociopathic AHI (Stalin?) is a big deal, why start with the more distant, uncertain, and hard-to-reason-about ASI scenario?
Can you walk through a concrete example of what someone can do with such a system? Ideally something that’s very impactful, e.g. so impactful that it could plausibly cause or prevent human extinction.
I can’t give an example that goes much beyond self-driving. However, self-driving (and other autonomous robotics applications) is quite a big deal (Tesla is currently worth more than “capture the light cone of all future value in the universe”-OpenAI). All vehicles suddenly going rogue (caused by a sociopathic AHI) would probably end human civilization (though not quite cause human extinction).
I wanted to suggest that once you’ve mapped such a world model, and the world model is everything the agent can use for planning, you can do virtual look-ahead, i.e., roll-outs in the world model conditioned on your policy, and see whether the policy ends up killing 200,000,000 people, or significantly more than one would expect given what humans statistically do in their day-to-day lives, or producing strange effects at a distance that were not part of the goal state / seed state, etc.; i.e., you can screen for scheming.
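As a cartoon of the screening loop I have in mind; the world model interface, policy interface, and the scalar "casualties" signal are all hypothetical stand-ins, not a claim about how a real system would be built:

```python
import random

def screen_policy(world_model, policy, seed_states, horizon=100,
                  n_rollouts=50, casualty_threshold=0.0, rng_seed=0):
    """Roll a candidate policy forward inside a trusted world model and
    reject it if any simulated trajectory predicts harm above a threshold.

    `world_model(state, action) -> (next_state, predicted_casualties)` is
    assumed to be a separately validated predictive model; `policy(state)
    -> action` is the untrusted artifact under test.
    """
    rng = random.Random(rng_seed)
    for _ in range(n_rollouts):
        state = rng.choice(seed_states)
        for _ in range(horizon):
            action = policy(state)
            state, casualties = world_model(state, action)
            if casualties > casualty_threshold:
                return False  # reject: predicted harm exceeds threshold
    return True  # no rollout tripped the screen (which is not proof of safety)

# Toy check: harm is predicted whenever the state walks past 5.
world_model = lambda s, a: (s + a, 1.0 if s + a > 5 else 0.0)
benign = lambda s: 0      # stays put
malicious = lambda s: 1   # walks toward the harmful region
ok_benign = screen_policy(world_model, benign, [0])
ok_malicious = screen_policy(world_model, malicious, [0])
```

The honest caveat, flagged in the comment above: passing the screen only means no sampled rollout tripped it, not that the policy is safe.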
I don’t think you can do anything nontrivially useful in the world if you have a cortex-like algorithm but don’t attach it to any RL, as I explained in my last comment.
I agree (if RL overlaps with the things I listed under GOFAI), but if you can separate the world model from the RL part at any given point, you can uproot most sociopathic tendencies: screen for sociopathic behaviour / intents via virtual roll-outs (see above), if you like with honeypots, and throw away the policy if it looks suspicious.
“Not an open book” is an understatement. It’s a massive unlabeled data structure. It would be a huge research project to understand it, at best. Then if you want to do anything useful with it, you presumably need to do continual learning, so the data structure keeps changing, so you need to keep pausing the system and doing huge research projects
I think reverse-engineering the human brain’s learning algorithms and scaling them to ASI is much harder than doing mechanistic interpretability (MI) on a world model that is more compact than current frontier LLMs and which is aligned with the human mind.
But that requires the RL part, and having the AI go out into the world, and do things autonomously, with open-ended continual learning, and no human in the loop. These big-big-big-deal things do not seem compatible with what you seem to be describing.
On the contrary, I believe it is outright impossible to go into the real world without having the brain-like world model down first (or else be very tolerant of collateral damage, even on one’s own side, which doesn’t seem like a valid business strategy to me).*
Meanwhile you get outcompeted by the next firm down the street that’s running 100,000 of these things at full speed.
This implies that the only way to escape your scenario is to solve the alignment problem and be faster than everyone else, because only then will they copy your recipe, no?
How about we instead go the political route: convince everybody that having AHI gives us all the nice things we need (except for immortality [how is this a good thing on a societal scale again?] and incomprehensible works of superhuman intellects), and collectively put an embargo on ASI research?
___
*Could you point me in some direction for a description of how ASI can be achieved in principle? Specifically, what are the requirements on training data? I understand it would involve model-based RL under your view. But will it bootstrap itself from random initialization, or use human experiential priors (e.g., artifacts of human cultural evolution / things that cannot be encoded in DNA, like spider detectors)? Can it be trained in simulations, or will it need real-world interactions?
I think part of why we’re talking past each other is this: I claim that, if you know how to make an AI that can autonomously do anything that a smart human adult can autonomously do, including what they can do over weeks, months, and years, then you already have radical superintelligence, or at most you’re 1-2 years away from it. See Foom & Doom §1.7 and Response to Blake Richards: AGI, generality, alignment, & loss functions §3.2.
For example, if you can make one AI that can do everything that John von Neumann can do, then you can almost definitely make 1,000,000 such AIs that are cooperating, thinking at superhuman speed, and telepathically sharing their knowledge. We can, if we like, call this collective “one AI”, albeit an AI that takes 1,000,000× more chips to run. And now it’s an ASI, right?
Likewise, if you know how to make an AI with the charisma and strategizing skills of an average person, and you also know how to make an AI with the charisma and strategizing skills of Hitler, what’s stopping you from making an AI with dramatically more charisma and strategizing skills than Hitler?
Anyway, I’m happy to focus on John von Neumann-level brain-like AGI (or AHI). I claim that we don’t know how to make such a thing that’s able to have large impacts on the world, and is not a ruthless sociopath. By “large impacts”, I mean e.g. as discussed in The Duplicator: Instant Cloning Would Make the World Economy Explode.
I wanted to suggest that once you’ve mapped such a world model, and the world model is everything the agent can use for planning, you can do virtual look-ahead, i.e., roll-outs in the world model conditioned on your policy, and see whether the policy ends up killing 200,000,000 people, or significantly more than one would expect given what humans statistically do in their day-to-day lives, or producing strange effects at a distance that were not part of the goal state / seed state, etc.; i.e., you can screen for scheming.
Let’s say (for concreteness) that the world-model is some fancy cousin of a Bayes net, with 10M unlabeled nodes and a giant list of 1B connections of the form: “NODE 1984357 implies NODE 9238572 with strength 0.16209” (which happens to correspond to something vaguely like: “a certain brand of tire often has a certain style of hubcap”, but we don’t know that, it’s unlabeled).
And then a “plan” would be some list of let’s say a few thousand nodes: “CURRENT THOUGHT / PLAN = NODES 6405951, 4505739, 3901796, 3394766, …”
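To make the shape of this object concrete, here is a toy Python rendering of the kind of data structure I mean; every node ID, strength, and method name is made up for illustration:

```python
from dataclasses import dataclass, field

@dataclass
class UnlabeledWorldModel:
    """Toy stand-in for the hypothetical world model: weighted implications
    between opaque integer node IDs, with no human-readable labels anywhere."""
    edges: dict = field(default_factory=dict)  # (src, dst) -> strength

    def add(self, src, dst, strength):
        self.edges[(src, dst)] = strength

    def activate(self, plan_nodes, threshold=0.1):
        """One step of spreading activation: which nodes does a plan imply?"""
        implied = set(plan_nodes)
        for (src, dst), strength in self.edges.items():
            if src in implied and strength >= threshold:
                implied.add(dst)
        return implied

wm = UnlabeledWorldModel()
wm.add(1984357, 9238572, 0.16209)   # "tire brand -> hubcap style", unlabeled
plan = {6405951, 4505739, 1984357}
implied = wm.activate(plan)  # just more opaque node IDs
```

Note that everything the interpreter can compute here is more unlabeled node IDs; the safety-relevant question is what those IDs *mean*, which is exactly what the data structure does not tell you.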
First of all, what’s the procedure to figure out whether this particular plan will kill 200M people or not?
Second of all, when a human is doing something in the world, he’s querying this world-model maybe 5 times a second, and also editing it 5 times a second. Presumably, a human-speed AHI would be similar. If so, do you imagine that a human will be inspecting each of these plans and each of these edits? If so, aren’t you cutting the speed down by many orders of magnitude? Or if not, i.e. if it’s an automated system, then how would that work?