ryan_greenblatt comments on Current AIs seem pretty misaligned to me

ryan_greenblatt 15 Apr 2026 17:15 UTC
10 points
6
Did you read Appendix: Apparent-success-seeking (or similar types of misalignment) could lead to takeover? The concern discussed here (where AIs with this sort of misalignment take over) requires that AIs are wildly more capable than they are today.

Note that “the AIs are (de facto) driven most by making their performance look good” is consistent with trying extremely hard. AIs don’t “try hard enough” to cheat/hack today, but they do try extremely hard (including pursuing moderately creative strategies) on difficult (and relatively easy-to-check) tasks, and it’s not hard to imagine a situation where AIs are really applying themselves to longer-run notions of fitness-seeking/reward-seeking/apparent-success-seeking. This would yield a situation where if an AI could (easily) take over, it would want to. The AIs wouldn’t necessarily have incentives to collude (it depends on details), but it’s not that hard to see how you’d get takeover in a situation like: (1) the AIs are running everything, (2) the AIs would each individually want to take over, (3) the AIs are wildly superhuman, and (4) humans barely understand what is going on.
- Petropolitan 15 Apr 2026 21:22 UTC
  1 point
  −4
  Parent
  Yes, I did (in fact twice), and you seem to handwave “sufficiently capable” as a deus ex machina instead of tackling the substance of my argument. One has to assume by default the jaggedness of capabilities will persist, and “wildly more capable than today” in easy-to-verify domains doesn’t solve the problem I describe.
  As for trying extremely hard, if “careless attitude” and “laziness” are the wrong words for this behaviour, maybe “dishonesty”, “unreliability”, or “sloppiness” would be better? Please try to abstract away from the technical alignment terminology: I’m not talking about the AI’s strategy, willingness, or incentives to take over, but about the execution itself.
  Elizabeth Holmes is very apparent-success-seeking and competent enough in life sciences and conspiracy to sustain a complex deception for years (imperfect analogy because we are debating about internal dysfunction here rather than deception per se but I hope you get the point) but that didn’t translate into success in the end because the product never worked.
  Why wouldn’t the same “sloppiness” that plagues the hypothetical future AI safety research equally plague any sufficiently complex real-world takeover plan?
  P. S.
  In my understanding, if this problem persists, AIs are not running everything, because they are still subhuman in many hard-to-verify tasks.
  - ryan_greenblatt 15 Apr 2026 21:52 UTC
    8 points
    1
    Parent
    
    One has to assume by default the jaggedness of capabilities will persist
    
    I’m talking about systems that are much more capable than humans in ~all relevant dimensions. (They might also be wildly, wildly more capable in some particular dimensions.) If you don’t think such systems are plausible soon-ish (e.g. next 10-20 years), that might be driving the disagreement.
    
    Why wouldn’t the same “sloppiness” that plagues the hypothetical future AI safety research equally plague any sufficiently complex real-world takeover plan?
    
    It depends on whether the cause of the sloppiness is a competence problem or if it’s due to misalignment (the AI totally could do a good job, but it’s trying to do something else). What we have today is somewhere in between.
    
    For the takeover threat model, I’m imagining a world where AIs are capable of doing a good job on extremely sophisticated real-world tasks like “fully autonomously run all the manufacturing and R&D for a robot army that greatly outstrips early-2026 conventional militaries”. So, there is an important sense in which these future AIs aren’t sloppy at all. By this point, I also expect AIs won’t qualitatively feel sloppy, though it may in practice be very hard to elicit good conceptual safety research from them due to verification difficulties (while you can verify an apparently functioning robot army, etc.). I think the slop concern and the “AIs literally take over due to reward/fitness/apparent-success seeking” concern are reasonably separate (and hit at different points in the capabilities progression), though they may have the same underlying causes. For the same reason these AIs can do a good job building and operating a military, I think they will have the ability to take over, though the story is complicated because it might be possible to set up the incentives/reinforcement for some of the AIs to whistleblow (but it might also be hard to set up the incentives/reinforcement in a way that doesn’t yield AIs constantly and spuriously whistleblowing in cases where it’s hard for us to adjudicate what is going on).
    
    As for “dishonesty”, “unreliability”, and “sloppiness”, I think for AIs like this (in the central version of the threat model I’m describing), these will only be done instrumentally, in ways that actually are good ideas from the perspective of the AIs’ motivations. So, if the AIs can line up their incentives, they’d be able to totally pull off the takeover, and they would want to do this if they could coordinate it (though the misalignment between different AIs makes the situation complex to model).
    - Petropolitan 15 Apr 2026 22:27 UTC
      2 points
      0
      Parent
      I expanded my previous comment significantly after posting it, hope it didn’t mess with your response.
      I think we have something in between because these issues are actually connected. I do believe AIs superhuman in hard-to-verify tasks are plausible, but they won’t have this particular problem anymore (maybe they would have some functional analog of shame working against it^[1], or maybe it will just go away with some advances in RL).
      But if this issue isn’t solved, AIs are unlikely to be able to run basic military procurement tasks fully autonomously (especially if other, external AIs try to scam them), let alone equip a robot army. Think about all the hard-to-verify tasks involved (ask an LLM if you have no idea about the topic) and how easily they could fail if apparent-success-seeking is prioritized (even if not a single AI within the conspiracy seriously considers just stealing the money and running away, which would to a large degree be an incentive issue).
      ^[1] And not the “dog” variety of shame, which is actually just an appeasement kind of behavior, like when an LLM apologizes for hallucinating some data, but genuine internal “prosocial” (at least within the group) enforcement, which might not be compatible with the current RL paradigm.