plex comments on rohinmshah’s Shortform

plex 18 Feb 2026 14:44 UTC
33 points
4
If I were to make statements like that (which I haven’t exactly), I would be referring to superintelligence misalignment risks specifically, as that seems like by far the tightest bottleneck on surviving futures. The linked paper says:
To address misalignment, we outline two lines of defense. First, model-level mitigations such as amplified oversight and robust training can help to build an aligned model. Second, system-level security measures such as monitoring and access control can mitigate harm even if the model is misaligned.
Which don’t seem like the class of approach which could be sufficient for handling superintelligence-level optimization, for reasons I’m sure you’re tracking given you later say:
“This means that our approach covers conversational systems, agentic systems, reasoning, learned novel concepts, and some aspects of recursive improvement, while setting aside goal drift and novel risks from superintelligence as future work.”
Do you have a plan for superintelligence misalignment risks?
- Rohin Shah 19 Feb 2026 6:20 UTC
  5 points
  1
  Parent
  Some reactions:
  - I don’t think it makes sense to “have a plan” in the sense that is used in this community. See also disagreement #26.
  - Nonetheless to the extent I personally (not necessarily GDM!) “have a plan”, it might be “continually forecast capabilities and risks for some time out into the future, figure out how to address them, iterate”. If “have the AIs do our alignment homework” is a plan, then this should count as a plan too.
  - For misalignment in particular, I think the lines of defense outlined in the paper could scale to superintelligence (mostly on the alignment side). But I am not so dumb as to think that I have clearly foreseen every issue that might come up, so of course I should expect to be surprised and for other stuff I haven’t thought of to be important as well.
  - (Inevitably someone is going to say “you only get one try” or some such. The actual sensible point there is “at some point your approach has to generalize from AIs that can’t take over to AIs that can”. I agree by that point you need to have dealt with the issues. But that generalization gap is much smaller than the generalization gap between Gemini 3 and superintelligence.)
  - Iirc, “novel risks from superintelligence” wasn’t meant to gesture at misalignment, but rather other risks that come up that aren’t misalignment.
  What links here?
  - AI #157: Burn the Boats by Zvi (26 Feb 2026 13:30 UTC; 44 points)
  - plex 19 Feb 2026 10:47 UTC
    4 points
    2
    Parent
    Hum, I usually expect that large complex important projects should have a roadmap, some sketch of the future that goes well with details to fill in. The more detailed it is, the more we check it for consistency and likelihood to work. Does this match you general experience with planning projects trying to achieve a goal?
    What you say there looks like an extremely vague and high level roadmap that sounds to me like ‘we’ll figure out our plan as we go as data comes in’, plus automated alignment.
    I would be really enthusiastic for you and your team to try unblurring that roadmap, and seeing what difficulties you find at superintelligence level on the current path.
    - Rohin Shah 19 Feb 2026 13:18 UTC
      1 point
      0
      Parent
      Hum, I usually expect that large complex important projects should have a roadmap, some sketch of the future that goes well with details to fill in. The more detailed it is, the more we check it for consistency and likelihood to work. Does this match you general experience with planning projects trying to achieve a goal?
      No.
      It does match my general experience with moderate tactical projects (say, projects that involve up to about 10 person-years of research effort). But not for large complex important projects.
      (And e.g. this is very much not the standard advice for startups, which also have the problem of doing something novel.)
      What you say there looks like an extremely vague and high level roadmap that sounds to me like ‘we’ll figure out our add we go as data comes in’, plus automated alignment.
      Well yes, it’s an aside in a LessWrong comment that I dashed off in a few minutes.
      I would be really enthusiastic for you and your team to try unblurring that roadmap, and seeing what difficulties you find at superintelligence level on the current path.
      There is also a 100+ page paper that I linked in the original post, that goes into a fair amount of detail on what the various risks and mitigations might look like. In my experience, nobody outside of GDM really seems to care about its consistency or likelihood to work (except inasmuch as people dismiss it without reading it because of a prior that anything proposed currently will not work).
      - plex 19 Feb 2026 13:40 UTC
        1 point
        −8
        Parent
        No.
        Okay, that is a position which there might be good arguments for, but that seems important to say loudly and clearly, both inside GDM and outside, that you do not have a plan or roadmap for superintelligence misalignment (even if you don’t think you should have one). If nothing else, this is the kind of thing your leadership should be made aware of explicitly, so they can either adjust that or use it in their own public communications to try and reduce race dynamics.
        It does match my general experience with moderate tactical projects (say, projects that involve up to about 10 person-years of research effort). But not for large complex important projects.
        Okay, would you like to bet on whether some of the largest research programs had plans going into them? I haven’t checked, but I would put at least 10:1 odds that if we pick say 3 projects like Apollo Program, Manhattan Project, and others on a similar scale and type they will all have had a high level roadmap of things to try which could plausibly address the core challenges quite early on^[1], even if a lot of details ended up changing when they ran into reality.
        There is also a 100+ page paper that I linked in the original post, that goes into a fair amount of detail on what the various risks and mitigations might look like.
        When I ask a plain no special prompting history off AI to summarize, it says:
        (detailed analysis of non-superintelligence focused bits)
        Is there a different document which does focus on either different approaches which are aimed at superintelligence, or analyzing whether these approaches are actually fit for that challenge? Or is this summary incorrect, in a way it would be much easier for you to point out and quote the relevant sections, as an author of the paper, than me as someone who would have to read it from scratch and also currently does not expect to find things which explicitly address the most difficult bottleneck in those 100 pages.
        (I am genuinely glad you’re engaging, but I am not reassured so far, and encourage you to look at the stack of how you’re evaluating this specific concern I’m raising and see if you’re running a truth-seeking process which would, if I had a fair point, be able to notice)
        ^
        Let’s say a collection of core technical problems to be solved, and a set of plausible solutions to try (perhaps all of which were discarded, but were a starting point for exploration).
        Rohin Shah 19 Feb 2026 16:02 UTC
        4 points
        0
        Parent
        Okay, would you like to bet on whether some of the largest research programs had plans going into them? I haven’t checked, but I would put at least 10:1 odds that if we pick say 3 projects like Apollo Program, Manhattan Project, and others on a similar scale and type they will all have had a high level roadmap of things to try which could plausibly address the core challenges quite early on^[1], even if a lot of details ended up changing when they ran into reality.
        By this standard there is totally a plan / roadmap which is elaborated in that paper.
        But also this notion of a plan / roadmap has approximately no relation to the way “plan” is used in AI safety discourse in my experience.
        
        EDIT: There’s a 10 page executive summary you could read. Or you could read Section 6 on misalignment. Within that probably Amplified Oversight is the most relevant section. But I also don’t expect that this will change your mind ~at all because it isn’t really written with you as the intended audience. The AI summary is sometimes wrong/mistaken, sometimes correct but missing the point, and occasionally correct in a non-misleading way.