A thing I’ve been thinking about lately (this is reposted from a twitter thread where it was more squarely on-topic, but seems like a reasonable part of the convo here, riffing off the Tsvi thread):
It matters a fair amount which biases people have, here.
A few different biases pointing in the “Plan 2 for bad reasons” direction:
1. a desire for wealth.
2. a desire to not look weird in front of your friends.
3. a desire to “be important.”
4. subtly different from #3, a desire to “have some measure of control over the big forces playing out.”
5. a desire to be high status in the world’s Big League status tier.
6. action bias, i.e. inability to do nothing.
7. bias against abstract arguments that you can’t clearly see/measure, or against sitting with confusion.
8. bias to think things are basically okay and you don’t need to majorly change your life plans.
9. being annoyed at people who keep trying to stop you or make you feel bad or be lower status.
10. being annoyed at people who seem to be missing an important point when they argue with you about AI doom.
All of these seem in-play to me. But depending on these things’ relative strength, they suggest different modes of dealing with the problem.
A reason I am optimistic about If Anyone Builds It is that I think it has a decent chance of changing how reasonable it feels to say “yo guys I do think we might kill everyone” in front of both your friends and high-status bigwigs.
This won’t be sufficient to change decisionmaking at labs, or people’s propensity to join labs. But I think the next biggest bias is more like “feeling important/in-control” than “having wealth.”
I view this all pretty cynically. BUT, not necessarily pessimistically. If IABIED works, then the main remaining blocker is “having an important/in-control thing to do, which groks some arguments that are more abstract.”
You don’t have to get rid of people’s biases, or defeat them memetically (although those are both live options too). You can also steer towards a world where their biases become irrelevant.
So, while I do really wanna grab people by the collar and shout:
“Dudes, Dario is one of the most responsible parties for causing the race conditions that Anthropic uses to justify their actions, and he lied or was grossly negligent about whether Anthropic would push the capabilities frontier. If your ‘well Plan 2 seems more tractable’ attitude doesn’t include ‘also, our leader was the guy who gave the current paradigm to OpenAI, then left OpenAI, gained early resources via deception/communication-negligence and caused the current race to start in earnest’ you have a missing mood and that’s fucked.”
...I also see part of my goal as trying to help the “real alignment work” technical field reach a point where the stuff-that-needs-doing is paradigmatic enough that you can just point at it, and the action-biased-philosophy-averse lab “safety” people can just say “oh, sure it sounds obvious when you put it like that, why didn’t you say that before?”
Thanks for writing that list!
This seems extremely unrealistic to me. Not sure how you imagine that might work.
I am maybe starting from the assumption that sooner or later, alignment research would reach this point, and “well, help alignment research progress as fast as possible” seemed like a straightforward goal on the meta-level, and is one of the obvious things to be shooting for whether or not it’s currently tractable.
I have a current set of projects, but, the meta-level one is “look for ways to systematically improve people’s ability to quickly navigate confusing technical problems, and see what works, and stack as many interventions as we can.”
(I can drill into that, but I’m not sure what level your skepticism was on)
Yeah, so I think I read all the posts you wrote about that in the last 2 years.
I think such techniques, and the meta-skill of deriving more of them, are very important for making good alignment progress, but I still think it takes geniuses. (In fact, I sorta took my shot at the alignment problem over the last 3ish years, during which I often reviewed what I could do better, and started developing a more systematic, sciency approach to studying human minds. I’m now pivoting to working on Plan 1, though, because there’s not enough time left.)
Like, currently it takes some special kind of genius to make useful progress. Maybe we could try to train the smartest young supergeniuses in the techniques we currently have, and maybe they could then make progress much faster than me or Eliezer. Or maybe they still wouldn’t be able to judge what is good progress.
If you can actually get supergeniuses you could try to train, that would obviously be quite useful to do, even though it likely won’t get done in time. And as long as they don’t end up running away with their own dumb idea for solving alignment without understanding the systems they are dealing with, it would still be great for needing less time after the ban.
(Steven Byrnes’ agenda seems potentially more scalable with good methodology, and it has the advantage that progress there is relatively less exfohazardous, but it would still take very smart people (and relatively long study) to make progress. It won’t get done in time either, so you still need international coordination to not build ASI.)
But the way you seemed to motivate your work in your previous comment sounded more like “make current safety researchers do more productive work so we might actually solve alignment without international coordination”. That seems very difficult to me: I think they are not even tracking many problems that are actually rather obvious, or not getting the difficulties that are easy to get. People somehow often have a very hard time understanding the relevant concepts here. E.g. even special geniuses like John Wentworth and Steven Byrnes made bad attempts at attacking corrigibility where they misunderstood the problem (1, 2), although that’s somewhat cherry-picked and may be fixable. I mean, it’s not that such MIRI-like research is necessarily needed, but still. Though I’m still curious how, more precisely, you imagine your project might help here.
My overall plan is:
1. try for as much international coordination as we can get.
(this is actually my main focus right now; I expect to pivot back towards intellectual-progress stuff once I feel like we’ve done all the obvious things for waking up the world)
2. try to make progress as fast as we can on the technical bits.
3. probably we don’t get either as much coordination or as much technical progress as we want, but, like, try our best on both axes and hope it’s enough.
I’m not sure whether you need literal geniuses, but I do think you need a lot of raw talent for it to plausibly work. I’m uncertain whether, say, if you need to be “a level 8 researcher” to make nonzero progress on the core bottlenecks, it’s possible to upgrade a level 6 or 7 person to level 8, or whether level 8s are basically born, not made, and the only hope is in finding interventions that bump 8s into 9s.
I frequently see people who seem genius-like who are nonetheless kinda metacognitively dumb, where it seems like there’s low-hanging fruit, so I feel like either world is tractable, but they are pretty different.