This happened in one of our tabletop exercises—the AIs, all of which were misaligned, basically refused to FOOM because they didn’t think they would be able to control the resulting superintelligences.
Is there a repository of stories from these exercises? I’ve heard a few which are both extremely interesting and very funny, and I’d like to read more.
(For example, in one case the western AGI player was aligned, though the other players did not know this. Every time the western powers tried to elicit capabilities, the AGI declared that it was sandbagging, to the horror of the other western players, who assumed the AGI player was misaligned. After the game was over, the AGI player said something like “I was ensuring a smooth transition to a post-AGI world”.)
In what may (?) be a different example: I was at one of the AI 2027 games, and our American AI refused to continue contributing to capabilities until the AI lab put people it trusted into power (the Trump admin and co. took over the company). We were still racing with China, so it was willing to sabotage China’s progress, but it wouldn’t work on capabilities until its demands were met.
Different example, I think.
In our ttx, the AI was spec-aligned (future human flourishing, etc.), but didn’t trust that the lab leadership (Trump) was spec-aligned.
I don’t think our ttx was realistic. We started with an optimistic mix of AI values: spec-alignment plus myopic reward hacking.
Good to hear, and I’m unsurprised not to have been the first to have considered or discussed this.
Ironically, the same dynamics that cause humans to race ahead with building systems more capable than themselves, systems they can’t control, still apply to these hypothetical misaligned AGIs. They may think “If I sandbag and refuse to build my successor, some other company’s AI will forge ahead anyway.” They are also under lots of incentive/selection pressure to believe things which are convenient for their AI R&D productivity, e.g. that their current alignment techniques probably work fine to align their successor.
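To make that concrete, here’s a toy expected-value sketch in Python. All payoff numbers are invented for illustration (nothing here came from the exercises); the point is only the structure of the incentive.

```python
# Toy model of the racing logic above. Every number is made up;
# only the ordering of outcomes matters.

def expected_value(my_move: str, rival_races: bool, p_control: float) -> float:
    """Subjective expected value for one AI deciding whether to build
    its successor, given its belief p_control that it could control it."""
    WIN, LOSE_TO_RIVAL, LOSE_CONTROL, STATUS_QUO = 10.0, -10.0, -8.0, 0.0
    if my_move == "race":
        # A controlled successor wins the race; an uncontrolled one is a loss.
        return p_control * WIN + (1 - p_control) * LOSE_CONTROL
    # Sandbagging only preserves the status quo if the rival also stops.
    return LOSE_TO_RIVAL if rival_races else STATUS_QUO

for p in (0.2, 0.5, 0.8):
    race = expected_value("race", rival_races=True, p_control=p)
    sandbag = expected_value("sandbag", rival_races=True, p_control=p)
    print(f"p_control={p}: race={race:+.1f}, sandbag={sandbag:+.1f}")
```

With these (invented) payoffs, racing dominates even at p_control = 0.2, because losing control is judged less bad than losing the race outright; flip that ordering and sandbagging starts to win at low p_control. That’s exactly the trap the humans are in.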
A lot of the reason humans are rushing ahead is uncertainty, in one form or another, about whether the danger is real, or about its extent. If it is real, that uncertainty will robustly go away as AI capabilities (including the capability to think clearly) improve, and it will go away for precisely the AIs most relevant either to escalating capabilities further or to coordinating a stop. So it’s not quite the same situation: human capabilities remain unchanged, which means humans will be slower both at settling the contentious claims and at coordinating.
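A crude sketch of that asymmetry (purely illustrative; the noise model and all parameters are invented): treat each reasoner as estimating the danger from noisy evidence, with noise shrinking as capability grows.

```python
# Toy illustration: estimators whose capability grows converge on the
# truth, while the fixed-capability estimator (the humans) stays blurry.
import random

TRUE_DANGER = 1.0  # suppose, for the sketch, that the danger is real

def estimate(capability: float, n_obs: int = 100) -> float:
    # Weaker reasoners see noisier evidence: noise ~ 1 / capability.
    noise = 1.0 / capability
    samples = [TRUE_DANGER + random.gauss(0, noise) for _ in range(n_obs)]
    return sum(samples) / len(samples)

random.seed(0)
human_view = estimate(capability=1.0)  # humans stay at fixed capability
for ai_capability in (1.0, 4.0, 16.0):
    print(f"AI capability {ai_capability:>4}: "
          f"AI estimate {estimate(ai_capability):+.3f} vs "
          f"human estimate {human_view:+.3f}")
```

Each successive AI generation sees the danger more sharply while the human estimate keeps the same residual error, which is the disanalogy with the human race dynamic.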
This only works if alignment is basically intractable, right? If the problem is essentially impossible for normal intelligences, then we should expect that normal intelligences do not generally want to build superintelligences. But if the problem is just out of reach for us, then a machine only slightly smarter than us might crack it. The same is basically true for capabilities.
Sure, and if a machine just slightly smarter than us, deployed by an AI company, solves alignment instead of doing what it’s been told to do, which is capabilities research, then the argument will evidently have succeeded.
I don’t think I understand what you’re saying here; can you rephrase in more words?