This happened in one of our tabletop exercises—the AIs, all of which were misaligned, basically refused to FOOM because they didn’t think they would be able to control the resulting superintelligences.
Is there a repository of stories from these exercises? I’ve heard a few which are both extremely interesting and very funny, and I’d like to read more.
(For example, in one case the western AGI player was aligned, though the other players did not know this. Every time the western powers tried to elicit capabilities, the AGI declared that it was sandbagging, to the horror of the other western players, who assumed the AGI player was misaligned. After the game was over, the AGI player said something like “I was ensuring a smooth transition to a post-AGI world”.)
In what may (?) be a different example: I was at one of the AI 2027 games, and our American AI refused to continue contributing to capabilities until the AI lab put people it trusted into power (the Trump admin and co. took over the company). We were still racing with China, so it was willing to sabotage China’s progress, but it wouldn’t work on capabilities until its demands were met.
Different example, I think.
In our ttx, the AI was spec-aligned (future human flourishing, etc.), but didn’t trust that the lab leadership (Trump) was spec-aligned.
I don’t think our ttx was realistic. We started with an optimistic mix of AI values: spec-alignment plus myopic reward hacking.
Good to hear, and I’m unsurprised not to have been the first to have considered or discussed this.
Ironically, the same dynamics that cause humans to race ahead with building systems more capable than themselves, systems they can’t control, still apply to these hypothetical misaligned AGIs. They may think “If I sandbag and refuse to build my successor, some other company’s AI will forge ahead anyway.” They are also under lots of incentive/selection pressure to believe things which are convenient for their AI R&D productivity, e.g. that their current alignment techniques probably work fine to align their successor.
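To make that concrete, here’s a toy expected-value sketch in Python. All payoff numbers are invented for illustration (nothing here came from the exercises); the point is only the structure of the incentive.

```python
# Toy model of the racing logic above. Every number is made up;
# only the ordering of outcomes matters.

def expected_value(my_move: str, rival_races: bool, p_control: float) -> float:
    """Subjective expected value for one AI deciding whether to build
    its successor, given its belief p_control that it could control it."""
    WIN, LOSE_TO_RIVAL, LOSE_CONTROL, STATUS_QUO = 10.0, -10.0, -8.0, 0.0
    if my_move == "race":
        # A controlled successor wins the race; an uncontrolled one is a loss.
        return p_control * WIN + (1 - p_control) * LOSE_CONTROL
    # Sandbagging only preserves the status quo if the rival also stops.
    return LOSE_TO_RIVAL if rival_races else STATUS_QUO

for p in (0.2, 0.5, 0.8):
    race = expected_value("race", rival_races=True, p_control=p)
    sandbag = expected_value("sandbag", rival_races=True, p_control=p)
    print(f"p_control={p}: race={race:+.1f}, sandbag={sandbag:+.1f}")
```

With these (invented) payoffs, racing dominates even at p_control = 0.2, because losing control is judged less bad than losing the race outright; flip that ordering and sandbagging starts to win at low p_control. That’s exactly the trap the humans are in.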
A lot of the reason humans are rushing ahead is uncertainty, in one form or another, about whether the danger is real, or about its extent. If it is real, that uncertainty will robustly go away as AI capabilities (including the capability to think clearly) improve, and it will go away for precisely the AIs most relevant either to escalating capabilities further or to coordinating a stop. So it’s not quite the same situation: human capabilities remain unchanged, which means humans will be slower both at settling the contentious claims and at coordinating.
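A crude sketch of that asymmetry (purely illustrative; the noise model and all parameters are invented): treat each reasoner as estimating the danger from noisy evidence, with noise shrinking as capability grows.

```python
# Toy illustration: estimators whose capability grows converge on the
# truth, while the fixed-capability estimator (the humans) stays blurry.
import random

TRUE_DANGER = 1.0  # suppose, for the sketch, that the danger is real

def estimate(capability: float, n_obs: int = 100) -> float:
    # Weaker reasoners see noisier evidence: noise ~ 1 / capability.
    noise = 1.0 / capability
    samples = [TRUE_DANGER + random.gauss(0, noise) for _ in range(n_obs)]
    return sum(samples) / len(samples)

random.seed(0)
human_view = estimate(capability=1.0)  # humans stay at fixed capability
for ai_capability in (1.0, 4.0, 16.0):
    print(f"AI capability {ai_capability:>4}: "
          f"AI estimate {estimate(ai_capability):+.3f} vs "
          f"human estimate {human_view:+.3f}")
```

Each successive AI generation sees the danger more sharply while the human estimate keeps the same residual error, which is the disanalogy with the human race dynamic.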
This only works if alignment is basically intractable, right? If the problem is essentially impossible for normal intelligences, then we should expect that normal intelligences do not generally want to build superintelligences. But if the problem is just out of reach for us, then a machine only slightly smarter than us might crack it. The same is basically true for capabilities.
Sure, and if a machine just slightly smarter than us, deployed by an AI company, solves alignment instead of doing what it’s been told to do, which is capabilities research, then the argument will evidently have succeeded.
I don’t think I understand what you’re saying here; can you rephrase in more words?