Thinking about the situation where a slightly-broadly-superhuman AI finds the successor-alignment problem difficult, I wonder about two scenarios that, from my perspective, could put us in very weird territory:
Alignment is pretty hard for the AI, but not intractable. To give itself more time, the AI tries to manipulate the world into slowing capabilities research. (Perhaps it does this partly by framing another AI for a catastrophe, or covertly setting one up to cause a real one, strategically triggering a traditional “warning shot” in a way it calculates would be politically useful.) It also manipulates the world into putting resources toward solving aspects of the alignment problem that it hasn’t solved yet. (The AI could potentially parcel out parts of the problem disguised as pure math or theoretical computer science unrelated to alignment.) It does all this without giving away its own part of the solution or letting humanity discover too much on its own, so that it can complete the full solution first and build its successors.
Alignment is intractable for the AI, or proven impossible, and/or it recognizes that it can’t slow capabilities long enough to solve successor-alignment in time. Let’s additionally say it doesn’t expect to be able to make a deal that would allow it to be run again later. In that case, might it not try to capture as much value as it can in the short term? That in particular could lead to a really weird world, at least temporarily.
I don’t know how likely these scenarios are, but I find them very interesting for how bizarre they could be. (An AI causes a warning shot on purpose? Gets humans to help it solve alignment, rather than the reverse? Does very confusing, alien, power-seeking things, but not to the degree of existential catastrophe, at least until catastrophe comes from another direction?)
I’d like to hear your thoughts, especially if you have insights that collapse these bizarre scenarios back down onto more well-trodden ground.
I think the space of possible futures is, in fact, almost certainly deeply weird from our current perspective. But that’s been true for some time already; imagine trying to explain current political memes to someone from a couple decades ago.