Okay, so now having thought about this a bit...
At first I read this and was like, “I’m confused – isn’t this what the whole agent foundations agenda is for? Like, I know there are still kinks to work out, and some of those kinks are major epistemological problems. But… I thought this specific problem was not actually that confusing anymore.”
“Don’t have your AGI go off and do stupid things” is a hard problem, but it seemed to basically be restating “the alignment problem is hard, for lots of finicky, confusing reasons.”
Then I realized: “holy christ, most AGI research isn’t built on the agent foundations agenda, and people regularly say ‘well, MIRI is doing cute math things, but I don’t see how they’re actually relevant to the real AGI we’re likely to build.’”
Meanwhile, I have several examples in mind of real humans who fell prey to something similar to commitment-race concerns: groups of people who mutually grim-triggered each other because they were coordinating on slightly different principles. (And these were humans who were trying to be rationalist, and even agent-foundations-based.)
So, yeah, it actually seems pretty likely that many AGIs humans might build would accidentally fall into these traps.
So now I have a vague image in my head of a rewrite of this post that ties together some combo of:
- The specific concerns noted here
- The rocket alignment problem (“hey man, we really need to make sure we’re not fundamentally confused about agency and rationality”)
- Possibly some other specific agent-foundations-esque concerns
Weaving those into a central point of:
“If you’re the sort of person who’s like, ‘Why is MIRI even helpful? I get how they might be helpful, but they seem more like a weird hail-mary, or a “might as well, given that we’re not sure what else to do” thing’… here is a specific problem you might run into if you didn’t have a very thorough understanding of robust agency when you built your AGI. This doesn’t (necessarily) imply any particular AGI architecture, but if you didn’t have a specific plan for how to address these problems, you are probably going to get them wrong by default.”
(This post might already exist somewhere, but these ideas just clicked together in my mind in a way they hadn’t previously. I don’t feel like I have the ability to write up the canonical version of this post, but I feel like “someone with a better understanding of all the underlying principles” should.)