That is, we will have one opportunity to align our superintelligence. That’s why we’ll fail. It’s almost impossible to succeed at a difficult technical challenge when we have no opportunity to learn from our mistakes. But this rests on another implicit claim: Currently existing AIs are so dissimilar to the thing on the other side of FOOM that any work we do now is irrelevant.
I think this is an explicit claim in the book, actually? I think it’s at the beginning of chapter 10. (It also appears in the story of Sable, where the AI goes rogue because it does a self-modification that creates such a dissimilarity.)
I think “irrelevant” is probably right, but something like “insufficient” might be clearer. The book describes people working in interpretability as heroes—in the same paragraph as it points out that being able to see that your AI is thinking naughty thoughts doesn’t mean you’ll be able to design an AI that doesn’t think naughty thoughts.