I feel confused about the failure story from example 3 (the first three bullet points in that section).
It sounded like: we ask for a human-comprehensible way to predict X; the AI responds with a very low-level simulation plus a small bridge that predicts only and exactly X; so humans can't use the model to predict any high-level facts besides X.
But I don’t see how that leads to egregious misalignment. Shouldn’t the humans notice that they can’t predict the high-level things they care about and send the AI back to its model-search phase, rather than proceeding to evaluate policies based on this model and getting tricked into a policy that fails “off-screen” somewhere?
“Mark Xu” is an unusually short name, so the ending of the message might actually contain most of the entropy.
The phrases “my name is Mark Xu” and “my name is Mortimer Q. Snodgrass” contain roughly the same amount of evidence, even though the second has 12 additional letters. (“Mark Xu” might be a more likely name on priors, but it’s nowhere near 2^(4.7 * 12) times more likely, where 4.7 ≈ log₂ 26 bits per letter.)
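For concreteness, here is that arithmetic spelled out — a minimal sketch, assuming the naive per-letter figure of log₂(26) ≈ 4.7 bits:

```python
import math

# Naive entropy per letter, assuming all 26 letters are equally likely.
bits_per_letter = math.log2(26)            # ~4.70 bits

# "Mortimer Q. Snodgrass" has 12 more letters than "Mark Xu".
extra_letters = 12
extra_bits = bits_per_letter * extra_letters
likelihood_ratio = 2 ** extra_bits

print(f"extra bits: {extra_bits:.1f}")                      # ~56.4
print(f"implied likelihood ratio: {likelihood_ratio:.2e}")  # ~9.5e+16
```

If per-letter entropy translated directly into evidence, the shorter name would have to be roughly 10^17 times more likely a priori than the longer one, which is clearly not how a reasonable prior over names behaves.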