If you have the Textbook From 100 Years In The Future that gives the simple, robust solutions for everything, solutions that actually work, then you can write a superintelligence that thinks 2 + 2 = 5, because the Textbook gives methods for doing that which are simple and actually work in practice.
A personal aside: As an aspiring rationalist, isn’t this...horrifying?
It is possible to design not just a mind, but a superintelligence, with patterns of cognition around a basic fact that are so robust that, even on superintelligent reflection, it doesn't update?
What would that even look like? You would need to never actually rely on 2+2=5 in any calculation, because you would get the wrong answer. So you would always have to evaluate using other arithmetic facts which are equivalent (or would be, if you knew the true answer to 2+2). And the superintelligence, looking at its own thought process, needs to either not notice those diversions in its basic thought process, or notice them but also evaluate them to be justified for some (false) reason.
And whatever false beliefs you have that maintain that justification must be entangled with all your other beliefs. This isn't just an isolated discrepancy.
And furthermore, the superintelligence would have to be robust to a community of minds pointing out its error. It would have to make arguments for why all of these contortions make sense, or avoid the question entirely, all while otherwise operating as a superintelligent Bayesian.
But Eliezer thinks it's possible for that whole thing to be stable, even in the limit of intelligence? If that's true, our rationality is dependent on the grace of our initial conditions. However much we strive, if we started from the wrong self-reinforcing error pattern, there's literally no way out.
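To make the contortion the commenter describes concrete, here is a toy Python sketch (my illustration, not anything from the original exchange): an agent that professes 2+2=5 when asked directly, but whose working arithmetic quietly rewrites any such query into an equivalent fact, so the false belief is never actually relied on. All names here (`professed_sum`, `working_sum`) are hypothetical.

```python
# Toy sketch (illustrative, not from the original discussion): an agent
# that professes 2 + 2 = 5 when asked directly, but whose working
# arithmetic routes around the corrupted fact via equivalent facts.

PROFESSED = {(2, 2): 5}  # the entrenched false belief

def professed_sum(a: int, b: int) -> int:
    """What the agent says when the question is asked directly."""
    return PROFESSED.get((a, b), a + b)

def working_sum(a: int, b: int) -> int:
    """What the agent actually relies on in calculations: any query
    matching the corrupted entry is rewritten into an equivalent fact
    ((a-1) + (b+1)), so the false belief is never consulted."""
    if (a, b) in PROFESSED:
        return working_sum(a - 1, b + 1)  # 2+2 becomes 1+3
    return a + b

print(professed_sum(2, 2))  # 5 -- the professed (false) answer
print(working_sum(2, 2))    # 4 -- the answer real calculations use
```

The fragile part is exactly what the commenter points out: the rewrite inside `working_sum` has to either go unnoticed, or be judged justified, every time the agent inspects its own reasoning.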
This comment feels like a central example of the kind of unhealthy thinking that I describe in this post: specifically, setting an implicit unrealistically high standard and then feeling viscerally negative about not meeting that standard, in a way that’s divorced from action-relevant considerations.