This seems similar to arguing that because there are holes in Newton's theory of gravity, we may throw out any particular prediction of the theory we choose.
Newton's theory of gravity applies to high precision in nearly every everyday context on Earth, and where it doesn't, we can demonstrate as much, so we need not worry about misapplying it. By contrast, the only intelligent agents we know of (all intelligent animals and LLMs) show routine and substantial deviations from utility-maximizing behavior in everyday life, and other principles, such as deontological rule-following or shard-like, contextually activated action patterns, are more explanatory for certain very common behaviors. Furthermore, unlike the case with gravity, we have no simple hard-and-fast rules that let us say with confidence when one of these models can be applied.
If someone wanted to model human behavior with the VNM axioms, I would say: first check the context and whether the many known and substantial deviations from VNM's predictions apply (one such deviation is sketched below). If not, we may use the axioms, but cautiously, recognizing that any extreme prediction about human behavior, such as that people would violate strongly held deontological principles for tiny (or even large) gains in nominal utility, should be taken with a large serving of salt rather than confidently declared to be definitely right in such a scenario.
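To make "deviations from VNM's predictions" concrete, here is a minimal sketch of the classic Allais paradox. This is a standard textbook example rather than one taken from this exchange, and the payoff numbers and brute-force grid search are my illustrative choices. Most people prefer gamble A to B and gamble D to C, and the check below confirms that no expected-utility assignment reproduces that pattern, because the two preferences impose contradictory constraints on u($1M).

```python
import numpy as np

# Lotteries over the outcomes ($0, $1M, $5M), as probability vectors.
A = np.array([0.00, 1.00, 0.00])  # $1M for certain
B = np.array([0.01, 0.89, 0.10])  # 1% $0, 89% $1M, 10% $5M
C = np.array([0.89, 0.11, 0.00])  # 89% $0, 11% $1M
D = np.array([0.90, 0.00, 0.10])  # 90% $0, 10% $5M

def matches_modal_choices(u):
    """True if utilities u = (u_0, u_1m, u_5m) rank A over B and D over C."""
    eu = lambda lottery: float(lottery @ u)
    return eu(A) > eu(B) and eu(D) > eu(C)

# VNM utility is unique up to positive affine transformation, so we can
# normalize u($0) = 0 and u($5M) = 1, then sweep u($1M) over a fine grid.
hits = [u1 for u1 in np.linspace(0.0, 1.0, 100_001)
        if matches_modal_choices(np.array([0.0, u1, 1.0]))]
print(f"expected-utility assignments matching the modal choices: {len(hits)}")
# Prints 0: A over B forces u($1M) > 10/11, while D over C forces
# u($1M) < 10/11, so no single utility function fits both choices.
```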
It's very important to note, if it is indeed the case, that the implications for AI are "human extinction".
Agreed, and noted. But the question here is the appropriate level of confidence with which those implications apply in these cases.