My current model is that the VNM theorems are the best available theorems for modeling rational agents. Insofar as that’s accurate, it’s correct to say that they’re not the final theorems, but it’s kind of anti-helpful to throw out their conclusions? This seems similar to saying that there are holes in Newton’s theory of gravity, therefore choosing to throw out any particular prediction of the theory. It still seems like it’s been foundational for building game theory and microeconomic modeling and tons of other things, and so it’s very important to note, if it is indeed the case, that the implications for AI are “human extinction”.
This seems similar to saying that there are holes in Newton’s theory of gravity, therefore choosing to throw out any particular prediction of the theory.
Newton’s theory of gravity applies to high precision in nearly every everyday context on Earth, and when it doesn’t, we can show that it doesn’t, so we need not worry that we are misapplying it. By contrast, there are routine and substantial deviations from utility-maximizing behavior in the everyday life of the only intelligent agents we know of — all intelligent animals and LLMs — and there are other principles, such as deontological rule-following or shard-like contextually-activated action patterns, that are more explanatory for certain very common behaviors. Furthermore, we don’t have simple hard-and-fast rules that let us say with confidence when we can apply one of these models, unlike the case with gravity.
If someone wanted to model human behavior with the VNM axioms, I would say: let’s first check the context and whether the many known and substantial deviations from VNM’s predictions apply; if not, we may use them, but cautiously, recognizing that we should take any extreme prediction about human behavior — such as that people would violate strongly-held deontological principles for tiny (or even large) gains in nominal utility — with a large serving of salt, rather than confidently declaring that the prediction will definitely be right in such a scenario.
it’s very important to note, if it is indeed the case, that the implications for AI are “human extinction”.
Agreed, and noted. But the question here is the appropriate level of confidence with which those implications apply in these cases.
My current model is that the VNM theorems are the best available theorems for modeling rational agents.
Actually I don’t agree with that, unless you define rationality in a circular way, where you focus on what’s roughly in line with the assumptions.
To avoid possible confusion about words: I don’t think VNM is that useful for modelling powerful and smart agents in this universe. The VNM axioms don’t describe humans, states, or corporations well, and they don’t describe LLMs well either.
To give a simple example of better formal math: information-theoretic bounded rationality. This is still quite VNM-like, but at least it acknowledges the fact that in this universe, negentropy is not free. Without this fact, nothing makes sense.
As an example of not making sense: if VNM is so great, and evolution discovered agents and optimized them a lot, why are animals so VNM-unlike? I guess part of the answer must clearly be that computation is not free, and a VNM agent is extremely computationally hungry, in a sense bigger than the universe it is in. But negentropy is not free. This does not mean VNM agents would not work well in toy universes with 3 dimensions, or in universes with free computation.
(Hot take sidenote: I suspect you can learn more about intelligence and powerful and smart agents in this universe if you start from just “negentropy is not free” than when starting from VNM.)
I don’t think ITBR is the final answer, but at least it is barking up a somewhat better tree.
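To make the ITBR flavor concrete, here is a minimal sketch of its central object (illustrative numbers and a hypothetical helper of my own, not something from the discussion above): a policy that trades expected utility against an information cost for deviating from a prior, controlled by an inverse temperature beta. With beta going to infinity you recover the idealized argmax chooser; with finite beta the agent effectively pays for deliberation.

```python
import numpy as np

def bounded_rational_policy(utilities, prior, beta):
    """Bounded-rational choice in the information-theoretic sense:
    p*(a) is proportional to prior(a) * exp(beta * U(a)).

    beta trades utility against the KL "deliberation cost" of moving away
    from the prior: beta -> 0 keeps the prior (no computation spent),
    beta -> infinity recovers the argmax chooser that idealized
    expected-utility maximization assumes.
    """
    logits = np.log(np.asarray(prior, dtype=float)) + beta * np.asarray(utilities, dtype=float)
    logits -= logits.max()  # numerical stability before exponentiating
    p = np.exp(logits)
    return p / p.sum()

# Purely illustrative: three actions, uniform prior.
U = [1.0, 0.9, 0.0]
prior = [1 / 3, 1 / 3, 1 / 3]
for beta in (0.0, 2.0, 50.0):
    print(beta, bounded_rational_policy(U, prior, beta).round(3))
```

The point of the sketch is only that the departure from argmax is not noise: it is exactly the term that prices the fact that negentropy is not free.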
Yes, VNM has been foundational for game theory. Also … I think one deep lesson people learn when understanding game theory deeply is something like “single-shot prisoner’s dilemmas do not exist”. The theory is trying to be a minimal abstraction of reality, and it probably succeeds “too much”, in the sense that it abstracts away so much that basically always some critical feature of reality is missing, and the math does not match what is happening. This does not preclude the theory being influential, but what people actually do is often something like asking “classical game theory clearly mis-predicts what is happening, so let’s try to figure out what it ignores that actually can’t be ignored, and write a paper about that”.
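As an illustration of the “single-shot prisoner’s dilemmas do not exist” point, here is the standard textbook arithmetic with the usual illustrative payoffs (my own sketch, not part of the original comment): in the one-shot matrix defection dominates, but once the interaction repeats with a high enough discount factor, cooperating against a grim-trigger opponent becomes the better strategy, so the one-shot abstraction discards exactly the feature of reality that usually decides the outcome.

```python
# Standard prisoner's dilemma payoffs (illustrative numbers): T > R > P > S.
T, R, P, S = 5.0, 3.0, 1.0, 0.0

# One-shot game: defection strictly dominates regardless of what the other player does.
print("vs a cooperator:", "defect" if T > R else "cooperate")
print("vs a defector:  ", "defect" if P > S else "cooperate")

# Repeated game with discount factor delta against a grim-trigger opponent:
# cooperating forever pays R / (1 - delta); a one-off defection pays T now
# and P forever after. Cooperation is worthwhile iff delta >= (T - R) / (T - P).
threshold = (T - R) / (T - P)
for delta in (0.3, 0.5, 0.9):
    cooperate_value = R / (1 - delta)
    defect_value = T + delta * P / (1 - delta)
    verdict = "cooperate" if cooperate_value >= defect_value else "defect"
    print(f"delta={delta}: {verdict} (threshold {threshold})")
```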
Yes, it has been foundational to econ. My impression is that over something like the last 40 years, in the part of econ closest to agent foundations, part of the work was on how people are not VNM, or even on why what people do makes sense even though it is not VNM.
To end with what actually matters: my guess is that the most relevant ways in which VNM is likely off are that it does not handle compositionality well, and it does not handle preferences about internal computations. (More of this discussion is in the post and comments of Is “VNM-agent” one of several options, for what minds can grow up into?) Unfortunately, describing compositionality and preferences over internal computations seems really critical for the specific problem.
On the physics comparisons
I think VNM per se makes far fewer predictions about reality than Newtonian gravity, and often when it does seem to make some close to “first principles”, they don’t seem to match observation. For example, based on VNM, one would assume smart people don’t update what they want based on evidence, just their beliefs. But this contradicts phenomenological experience.
A different physics comparison may be something like black-body radiation. It is possible to describe it classically using the equipartition theorem, and yes, this partially works in some domains, but it is also clearly broken and predicts the ultraviolet catastrophe. I do agree that throwing out arbitrary predictions of the theory would not be a good habit if I didn’t have fully worked-out quantum mechanics, but I think this is a different case, where it’s very reasonable to doubt the predictions of the theory that seem strongly correlated with its predicting the UV catastrophe. (I’m also not that happy with this comparison.)
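For reference, the concrete content of that comparison, in standard textbook form (not anything specific to this discussion): classical equipartition gives the Rayleigh–Jeans law, whose integral over frequency diverges, while Planck’s law agrees with it at low frequencies and removes the divergence.

```latex
% Rayleigh–Jeans law (classical equipartition): spectral radiance grows like \nu^2,
% so \int_0^\infty B_\nu \, d\nu diverges: the ultraviolet catastrophe.
B_\nu^{\mathrm{RJ}}(T) = \frac{2 \nu^2 k_B T}{c^2}

% Planck's law: reduces to Rayleigh–Jeans when h\nu \ll k_B T, but the exponential
% suppression at high frequencies removes the divergence.
B_\nu^{\mathrm{Planck}}(T) = \frac{2 h \nu^3}{c^2}\,\frac{1}{e^{h\nu/(k_B T)} - 1}
```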