My impression is that what this mostly illustrates is:
- VNM rationality is a dead end: if your “toy environment” has VNM rationality and a beliefs/goals decomposition baked in as assumptions, it makes the problem something between hard to reason about and unsolvable.
- Despite an attempt to make the book not rely on (dis-)continuity assumptions, these are so deeply baked into the authors’ reasoning that they shine through in a very large fraction of the arguments, if you look behind the surface.
My impression is that a lot of the confusion about the MIRI worldview comes from an inability to understand why others don’t trust the VNM formalism and VNM convergence, and why others understand the discontinuity assumptions and still don’t buy them.
My current model is that the VNM theorems are the best available theorems for modeling rational agents. Insofar as that’s accurate, it’s correct to say that they’re not the final theorems, but it’s kind of anti-helpful to throw out their conclusions? This seems similar to saying that there are holes in Newton’s theory of gravity, therefore choosing to throw out any particular prediction of the theory. It still seems like it’s been foundational for building game theory and microeconomic modeling and tons of other things, and so it’s very important to note, if it is indeed the case, that the implications for AI are “human extinction”.
This seems similar to saying that there are holes in Newton’s theory of gravity, therefore choosing to throw out any particular prediction of the theory.
Newton’s theory of gravity applies to high precision in nearly every everyday context on Earth, and when it doesn’t, we can show that it doesn’t, so we need not worry that we are misapplying it. By contrast, there are routine and substantial deviations from utility-maximizing behavior in the everyday life of the only intelligent agents we know of (all intelligent animals and LLMs), and there are other principles, such as deontological rule-following or shard-like, contextually activated action patterns, that are more explanatory for certain very common behaviors. Furthermore, we don’t have simple, hard-and-fast rules that let us say with confidence when we can apply one of these models, unlike the case with gravity.
If someone wanted to model human behavior with the VNM axioms, I would say: first check the context and whether the many known and substantial deviations from VNM’s predictions apply. If they don’t, then we may use the axioms, but cautiously, taking any extreme prediction about human behavior (such as that people would violate strongly held deontological principles for tiny, or even large, gains in nominal utility) with a large serving of salt, rather than confidently declaring that the prediction will definitely be right in such a scenario.
it’s very important to note, if it is indeed the case, that the implications for AI are “human extinction”.
Agreed, and noted. But the question here is the appropriate level of confidence with which those implications apply in these cases.
My current model is that the VNM theorems are the best available theorems for modeling rational agents.
Actually I don’t agree with that, unless you define rationality in a circular way, where you focus on what’s roughly in line with the assumptions.
To avoid possible confusion about words: I don’t think VNM is that useful for modelling powerful and smart agents in this universe. The VNM axioms don’t describe humans, states, or corporations well, and they don’t describe LLMs well.
To give a simple example of better formal math: information-theoretic bounded rationality (ITBR). This is still quite VNM-like, but it at least acknowledges the fact that in this universe, negentropy is not free. Without this fact, nothing makes sense.
As an example of not making sense: if VNM is so great, and evolution discovered agents and optimized them a lot, why are animals so VNM-unlike? I guess part of the answer obviously must be that computation is not free, and a VNM agent is extremely computationally hungry, in a sense bigger than the universe it lives in. But negentropy is not free. This does not mean VNM agents would not work well in toy universes with 3 dimensions, or in universes with free computation.
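For concreteness, here is a minimal sketch of the kind of objective ITBR works with, as I understand the Ortega & Braun free-energy formulation (the numbers and names below are purely illustrative, not from anything above): instead of an unconstrained argmax, the agent maximizes expected utility minus a priced information cost relative to a cheap default policy, with a parameter beta setting how much deliberation it can afford.

```python
import numpy as np

def bounded_rational_policy(utilities, prior, beta):
    """Free-energy form of information-theoretic bounded rationality
    (roughly following Ortega & Braun): choose the action distribution p
    that maximizes  E_p[U] - (1/beta) * KL(p || prior).
    The closed-form optimum is a prior-weighted softmax of the utilities.
    beta prices computation: beta -> 0 just keeps the default policy,
    beta -> infinity recovers the VNM-style argmax."""
    weights = prior * np.exp(beta * np.asarray(utilities, dtype=float))
    return weights / weights.sum()

# Toy example: three actions, with a cheap default habit as the prior.
U = np.array([1.0, 2.0, 2.1])        # illustrative utilities
prior = np.array([0.8, 0.1, 0.1])    # the "free" default behaviour
for beta in (0.0, 1.0, 100.0):
    print(beta, bounded_rational_policy(U, prior, beta).round(3))
```

The only point of the sketch is that “how much thinking you can buy” appears explicitly in the objective, which is exactly the slot the pure VNM picture doesn’t have.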
(Hot take sidenote: I suspect you can learn more about intelligence and powerful and smart agents in this universe if you start from just “negentropy is not free” than when starting from VNM.)
I don’t think ITBR is the final answer, but at least it is barking up a somewhat better tree.
Yes, VNM has been foundational for game theory. Also… I think one deep lesson people learn when they understand game theory deeply is something like “single-shot prisoner’s dilemmas do not exist”. The theory is trying to be a minimal abstraction of reality, and it probably succeeds “too much”, in the sense that it abstracts away so much that basically always some critical feature of reality is missing, and the math does not match what is happening. This does not preclude the theory being influential, but what people actually do is often something like saying “classical game theory clearly mis-predicts what is happening, so let’s try to figure out what it ignores even though it can’t actually be ignored, and write a paper about that”.
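To make the “single-shot dilemmas don’t exist” point concrete, here is a toy sketch of my own (illustrative payoffs, not from the thread): in the one-shot game defection strictly dominates, so the clean math predicts mutual defection; add even a little repetition and a dumb reciprocal strategy, and the prediction flips.

```python
# Toy prisoner's dilemma, row player's payoffs (illustrative numbers).
# One-shot: "D" strictly dominates "C", so the clean prediction is (D, D).
PAYOFF = {("C", "C"): 3, ("C", "D"): 0, ("D", "C"): 5, ("D", "D"): 1}

def tit_for_tat(opponent_history):
    # Cooperate first, then copy the opponent's last move.
    return opponent_history[-1] if opponent_history else "C"

def always_defect(opponent_history):
    return "D"

def play(strategy_a, strategy_b, rounds=10):
    """Iterated play: each strategy only sees the opponent's past moves."""
    hist_a, hist_b, score_a, score_b = [], [], 0, 0
    for _ in range(rounds):
        a, b = strategy_a(hist_b), strategy_b(hist_a)
        score_a += PAYOFF[(a, b)]
        score_b += PAYOFF[(b, a)]
        hist_a.append(a)
        hist_b.append(b)
    return score_a, score_b

print(play(tit_for_tat, tit_for_tat))    # (30, 30): cooperation is stable
print(play(always_defect, tit_for_tat))  # (14, 9): defection wins one round, then stalls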
Yes, it has been foundational for econ. My impression is that for something like the last 40 years, in the part of econ closest to agent foundations, part of the work has been on how people are not VNM, or even on why what people do makes sense even though it is not VNM.
To end with what actually matters: my guess is that the most relevant places where VNM is likely off are that it does not handle compositionality well, and that it does not handle preferences about internal computations. (There is more of this discussion in the post and comments of Is “VNM-agent” one of several options, for what minds can grow up into?) Unfortunately, describing compositionality and preferences over internal computations seems really critical for the specific problem.
On the physics comparisons:
I think VNM per se makes far fewer predictions about reality than Newtonian gravity, and often, when it does seem to make some close to “first principles”, they don’t seem to match observation. For example, based on VNM, one would assume smart people don’t update what they want based on evidence, just their beliefs. But this contradicts phenomenological experience.
A different physics comparison may be something like black-body radiation. It is possible to describe it classically using the equipartition theorem, and yes, it partially works in some domains, but it is also clearly broken and predicts the ultraviolet catastrophe. I do agree that throwing out arbitrary predictions of the theory would not be a good habit when I don’t have fully worked-out quantum mechanics, but I think this is a different case, where it’s very reasonable to doubt the predictions of the theory that seem to be strongly correlated with its predicting the UV catastrophe. (Also not that happy with this comparison.)
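For readers without the physics cached, the shape of that analogy in formulas (standard textbook results, nothing specific to this discussion): the classical equipartition answer matches the quantum one at low frequencies but blows up as frequency grows.

```latex
B^{\mathrm{RJ}}_{\nu}(T) = \frac{2\nu^{2} k_{B} T}{c^{2}}
\qquad\text{vs.}\qquad
B^{\mathrm{Planck}}_{\nu}(T) = \frac{2h\nu^{3}}{c^{2}} \cdot \frac{1}{e^{h\nu/(k_{B}T)} - 1}
```

The two agree when the photon energy is much smaller than the thermal energy, but integrating the classical form over all frequencies gives infinite energy, which is the ultraviolet catastrophe.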
Do you have a link to existing discussion of “VNM rationality is a dead end” that you think covers it pretty thoroughly?
My offhand gesture of a response is “I get the gist of why VNM rationality assumptions are generally not true in real life and you should be careful about what assumptions you’re making here.”
But, it seems like whether the next step is “and therefore the entire reasoning chain relying on them is sus enough you should throw it out” vs “the toy problem is still roughly mapping to stuff that is close-enough-to-true that the intuitions probably transfer” depends on the specifics.
I think I do get why it makes sense to be particularly worried about baking in the belief/goal decomposition assumption.
I assume there has been past argumentation about this, and I’m curious whether you think there is a version of the problem statement that grapples with the generators of what MIRI was trying to do, but without making the mistakes you’re pointing at here.
Explain a non-VNM-rational architecture which is very intelligent, but has goals that are toggleable with a button in a way that is immune to the failures discussed in the article (as well as the related failures).
EJT’s incomplete preferences proposal. But as far as I’m able to make out from the comments, you need to define a decision rule in addition to the utility function of an agent with incomplete preferences, and only some of the ways of doing that are compatible with shutdownability.
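To illustrate what “a decision rule on top of incomplete preferences” can mean, here is a toy sketch of my own (not EJT’s actual construction): one simple rule is to choose from the undominated set, i.e. anything no other option strictly beats, and the way you then resolve the remaining incomparabilities is where shutdownability lives or dies.

```python
def undominated(options, strictly_better):
    """Keep every option that nothing else strictly beats (the 'maximal' set).
    With incomplete preferences there may be several such options, so this is
    a choice *set*, not a single maximizer; a further rule for breaking ties
    and incomparabilities is still needed to get behaviour out of it."""
    return [a for a in options
            if not any(strictly_better(b, a) for b in options)]

# Hypothetical toy preferences: compare outcomes only within the same
# trajectory length; different lengths are treated as incomparable.
options = [("short", 3), ("short", 5), ("long", 4), ("long", 9)]

def strictly_better(a, b):
    return a[0] == b[0] and a[1] > b[1]

print(undominated(options, strictly_better))  # [('short', 5), ('long', 9)]
```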
I agree with the point around discontinuities (in particular, I think the assumptions that incremental strategies for AI alignment won’t work do tend to rely on discontinuous progress, or at the very least on progress being set at near-maximal values), but disagree with the point around VNM rationality being a dead end.
I do think they’re making the problem harder than it needs to be by implicitly assuming that all goals are long-term goals in the sense of VNM/coherence arguments, because this removes solutions that rely on, for example, deontological urges not to lie even when lying is beneficial. And I think the argument that all goals collapse into coherent goals either doesn’t actually work or becomes trivial and stops being a constraint.
But I do think that goals conforming to coherence arguments/VNM rationality, broadly construed, are likely to emerge conditional on us being able to make AIs coherent/usable for longer-term tasks, and my explanation of why this has not happened yet (at least at the relevant scale) is basically that their time horizon for completing tasks is around 2 hours on average, while most long-term goals involve at least a couple of months of planning, if not years.
Edit: Another reason is that current models don’t have anything like a long-term memory, and this might already be causing issues with the METR benchmark.
There are quite a few big issues with METR’s paper (though some of them have been partially fixed), but the issues point towards LLM time horizons being shorter than what METR reported, not longer, so this point is even more true.
So we shouldn’t be surprised that LLMs haven’t yet manifested the goals that the AI safety field hypothesized; they’re way too incapable currently.
The other part is that I do think it’s possible to make progress even under something like a worst-case VNM frame, at least assuming that agents don’t have arbitrary decision theories. The post Defining Corrigible and Useful Goals (which, by the way, is substantially underdiscussed on here) is one example, and its assumptions of reward being the optimization target and CDT being the default decision theory of AIs do look likely to hold in the critical regime, given the empirical evidence.
You might also like the post Defining Monitorable and Useful Goals.
So I don’t think we should give up on directly attacking the hard problems of alignment in a coherence/VNM rationality setting.