TAG comments on Why Corrigibility is Hard and Important (i.e. “Whence the high MIRI confidence in alignment difficulty?”)

TAG 30 Sep 2025 12:12 UTC
4 points
0

We don’t want our core values changed; we would really rather avoid the murder pill and we’d put up resistance if someone tried to force one down our throat. Which is a sensible strategy, for steering away from a world full of murders.

OTOH …people do things that are known to modify values , such as travelling, getting an education and starting a family.

The trouble is that almost all goals (for most reasonable measures you could put on a space of goals) prescribe “don’t let your goal be changed” because letting your goal get changed is usually a bad strategy for achieving your goal

A von Neumann rationalist isn’t necessarily incorrigible, it depends on the fine details of the goal specification. A goal of “ensure as many paperclip as possible in the universe” encourages self cloning, and discourages voluntary shut down. A goal of “make paperclips while you are switched in” does not. “Make paperclip while that’s your goal”, even less so.

A great deal of the danger of AI arises from the fact that sufficiently smart reasoners are likely to converge on behaviors like “gain power” and “don’t let people shut me off.”

There a solution.** If it is at all possible to instill goals, to align AI, the Instrumental Convergence problem can be countered by instilling terminal goals that are the exact opposite** … remember, instrumental goals are always subservient to terminal ones. So, if we are worried about a powerful AI going on a resource acquisition spree , we can give it a terminal goal to be economical in the use of resources.
- CuriouslyNuclear 4 Oct 2025 19:41 UTC
  1 point
  2
  Parent
  A goal of “make paperclips while you are switched in” does not. “Make paperclip while that’s your goal”, even less so.
  This proposed solution is addressed directly in the article:
  “I still don’t see why this is hard,” says the somewhat more experienced computer scientist who is not quite thinking fast enough. “Let V equal U₁ in worlds where the button has never been pressed, and let it equal U₂ in worlds where the button has been pressed at least once. Then if the original AI is a V-maximizer building more AIs, it will build them to follow V and not U₁; it won’t want the successor AI to go on maximizing U₁ after the button gets pressed because then it would expect a lower V-score. And the same would apply to modifying itself.”
  But here’s the trick: A V-maximizer’s preferences are a mixture of U₁ and U₂ depending on whether the button is pressed, and so if a V-maximizer finds that it’s easier to score well under U₂ than it is to score well under U₁, then it has an incentive to cause the button to be pressed (and thus, to scare the user). And vice versa; if the AI finds that U₁ is easier to score well under than U₂, then a V-maximizer tries to prevent the user from pressing the button.