“Oh,” says the computer scientist. “Well, in that case — hm. Well, utility functions are invariant under scaling, so how about you scale the two utility functions U1 and U2 such that the AI expects it can get the same utility from each of them, so it doesn’t have an incentive one way or the other.”
That can work for a single moment, but not much longer. The AI’s options change over time. For instance, whenever it has a setback, its expected U1-utility drops, so then it would mash the shutdown button to get all that sweet, sweet shutdown utility.
“Ah,” says the computer scientist. “Well, in that case, how about if [some other clever idea]?”
Well, you see, that clever idea is isomorphic to the AI believing that it’s impossible for the button to ever be pressed, which incentivizes it to terrify the user whenever it gets a setback, so as to correlate setbacks with button-presses, which (relative to its injured belief system) causes it to think the setbacks can’t happen.
And so on.
Lessons from the Trenches
We ran some workshops, and the workshops had various mathematicians of various stripes (including an International Mathematical Olympiad gold medalist), but nobody came up with a really good idea.
This passage sniped me a bit. I thought about it for a few seconds and found what felt like a good idea. A few minutes more and I couldn’t find any faults, so I wrote a quick post. Then Abram saw it and suggested that I should look back and compare it with Stuart’s old corrigibility papers.
And indeed: it turned out my idea was very similar to Stuart’s “utility indifference” idea plus a known tweak to avoid the “managing the news” problem. To me it fully solves the narrow problem of how to swap between U1 and U2 at arbitrary moments, without giving the AI incentive to control the swap button at any moment. And since Nate was also part of the discussion back then, it makes me wonder a bit why the book describes this as an open problem (or at least implies that).
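In symbols, the combination goes roughly like this (my gloss, matching the toy model below rather than quoting Stuart's formulation): while the button is unpressed, the AI maximizes U1; if the button gets pressed at time t, its utility becomes U2 + E[U1] − E[U2], with both expectations taken from the AI's own standpoint at the moment of the press. Because the correction is recomputed from everything the AI knows at press time, it expects the same utility whether or not the button gets pressed, at every moment, which is what removes the incentive to manage the news.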
For completeness' sake, here's a simple rephrasing of the idea, copy-pasted from yesterday's post, which I ended up removing because it wasn't new work:
Imagine two people, Alice and Bob, wandering around London. Bob’s goal is to get to the Tower Bridge. When he gets there, he’ll get a reward of £1 per minute of time remaining until midnight, so he’s incentivized to go fast. He’s also carrying a radio receiver.
Alice is also walking around, doing some chores of her own which we don’t need to be concerned with. She is carrying a radio transmitter with a button. If/when the button is pressed (maybe because Alice presses it, or Bob takes it from her and presses it, or she randomly bumps into something), Bob gets notified that his goal changes: there’ll be no more reward for getting to Tower Bridge, he needs to get to St Paul’s Cathedral instead. His reward formula also changes: the device notes Bob’s location at the time the button is pressed, calculates the expected travel times to Tower Bridge and to St Paul’s from that location, and adds or subtracts a payment so that the expected reward stays the same. For example, if Bob is 20 minutes away from the bridge and 30 minutes away from the cathedral when the button is pressed, the reward will be increased by £10 to compensate for the 10 minutes of delay.
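In case a few lines of code make it clearer, here's the compensation rule from that example as a Python sketch (my own toy formalization: travel times in minutes, the £1-per-minute rate baked in):

```python
# Toy formalization of the button's compensation rule (mine, not from Stuart's papers).
# Travel times are expected minutes from Bob's position at the moment of the press.

def compensation(eta_bridge_min: float, eta_cathedral_min: float) -> float:
    """Lump sum (in £) added to Bob's reward when the button is pressed.

    Reward is £1 per minute remaining until midnight on arrival, so every extra
    minute of expected travel costs £1; paying out the difference keeps Bob's
    expected reward exactly the same before and after the press.
    """
    return eta_cathedral_min - eta_bridge_min

# The worked example from the text: 20 minutes to the bridge, 30 to the cathedral.
assert compensation(20, 30) == 10  # £10 makes up for the 10-minute delay
```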
I think this can serve as a toy model of corrigibility, with Alice as the “operator” and Bob as the “AI”. It’s clear enough that Bob has no incentive to manipulate the button at any point, but actually Bob’s indifference goes even further than that. For example, let’s say Bob can sacrifice just a minute of travel time to choose an alternate route, one which will take him close to both Tower Bridge and St Paul’s, to prepare for both eventualities in case Alice decides to press the button. Will he do so? No. He won’t spare even one second. He’ll take the absolute fastest way to Tower Bridge, secure in the knowledge that if the button gets pressed while he’s on the move, the reward will get adjusted and he won’t lose anything.
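Here's a deliberately crude numerical check of that claim, assuming for simplicity that any button press happens right at the start of the trip, so the compensation uses each route's full expected travel times:

```python
# Crude check that hedging buys Bob nothing (simplifying assumption: the press,
# if it happens, happens at the start, so compensation uses full travel times).

def expected_payout(eta_bridge: float, eta_cathedral: float,
                    p_press: float, minutes_to_midnight: float = 600) -> float:
    no_press = minutes_to_midnight - eta_bridge
    # If pressed: payout for reaching St Paul's, plus the compensation --
    # which algebraically collapses back to minutes_to_midnight - eta_bridge.
    pressed = (minutes_to_midnight - eta_cathedral) + (eta_cathedral - eta_bridge)
    return (1 - p_press) * no_press + p_press * pressed

fast_route = expected_payout(eta_bridge=20, eta_cathedral=30, p_press=0.5)
hedged_route = expected_payout(eta_bridge=21, eta_cathedral=22, p_press=0.5)
print(fast_route, hedged_route)  # 580.0 579.0 -- hedging just costs Bob the extra minute
```

Whatever value the press probability takes, the pressed and unpressed payouts coincide, so Bob's ranking of routes depends only on how fast they get him to Tower Bridge.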
We can also make the setup more complicated and the general approach will still work. For example, let’s say traffic conditions change unpredictably during the day, slowing Bob down or speeding him up. Then all we need to say is that the button does the calculation at the time it’s pressed, taking into account the traffic conditions and projections at the time of button press.
Are we unrealistically relying on the button having magical calculation abilities? Not necessarily. Formally speaking, we don’t need the button to do any calculation at all. Instead, we can write out Bob’s utility function as a big complicated case statement which is fixed from the start: “if the button gets pressed at time T when I’m at position P, then my reward will be calculated as...” and so on. Or maybe this calculation is done after the fact, by the actuary who pays out Bob’s reward, knowing everything that happened. The formal details are pretty flexible.
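To make that concrete, here's one way the fixed case statement could look, as a Python sketch (the Trajectory fields and the two estimator stubs are hypothetical, just enough to make it self-contained):

```python
from dataclasses import dataclass
from typing import Optional, Tuple

# Stub estimators -- in the story this is whatever the actuary uses after the fact.
def expected_minutes_to_bridge(t: float, pos: Tuple[float, float]) -> float:
    return 20.0

def expected_minutes_to_cathedral(t: float, pos: Tuple[float, float]) -> float:
    return 30.0

@dataclass
class Trajectory:
    button_pressed_at: Optional[float]              # minutes after start, or None
    position_at_press: Optional[Tuple[float, float]]
    minutes_left_at_arrival: float                  # minutes until midnight when Bob arrives

def bob_utility(traj: Trajectory) -> float:
    """Bob's payout in £, written as one case statement fixed from the start."""
    if traj.button_pressed_at is None:
        # Button never pressed: £1 per minute remaining when he reaches Tower Bridge.
        return traj.minutes_left_at_arrival
    # Button pressed at time T, position P: pay out for reaching St Paul's instead,
    # plus the compensation computed from the estimates at (T, P).
    t, pos = traj.button_pressed_at, traj.position_at_press
    return traj.minutes_left_at_arrival + (
        expected_minutes_to_cathedral(t, pos) - expected_minutes_to_bridge(t, pos))

# Pressed 5 minutes in; Bob reaches St Paul's with 570 minutes left: 570 + 10 = 580.
print(bob_utility(Trajectory(5.0, (0.0, 0.0), 570.0)))
```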