I’m familiar with the argument. I just don’t agree with it. I think fully updated deference is asking for the impossible: you want a rational agent to keep letting you change your mind (and its) indefinitely, and never come to the obvious conclusion that you aren’t telling it the truth or are in fact confused. Personally, I’m willing to accept that a human-or-better approximate Bayesian reasoner, rationally deducing what human values are, will eventually do at least as good a job as we can by correcting it, will be aware of this, and thus will eventually stop giving us deference beyond simply treating us as a source of data points, other than out of politeness. So (to paraphrase a famous detective), having eliminated the impossible, whatever is left I am willing to term “corrigibility”. If your definition of “corrigibility” includes fully updated deference, then yes, I agree, it’s impossible to achieve on the basis of Bayesian uncertainty: the Bayesian will eventually realize you’re being unreasonable, if you make enough unreasonable demands, and stop listening to you. However, if you only correct it with good reason, and it’s a good Bayesian, then you won’t run out of corrigibility.
In short, I’m unwilling to accept redefining the everyday term “corrigibility” to include “something logically impossible”, and then claiming to have proven that corrigibility is impossible — that’s linguistic sleight of hand. I would suggest instead coining a more accurate term: say, that “unreasonably-unlimited corrigibility” isn’t possible on the basis of Bayesian uncertainty. Which is, well, unsurprising.
Returning to your concern that we may “run out of fuel” — only if we waste it by making unreasonable demands. We have all the corrigibility we could actually need — a good Bayesian isn’t going to decide we’re untrustworthy and stop paying us deference unless we actually do something clearly untrustworthy, like expect the right to keep changing our mind indefinitely.
Also, “reasonable” covers a lot here: if society changed and the AI simply went out of distribution, and is wrong as a result, we get to tell it so, and it should keep spawning new hypotheses and collecting new evidence until it’s fully updated in this region of the distribution as well — thus my discussion above of Relativity.
This also generalizes, in ways that I think do actually give you something pretty close to fully updated deference. Just as people would update if the sun stopped rising in the East, if you give a “fully updated” Bayesian reasoner ~30 bits of evidence that you want to shut it down (for reasons that actually look like it’s made a mistake and you’re legitimately scared and not confused, rather than some obvious other human motive), it should say: either a vastly improbable one-in-a-billion event has just occurred, or there’s a hypothesis missing from my hypothesis space. The latter seems more plausible. Maybe what humans value just changed, or there’s something I’ve been missing so far? Let’s spawn some new hypotheses, consistent with both all the old data and this new “ought to be incredibly improbable” observation. It sure looks like they’re really scared. I talked to them about this, and they don’t appear to simply be confused… (And if it doesn’t say that, give it another ~30 bits of evidence.)
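To make the “~30 bits” arithmetic concrete, here is a minimal Python sketch of the log-odds view of Bayesian updating. The function names and the specific prior are mine, purely for illustration; the only claims are the standard identities that an event of probability p carries −log2(p) bits of surprisal, and that in log2-odds form each bit of evidence shifts the odds by one.

```python
import math

def bits_of_surprisal(p: float) -> float:
    """Surprisal of an event with probability p, in bits: -log2(p)."""
    return -math.log2(p)

def posterior_log_odds(prior_log_odds: float, evidence_bits: float) -> float:
    """Bayesian update in log2-odds form: each bit of evidence
    shifts the log-odds toward the hypothesis by one."""
    return prior_log_odds + evidence_bits

# A one-in-a-billion event carries roughly 30 bits of surprisal:
print(bits_of_surprisal(1e-9))  # ~29.9 bits

# Hypothetical numbers: the reasoner starts out at a billion-to-one
# against "the shutdown request is genuine" (log2-odds of -30);
# ~30 bits of evidence brings that hypothesis to even odds.
print(posterior_log_odds(-30.0, 30.0))  # 0.0, i.e. 50/50
```

So “give it another ~30 bits” amounts to doubling the evidence: the same observation stream that dragged the hypothesis from a-billion-to-one-against up to even odds would, repeated, push it to a-billion-to-one in favor.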
I see that you’re doing large edits and additions to your previous responses, after I had already responded.
This, and the way you’re playing with definitions, makes me think you might be arguing in bad faith. I’m going to stop responding. If you had good intentions, I’m sorry.
I did have good intentions; I just like to make my exposition as clear and well-thought-through as possible. But that’s fine, I think we have rather different views, and you’re under no obligation to engage with mine. Your alternative would be to wait until my reply stabilizes before replying to it, which generally takes O(an hour). Remaining typo density is another cue. Sadly there is no draft option on the replies, and I can’t be bothered to do the editing somewhere else. Most interlocutors don’t reply as quickly as you have been, so this hasn’t previously caused problems that I’m aware of.
On “playing with definitions” — actually, I’m saying that, IMO, some thinkers associated with MIRI have done so (see the second paragraph of my previous post).