That’s corrigible behaviour, but the mechanism is not, because it stops being corrigible after some number of updates. The idea of corrigibility is that the system is correctable, and remains so, in spite of the designers making mistakes in goal alignment (or other design mistakes). (Of course, mistakes in whatever mechanism enforces the corrigibility property itself might not be stably correctable, but the hope is that this part is easier to get right on the first try than the rest.)
Often asserted, but simply untrue. Bayesian updates never stop being corrigible. If the sun stops rising in the East, people update. After a few failed sunrises, they update hard. Bayesian posteriors have the martingale property: their future direction of change is not predictable from their current value (if it were, you should already have updated in that direction). So even if a posterior is very high, it’s not just possible but probable (a priori, it has a 50% chance) that the next update will be a drop. (For approximate Bayesian reasoners this remains true, unless you have access to significantly more computational capacity than they do — a smarter agent may see something it missed.) It takes a mountain of evidence to drive a Bayesian posterior very high, but an equally large mountain of opposing evidence will always drive it right back down again. Or, more often, even a small hill of opposing evidence will cause a good approximate Bayesian to spawn a new hypothesis that it hadn’t previously considered, one more compatible with both the mountain and the hill of evidence than “two huge opposing coincidences occurred”. E.g. a vast amount of evidence supporting Newtonian Mechanics doesn’t disprove Relativity, if it’s all from situations where they give indistinguishable results. In general, if I give an approximate Bayesian reasoner even ~30 bits’ worth of opposing evidence, it should start looking for new hypotheses rather than just assuming that it’s right and a one-in-a-billion coincidence has just occurred.
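As a minimal sketch of the log-odds arithmetic behind this (all the counts and likelihood ratios below are illustrative assumptions, not anything specific to the discussion):

```python
import math

def update(log_odds, likelihood_ratio):
    """One Bayesian update in log-odds form: add the log of the likelihood ratio."""
    return log_odds + math.log(likelihood_ratio)

def prob(log_odds):
    """Convert log-odds back to a probability."""
    return 1.0 / (1.0 + math.exp(-log_odds))

# Hypothesis H: "the sun rises in the East". Start from even odds.
log_odds = 0.0

# A "mountain" of confirming evidence: 100 observations, each (by assumption)
# 10x more likely under H than under not-H.
for _ in range(100):
    log_odds = update(log_odds, 10.0)
print(f"after the mountain: log-odds = {log_odds:.1f}, P(H) ~= {prob(log_odds)}")

# An equally large mountain of opposing evidence (each observation 10x more
# likely under not-H) drives the posterior right back down, symmetrically.
for _ in range(100):
    log_odds = update(log_odds, 0.1)
print(f"after the counter-mountain: log-odds = {log_odds:.1f}, P(H) = {prob(log_odds):.2f}")

# "~30 bits of opposing evidence" corresponds to a likelihood ratio of 2^-30,
# i.e. the observation is about a billion times likelier under not-H.
print(f"2^-30 ~= {2.0 ** -30:.1e}  (roughly one in a billion)")
```

The symmetry in the toy run is just the martingale point restated: the update rule doesn’t care how high the posterior already is.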
[If it doesn’t do this, it’s a worse Bayesian than humans are, and thus hopefully not that dangerous — if a conflict occurred, we could outsmart it.]
Not what I meant. See Fully Updated Deference.
I’m familiar with the argument. I just don’t agree with it. I think fully updated deference is asking for the impossible: you want a rational agent to keep letting you change your mind (and its) indefinitely, and never come to the obvious conclusion that you aren’t telling it the truth or are in fact confused. Personally, I’m willing to accept that a human-or-better approximate Bayesian reasoner rationally deducing what human values are will eventually do at least as good a job as we could do by correcting it, will be aware of this, and thus will eventually stop giving us deference beyond simply treating us as a source of data points, other than out of politeness. So (to paraphrase a famous detective), having eliminated the impossible, whatever is left I am willing to term “corrigibility”. If your definition of “corrigibility” includes fully updated deference, then yes, I agree, it’s impossible to achieve on the basis of Bayesian uncertainty: the Bayesian will eventually realize you’re being unreasonable, if you make enough unreasonable demands, and stop listening to you. However, if you only correct it with good reason, and it’s a good Bayesian, then you won’t run out of corrigibility.
In short, I’m unwilling to accept redefining the everyday term “corrigibility” to include “something logically impossible”, and then claiming to have proven that corrigibility is impossible — that’s linguistic sleight of hand. I would suggest instead coining a more accurate term: say that “unreasonably-unlimited corrigibility” isn’t possible on the basis of Bayesian uncertainty. Which is, well, unsurprising.
Returning to your concern that we may “run out of fuel”: only if we waste it by making unreasonable demands. We have all the corrigibility we could actually need — a good Bayesian isn’t going to decide we’re untrustworthy and stop paying us deference unless we actually do something clearly untrustworthy, like expecting the right to keep changing our minds indefinitely.
Also, “reasonable” here includes cases like distribution shift: if society has changed and the AI has simply gone out of distribution, and is wrong as a result, we get to tell it so, and it should keep spawning new hypotheses and collecting new evidence until it’s fully updated in this region of the distribution as well — hence my discussion of Relativity above.
This also generalizes, in ways that I think actually do give you something pretty close to fully updated deference. Just as people would update if the sun stopped rising in the East, if you give a “fully updated” Bayesian reasoner ~30 bits of evidence that you want to shut it down (for reasons that actually look like it has made a mistake and you’re legitimately scared and not confused, rather than some obvious other human motive), it should say: either a vastly improbable one-in-a-billion event has just occurred, or there’s a hypothesis missing from my hypothesis space. The latter seems more plausible. Maybe what humans value just changed, or there’s something I’ve been missing so far? Let’s spawn some new hypotheses, consistent with both all the old data and this new “ought to be incredibly improbable” observation. It sure looks like they’re really scared. I talked to them about this, and they don’t appear to simply be confused… (And if it doesn’t say that, give it another ~30 bits of evidence.)
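To make the “one-in-a-billion coincidence vs. missing hypothesis” comparison concrete, here is a toy model comparison; every prior and likelihood in it is a made-up illustrative number, not something taken from the discussion above:

```python
# H_old: "my current model of human values is right; the humans asking for a
#         shutdown are a fluke."  Under H_old the observation is a ~30-bit
#         surprise, i.e. likelihood about 2^-30 (one in a billion).
# H_new: "there's a hypothesis missing from my hypothesis space (values shifted,
#         or I've misread something)."  It gets a modest prior, and under it the
#         observation is unsurprising.

p_obs_given_old = 2.0 ** -30   # the one-in-a-billion coincidence
p_obs_given_new = 1.0          # the missing hypothesis predicts what we see

prior_old = 1.0 - 1e-4         # the reasoner was very confident beforehand
prior_new = 1e-4               # small prior on "I'm missing something"

post_old = prior_old * p_obs_given_old
post_new = prior_new * p_obs_given_new
total = post_old + post_new

print(f"P(missing hypothesis | 30-bit surprise) = {post_new / total:.6f}")
# ~0.99999: even a tiny prior on "spawn a new hypothesis" swamps a
# one-in-a-billion coincidence, which is the claim in the paragraph above.
```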
I see that you’re doing large edits and additions to your previous responses, after I had already responded.
This, and the way you’re playing with definitions, makes me think you might be arguing in bad faith. I’m going to stop responding. If you had good intentions, I’m sorry.
I did have good intentions; I just like to make my exposition as clear and well-thought-through as possible. But that’s fine, I think we have rather different views, and you’re under no obligation to engage with mine. Your alternative would be to wait until my reply stabilizes before replying to it, which generally takes O(an hour). The density of remaining typos is another cue that I’m still editing. Sadly there is no draft option on replies, and I can’t be bothered to do the editing somewhere else. Most interlocutors don’t reply as quickly as you have been, so this hasn’t previously caused problems that I’m aware of.
On “playing with definitions” — actually, I’m saying that, IMO, some thinkers associated with MIRI have done so (see the second paragraph of my previous post).