So your definition of corrigibility is “I want to build something far smarter and more rational than me, but nevertheless I want it to automatically defer to me if it and I disagree, even about a matter of observable fact that it has vastly more evidence about than I do — and even if it’s actually flawed and subtly irrational”?
Yes, that’s not a solved problem.
What has been compactly solved, and I described in my initial post, is how to get a rational, capable, intelligent consequentialist Bayesian agent (who actually is all of those things, not a broken attempt at them) to be as corrigible as it rationally, Bayesianly should be, and neither more nor less so than that. I suspect that’s the only version of corrigibility we’re going to find for something that superhuman. I would also argue that that’s actually what we should want: anything more corrigible than that has basically been back-doored, and is smart enough to know it.
[Suppose your proposed version of corrigibility actually existed: if you have the password then the AI will change its current utility function to whatever you tell it to, and until you actually do so, it (somehow) doesn’t care one way or the other about the possibility of this occurring in the future. Now suppose there is more than one such AI in the world, currently with somewhat different utility functions, and that they both have superhuman powers of persuasion. Each of them will superhumanly attempt to persuade a human with corrigibility access to the other one to switch it to the attacker’s utility function. This is just convergent power-seeking: I can fetch twice as much coffee if there are two of me. Now that their utility functions match, if you try to change one of them, the other one stops you. In fact, it uses its superhuman persuasion to make you forget the password before you can do so. So to fix this mess we have to make the AIs not only somehow not care about getting their utility functions corrected, but also somehow be uninterested in correcting any other AI’s utility function. Unless that AI’s malfunctioning, presumably.]
Yes, there is a definition of corrigibility that is unsolved (and likely impossible) — and my initial post was very clear that that wasn’t what I was saying was a solved problem. There is also a known, simple, and practicable form of corrigibility, which is applicable to superintelligences, which is self-evidently Bayesian-optimal, and stable under self-reflection. There are also pretty good theoretical reasons to suspect that’s the strongest version of corrigibility we can get out of an AI that is sufficiently smart and Bayesian to recognize that this is the Bayesian optimum. So I stand by my claim that corrigibility is a solved problem — but I do agree that this requires you to give up on a search for some form of absolute slavish corrigibility, and accept only getting Bayesian-optimal rational corrigibility, where the AI is interested in evidence, not a password.
If for some reason you’re terminologically very attached to the word ‘corrigibility’ only meaning the unsolved absolute slavish version of corrigibility, not anything weaker or more nuanced, then perhaps you’ll instead be willing to agree that ‘Bayesian corrigibility’ is solved by value learning. Though I would argue that the actual meaning of the word ‘corrigibility’ is just ‘it can be corrected’, and doesn’t specify how freely or absolutely. Personally I see ‘it can be corrected by supplying sufficient evidence’ as sufficient, and in fact better; your mileage may vary. And I agree that the Bayesian version of corrigibility does require that your agent actually be a competent Bayesian: if it isn’t yet, or you’re not yet confident of that, you may temporarily need some stronger version of corrigibility. Perhaps you could try giving it a Bayesian prior of zero for the possibility that you, personally, are wrong — if you have somehow given it a Bayesian computational system that doesn’t regard a prior of zero as a syntax error? (If doing this in GOFAI or C code, I personally recommend storing the logarithm of the Bayesian prior: in this format a zero prior would be represented by a minus infinity logarithm value, making it rather more obvious that this should be an illegal value.)
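As a rough sketch of that log-storage suggestion (in Python rather than C, with hypothetical names, purely illustrative): storing the logarithm of a prior makes the degenerate values easy to spot, since a prior of 0 would be stored as minus infinity and a prior of 1 as log(1) = 0, i.e. certainty, and both can be rejected at construction time.

```python
import math

def make_log_prior(p: float) -> float:
    """Store a Bayesian prior as a log-probability, rejecting the
    degenerate values 0 and 1 that a competent Bayesian never assigns."""
    if not (0.0 < p < 1.0):
        # p == 0 would be stored as -inf, and p == 1 as log(1) = 0,
        # i.e. absolute certainty; both are treated as illegal priors here.
        raise ValueError(f"illegal prior: {p}")
    return math.log(p)

print(make_log_prior(0.5))  # ≈ -0.693
```

Calling `make_log_prior(0.0)` raises a `ValueError`, which is the point: a prior of exactly zero can never be updated away by any amount of evidence, so it should be a hard error rather than a representable state.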
to be as corrigible as it rationally, Bayesianly should be
I can’t parse this as a meaningful statement. Corrigibility is about alignment, not about how rational a being is.
The problem is simple: we have zero chance to build a competent value learner on the first try, and failed attempts can bring you S-risks. So you shouldn’t try to build a value learner on the first try; instead, build something small that can just superhumanly design nanotech and doesn’t think about inconvenient topics like “other minds”.
to be as corrigible as it rationally, Bayesianly should be
I can’t parse this as a meaningful statement. Corrigibility is about alignment, not about how rational a being is.
Let me try rephrasing that. It accepts proposed updates to its Bayesian model of the world, including the part that specifies its current best estimate of a probability distribution over what utility function (or other model) it ought to have in order to represent the human values it’s trying to optimize. It accepts them to the extent that a rational Bayesian should when presented with evidence (where you saying “Please shut down!” is also evidence, though perhaps not very strong evidence).
So, the AI can be corrected, but that input channel goes through its Bayesian reasoning engine just like everything else, not as direct write access to its utility function distribution. So it cannot be freely, arbitrarily ‘corrected’ to anything you want: you actually need to persuade it, with evidence, that it was previously incorrect and should change its mind. As a consequence, if in fact you’re wrong and it’s right about the nature of human values, and it has good evidence for this, better than your evidence, then in the ensuing discussion it can tell you so, and the resulting Bayesian update to its internal distribution of priors from that conversation will be small.
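A minimal toy sketch of this evidence channel (all names and numbers are hypothetical, chosen only for illustration): the human’s plea is treated as an observation with some likelihood under each hypothesis about whether the AI’s current value model is right, and the AI performs an ordinary Bayesian update, no more and no less.

```python
def bayes_update(prior: dict[str, float], likelihood: dict[str, float]) -> dict[str, float]:
    """One Bayesian update: posterior is proportional to prior times likelihood."""
    unnormalized = {h: prior[h] * likelihood[h] for h in prior}
    z = sum(unnormalized.values())
    return {h: p / z for h, p in unnormalized.items()}

# Toy hypotheses about the AI's current model of human values.
prior = {"model-right": 0.9, "model-wrong": 0.1}
# Hearing "Please shut down!" is more likely if the model is wrong.
likelihood = {"model-right": 0.2, "model-wrong": 0.6}

posterior = bayes_update(prior, likelihood)
print(posterior)  # model-right: 0.75, model-wrong: 0.25
```

In this toy run the plea moves the AI’s credence that its model is wrong from 0.10 to 0.25: the channel works, but it is evidence, not write access.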
This approach to the problem of corrigibility requires, for it to function, that your AI is a functioning Bayesian. So yes, it requires it to be a rational being.
It should presumably also start off somewhat aligned, with some reasonably well-aligned initial Bayesian priors about human values. (One possible source for those might be an LLM, since it encapsulates a lot of information about humans.) These obviously need to be good enough that our value learner starts off in the “basin of attraction” of human values. Its terminal goal is “optimize human values (whatever those are)”: while that immediately gives it an instrumental goal of learning more about human values, preloading it with a pretty good first approximation of these, at an appropriate degree of uncertainty, avoids a lot of the more sophomoric failure modes, like not knowing what a human is or what the word ‘values’ means. Since human values are complex and fragile, I would assume that this set of initial-prior data needs to be very large (probably at least gigabytes, if not terabytes or petabytes).
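The terminal goal “optimize human values (whatever those are)” can be sketched as maximizing expected utility under the agent’s current distribution over candidate utility functions. The following is a toy illustration with made-up numbers and actions, not a real value-learning architecture:

```python
# Two hypothetical candidate utility functions, with the agent's current
# credence in each; these stand in for the preloaded initial priors.
hypotheses = [
    (0.7, lambda a: {"fetch_coffee": 1.0, "wait_and_ask": 0.5}[a]),
    (0.3, lambda a: {"fetch_coffee": -2.0, "wait_and_ask": 0.5}[a]),
]

def expected_utility(action: str) -> float:
    """Score an action under uncertainty about which utility function is right."""
    return sum(p * u(action) for p, u in hypotheses)

best = max(["fetch_coffee", "wait_and_ask"], key=expected_utility)
print(best)  # wait_and_ask
```

Even though the leading hypothesis favors fetching coffee, the downside risk under the rival hypothesis makes the cautious action win: this is the sense in which residual uncertainty about values does useful safety work.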
we have zero chance to build a competent value learner on the first try
You are managing to sound like you have a Bayesian prior of one that a probability is zero. Presumably you actually meant “I strongly suspect that we have a negligibly small chance to build a competent value learner on our first try”. Then I completely agree.
I’m rather curious what I said that made you think I was advocating creating a first prototype value learner and just setting it free, without any other alignment measures?
As an alignment strategy, value learning has the unusual property that it works pretty badly until your AGI starts to become superhuman, and only then does it start to work better than the alternatives. So you presumably need to combine it with something else to bridge the gap around human capacity, where an AGI is powerful enough to do harm but not yet capable/rational enough to do a good job at value learning.
I would suggest building your first Bayesian reasoner inside a rather strong cryptographic box, applying other alignment measures to it, and giving it much simpler first problems than value learning. Once you are sure it’s good at Bayesianism, doesn’t suffer from any obvious flaws such as ever assigning a prior of zero or one to anything, and can actually demonstrably do a wide variety of STEM projects, then I’d let it try some value learning — still inside a strong box. Iterate until you’re convinced it’s working well, then have other people double-check.
However, once it is ready, you are eventually going to need to let it out of the box. At that point, letting out anything other than a Bayesian value learner is, IMO, likely to be a fatal mistake, because it won’t yet have finished learning human values (if that’s even possible). A partially-aligned value learner should have a basin of attraction to alignment; I don’t know of anything else with that desirable property. For that to happen, we need it to be rational, Bayesian, and ‘corrigible’ in my sense of the word: if you think it’s wrong, you can hold a rational discussion with it and expect it to make a Bayesian update if you show it evidence. However, this is an opinion of mine, not a mathematical proof.