At least typically, we’re talking about a strategy in the following sense. Q: Suppose you want to pick a teacher for a new classroom; how should you pick one? A: You randomly sample from teachers above some performance threshold, in some base distribution. This works best given some fixed finite amount of “counterfeit performance” in that distribution.
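For concreteness, here is a minimal sketch of that strategy, with entirely made-up numbers: each teacher has some true performance plus a bounded amount of counterfeit (gamed) performance, and we sample uniformly among those whose measured score clears a threshold.

```python
import random

# Hypothetical setup: each teacher has a true performance plus a bounded
# amount of "counterfeit" performance that inflates their measured score.
def pick_teacher(teachers, threshold):
    """Sample uniformly from the teachers whose measured score clears the threshold."""
    eligible = [t for t in teachers if t["true"] + t["counterfeit"] >= threshold]
    return random.choice(eligible)

random.seed(0)
pool = [{"true": random.uniform(0, 10), "counterfeit": random.uniform(0, 1)}
        for _ in range(1000)]
picked = pick_teacher(pool, threshold=9)
```

Because counterfeit performance is capped at 1 here, any teacher clearing the threshold of 9 must have true performance of at least 8; without that bound, the guarantee vanishes.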
If we treat the teachers as a bunch of agents, we don’t yet have a game-theoretic argument that we should actually expect the amount of counterfeit performance (I) to be bounded. It might be that all of the teachers exploit the metric as far as they can, and counterfeit performance is unbounded...
I don’t fully understand the rest of the comment.
This is a rough draft, so pointing out any errors by email or PM is greatly appreciated.
As another anecdata point, I considered writing more to pursue the prize pool but ultimately didn’t do any more (counterfactual) work!
Note: This is bound to contain a bunch of errors and sources of confusion so please let me know about them here.
Maybe the new-conversation place is the bar or snack bar. (Plausible deniability!)
[Note: This comment is three years later than the post]
The “obvious idea” here unfortunately seems not to work, because it is vulnerable to so-called “infinite improbability drives”. Suppose B is a shutdown button, and P(b|e) gives some weight to B=pressed and B=unpressed. Then, the AI will benefit from selecting a Q such that it always chooses an action a in which it enters a lottery, and if it does not win, then the button B is pushed. In this circumstance, P(b|e) is unchanged, while both P(c|b=pressed,a,e) and P(c|b=unpressed,a,e) allocate almost all of the probability to great C outcomes. So the approach will create an AI that wants to exploit its ability to determine B.
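A toy calculation (numbers entirely made up) may make the exploit concrete: the lottery action leaves the prior over B untouched while making the outcome look excellent conditional on either button state.

```python
# Toy numbers for the lottery exploit: with probability eps the AI wins a
# jackpot and the button stays unpressed; otherwise it arranges for the
# button to be pressed.
eps = 1e-6

# Conditional on b=unpressed, the AI almost surely won the jackpot; and
# conditional on b=pressed, it shut down cleanly. So P(c|b,a,e) puts nearly
# all its mass on a great outcome C either way:
p_great_given_unpressed = 1.0   # won the jackpot
p_great_given_pressed = 1.0     # clean shutdown, scored highly by the shutdown utility

# If the agent weights these conditionals by the *unchanged* prior P(b|e),
# say 50/50, the manipulative action scores as well as any action could:
prior_pressed = 0.5
score = (prior_pressed * p_great_given_pressed
         + (1 - prior_pressed) * p_great_given_unpressed)
# ...even though the actual chance of the button staying unpressed is only eps.
```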
I see. What I was trying to do was answer your terminology question by addressing simple extreme cases. E.g., if you ask an AI to disconnect its shutdown button, I don’t think it’s being incorrigible. If you ask an AI to keep you safe, and then it disconnects its shutdown button, it is being incorrigible.
I think the main way the religion case differs is that the AI system is interfering with our intellectual ability for strategizing about AI rather than our physical systems for redirecting AI, and I’m not sure how that counts. But if I ask an AI to keep me safe and it mind-controls me to want to propagate that AI, that’s sure incorrigible. Maybe, as you suggest, it’s just fundamentally ill-defined...
I could be wrong, but I feel like if I ask for education or manipulation and the AI gives it to me, and bad stuff happens, that’s not a problem with the redirectability or corrigibility of the agent. After all, it just did what it was told. Conversely, if the AI system refuses to educate me, that seems rather more like a corrigibility problem. A natural divider is that with a corrigible AI we can still inflict harm on ourselves via our use of that AI as a tool.
Does this sound right?
A corrigible AI might not turn against its operators and might not kill us all, and the outcome can still be catastrophic. To prevent this, we’d definitely want our operators to be metaphilosophically competent, and we’d definitely want our AI to not corrupt them.
I agree with this.
a corrigible misaligned superintelligence is unlikely to lead to self-annihilation, but pretty likely to lead to astronomical moral waste.
There’s a lot of broad model uncertainty here, but yes, I’m sympathetic to this position.
Does the new title seem better?
At this round of edits, my main objection would be to the remark that the AI wants us to act as yes-men, which seems dubious if the agent is (i) an Act-based agent or (ii) sufficiently broadly uncertain over values.
What I see to be the main message of the article as currently written is that humans controlling a very powerful tool (especially AI) could drive themselves into a suboptimal fixed point due to insufficient philosophical sophistication.
This I agree with.
It seems to me that for a corrigible, moderately superhuman AI, it is mostly the metaphilosophical competence of the human that matters, rather than that of the AI system. I think there are a bunch of confusions presented here, and I’ll run through them, although let me disclaim that it’s Eliezer’s notion of corrigibility that I’m most familiar with, and so I’m arguing that your critiques fall flat on Eliezer’s version.
“[The AI should] figure out whether I built the right AI and correct any mistakes I made, remain informed about the AI’s behavior and avoid unpleasant surprises, make better decisions and clarify my preferences, acquire resources and remain in effective control of them, ensure that my AI systems continue to do all of these nice things...”
You omitted a key component of the quote that almost entirely reverses its meaning. The correct quote would read [emphasis added]: “[The AI should] help me figure out whether I built the right AI and correct any mistakes I made, remain informed about the AI’s behavior and avoid unpleasant surprises, make better decisions and clarify my preferences, acquire resources and remain in effective control of them, ensure that my AI systems continue to do all of these nice things...”. I.e., the AI should help with ensuring that control continues to reside in the human, rather than in itself.
The messiah would in his heart of hearts have the best of intentions for them, and everyone would know that.
To my understanding, the point of corrigibility is that a corrigible system is supposed to benefit its human operators even if its intentions are somewhat wrong, so it is rather a non sequitur to say that an agent is corrigible because it has the best of intentions in its heart of hearts. If it truly fully understood human intentions and values, corrigibility may even be unnecessary.
He might also think it’s a good idea for his followers to all drink cyanide together, or murder some pregnant actresses, and his followers might happily comply.
Clearly you’re right that corrigibility is not sufficient for safety. A corrigible agent can still be instructed by its human operators to make a decision that is irreversibly bad. But it seems to help, and to help a lot. The point of a corrigible AI is that once it takes a few murderous actions, you can switch it off, or tell it to pursue a different objective. So for the messiah example, a corrigible messiah might poison a few followers and then, when this is discovered, respond to an instruction to desist. An incorrigible messiah might be asked to stop murdering followers, but continue to do so anyway. So many of the more mundane existential risks would be mitigated by corrigibility.
And what about more exotic ones? I argue they would also be greatly (though not entirely) reduced. Consider that a corrigible messiah may still hide poison for all of the humans at once, leading to an irrevocably terrible outcome. But why should it? If it thinks it is doing well by the humans, then its harmful actions ought to be transparent. Perhaps the AI system’s actions would not be transparent if its intelligence was so radically great that it was inclined to act in fast and incomprehensible ways. But it is hard to see how we could know with confidence that such a radically intelligent AI is the kind we will soon be dealing with. And even if we are going to deal with that kind of AI, there could be other remedies that would be especially helpful in such scenarios. For example, an AI that permits informed oversight of its activities could be superb if it was already corrigible. Then, it could not only provide truthful explanations of its future plans but also respond to feedback on them. Overall, if we had an AI system that was (1) only a little bit superhumanly smart, (2) corrigible, and (3) providing informative explanations of its planned behaviour, then it would seem that we are in a pretty good spot.
“This is absurd. Wouldn’t they obviously have cared about animal suffering if they’d reflected on it, and chosen to do something about it before blissing themselves out?”
Yeah, but they never got around to that before blissing themselves out.
I think you’re making an important point here, but here is how I would put it: If you have an AI system that is properly deferential to humans, you still need to rely on the humans not to give it any existentially catastrophic orders. But the corrigible/deferential behavior has changed the situation from one in which you’re relying on the metaphilosophical competence of the AI, to one in which you’re relying on the metaphilosophical competence of the human (albeit as filtered through the actions of the AI system). In the latter case, yes, you need to survive having a human’s power increased by some N-fold. (Not necessarily 10^15 as in the more extreme self-improvement scenarios, but by some N>1.) So when you get a corrigible AI, you still need to be very careful with what you tell it to do, but your situation is substantially improved. Note that what I’m saying is at least in some tension with the traditional story of indirect normativity. Rather than trying to give the AI very general instructions for its interpretation, I’m saying that we should in the first instance try to stabilize the world so that we can do more metaphilosophical reasoning ourselves before trying to program an AI system that can carry out the conclusions of that thinking or perhaps continue it.
Would it want to? I think yes, because it’s incentivized not to optimize for human values, but to turn humans into yes-men… The only thing I can imagine that would robustly prevent this manipulation is to formally guarantee the AI to be metaphilosophically competent itself.
Yes, an approval-directed agent might reward-hack by causing the human to approve of things that it does not value. And it might compromise the humans’ reasoning abilities while doing so. But why must the AI system’s metaphilosophical competence be the only defeater? Why couldn’t this be achieved by quantilizing, or otherwise throttling the agent’s capabilities? By restricting the agent’s activities to some narrow domain? By having the agent somehow be deeply uncertain about where the human’s approval mechanism resides? None of these seems clearly viable, but neither do any of them seem clearly impossible, especially in cases where the AI system’s capabilities are overall not far beyond those of its human operators.
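For what it’s worth, the quantilizing option can be sketched in a few lines (toy actions and a toy utility function, purely illustrative): rather than taking the argmax, the agent samples from the top q fraction of a base distribution over actions, which caps how hard it optimizes, and hence how hard it can push on the approval mechanism.

```python
import random

# Illustrative quantilizer sketch: sample uniformly from the top q fraction
# of actions under a base distribution, instead of always taking the
# utility-maximizing action.
def quantilize(actions, utility, q=0.1, rng=None):
    rng = rng or random.Random(0)
    ranked = sorted(actions, key=utility, reverse=True)
    top = ranked[:max(1, int(len(ranked) * q))]
    return rng.choice(top)

actions = list(range(100))   # base distribution: uniform over 100 toy actions
choice = quantilize(actions, utility=lambda a: a, q=0.1)
```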
Overall, I’d say superintelligent messiahs are sometimes corrigible, and they’re more likely to be aligned if so.
One phenomenon that has some of the relevant characteristics of these tech gold rushes is the online poker (and popular poker) scene of 5-10 years ago, in that it made a few nerds unusually wealthy. The level of earnings/(work x skill) was much lower in poker, though, in that it was not clearly that much higher than for things like finance.
I agree that the agent should be able to make a decent effort at telling us which of its drives are biases (/addictions) versus values. One complicating factor is that agents change their opinions about these matters over time. Imagine a philosopher who uses the drug heroin. They may very well vacillate on whether heroin satisfies their full-preferences, even if the experience of taking heroin is not changing. This could happen via introspection, via philosophical investigation, via examining fMRI scans, et cetera. It’s tricky for the human to state their biases with confidence because they may never know when they are done updating on the matter.
Intuitively, an agent might want the AI system to do this examination and then to maximize whatever turns out to be valuable. That is, you might want the bias-model to be the one that you would settle on if you thought for a long time, similarly to enlightened self-interest / extrapolated volition models. Similar problems ensue: e.g., this process may diverge. Or it may be fundamentally indeterminate whether some drives are values or biases.
Is this any different from just saying that the rationality model is the entire graph (the agent will maximize H), and the true utility function is Emo?
Seems like you need to run a Bayesian model where the AI has a prior distribution over the value of exiert and carries out some exploration/exploitation tradeoff, as in bandit problems.
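A minimal sketch of what that could look like, with made-up arm probabilities, is Thompson sampling on a Bernoulli bandit: the agent keeps a Beta posterior over each option’s value and trades exploration against exploitation by sampling from those posteriors.

```python
import random

# Thompson-sampling sketch for a Bernoulli bandit (arm probabilities are
# made up): maintain Beta(wins, losses) posteriors over each arm's value,
# sample from each posterior, and pull the arm with the highest sample.
def thompson(true_probs, steps=5000, seed=0):
    rng = random.Random(seed)
    wins = [1] * len(true_probs)     # Beta(1, 1) uniform priors
    losses = [1] * len(true_probs)
    pulls = [0] * len(true_probs)
    for _ in range(steps):
        samples = [rng.betavariate(wins[i], losses[i]) for i in range(len(true_probs))]
        arm = samples.index(max(samples))
        pulls[arm] += 1
        if rng.random() < true_probs[arm]:
            wins[arm] += 1
        else:
            losses[arm] += 1
    return pulls

pulls = thompson([0.3, 0.7])
# The agent explores early but concentrates its pulls on the better arm.
```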
So you’re saying rationality is good if your utility is linear in the quantity of some goods? (For most people it is more like logarithmic, right?) But it seems that you want to say that independent thought is usually useful...
I’m sure the 10th century peasant does have ways to have a better life, but they just don’t necessarily involve doing rationality training, which pretty obviously does not (and should not) help in all situations. Right?
It seems like we’re anchoring excessively on the question of sufficiency, when what matters is the net expected benefit. If we rephrase and ask “are there populations that are made worse off, in expectation, by more independent thought?”, the answer is clearly yes. I think that is the question we should be asking (and it fits the point I’m making).
In order to research existential risk, and to actually survive, yes, we need more thought, although this is the kind of research I had in mind in my original comment.
A few thoughts I’ve been brooding on that are vaguely relevant to your post...
One thing that I find is often disappointingly absent from LW discussions of epistemology is how much the appropriate epistemology depends on your goals and your intellectual abilities. If you are someone of median intelligence who just wants to carry out a usual trade like making shoes or something, you can largely get by with received wisdom. If you are a researcher, your entire job consists of coming up with things that aren’t already present in the market of ideas, and so using at least some local epistemology (or ‘inside view’, or ‘figuring things out’) is a job requirement. If you are trying to start a start-up, or generate any kind of invention, again, you usually have to claim to have some knowledge advantage, and so you need a more local epistemology.
Relatedly, even for any individual person, the kind of thinking I should use depends very much on context. Personally, in order to do research, I try to do a lot of my thinking by myself, in order to train myself to think well. Sure, I engage in a lot of scholarship too, and I often check my answers by discussing my thinking with others. But I do a lot more independent thinking than I did two years ago. If I am ever making a truly important decision, though, such as who to work for, it makes sense for me to be much more deferential, and to seek the advice of the people I know to be best at making that decision, and then to defer to them to a fairly large degree (notwithstanding that they lack some information, which I should adjust for).
It would be nice to see people relax blanket pronouncements (not claiming this is particularly worse in this post compared to elsewhere) in order to give a bit more attention to this dependence on context.
I came across this in Secret Science by Herbert Foerstel. I found it very interesting, although it is about 20 years out of date, and it takes on a role of advocating for freer science, rather than balancing considerations in both directions. Does anyone have newer treatments of this policy area to recommend?