Whoops, illusion of transparency! The Arbital page is the best I’ve found (for the long-term view); the rest I reasoned on my own and sharpened in some conversations with MIRI staff.
What do you think about Paul Christiano’s argument in the comment to that Arbital page?
Do you think avoiding side effects / low impact could work if an AGI was given a task like “make money” or “win this war” that unavoidably has lots of side effects? If so, can you explain why or give a rough idea of how that might work?
(Feel free not to answer if you don’t have well-formed thoughts on these questions. I’m curious what people working on this topic think about these questions, and don’t mean to put you in particular on the spot.)
My current thoughts on this:
It seems like Paul’s proposed solution here depends on the rest of Paul’s scheme working (you need the human’s opinions on what effects are important to be accurate). Of course if Paul’s scheme works in general, then it can be used for avoiding undesirable side effects.
My current understanding of how a task-directed AGI could work is: it has some multi-level world model that is mappable to a human-understood ontology (e.g. an ontology in which there is spacetime and objects), and you give it a goal that is something like “cause this variable here to be this value at this time step”. In general you want causal consequences of changing the variable to happen, but few other effects.
From this paper I wrote:
It may be possible to use the concept of a causal counterfactual (as formalized by Pearl [2000]) to separate some intended effects from some unintended ones. Roughly, “follow-on effects” could be defined as those that are causally downstream from the achievement of the goal of building the house (such as the effect of allowing the operator to live somewhere). Follow-on effects are likely to be intended and other effects are likely to be unintended, although the correspondence is not perfect. With some additional work, perhaps it will be possible to use the causal structure of the system’s world-model to select a policy that has the follow-on effects of the goal achievement but few other effects.
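To make the “follow-on effects” idea concrete, here is a minimal sketch, assuming the world-model can be reduced to a directed acyclic graph of named variables. The graph, the variable names, and the helper functions are all hypothetical and only illustrate the house-building example; nothing here is taken from the paper itself.

```python
from collections import deque


def descendants(graph, node):
    """Return every variable causally downstream of `node`.

    `graph` maps each variable to the variables it directly causes.
    """
    seen = set()
    queue = deque(graph.get(node, []))
    while queue:
        current = queue.popleft()
        if current in seen:
            continue
        seen.add(current)
        queue.extend(graph.get(current, []))
    return seen


def split_effects(graph, goal, changed):
    """Partition changed variables into follow-on effects and other effects."""
    downstream = descendants(graph, goal)
    follow_on = {v for v in changed if v in downstream}
    other = {v for v in changed if v not in downstream and v != goal}
    return follow_on, other


# Hypothetical toy world-model: the policy builds the house but also clears forest.
toy_graph = {
    "build_policy": ["house_built", "forest_cleared"],
    "house_built": ["operator_has_home"],
    "forest_cleared": ["habitat_lost"],
}
changed = {"house_built", "operator_has_home", "forest_cleared", "habitat_lost"}

follow_on, other = split_effects(toy_graph, "house_built", changed)
print("follow-on:", sorted(follow_on))
print("other:", sorted(other))
```

Here the operator having a home is downstream of the goal variable, so it counts as a follow-on effect, while the cleared forest and lost habitat are caused by the policy but not by the goal being achieved. As the quoted passage says, the correspondence with “intended” is not perfect; this only shows the graph bookkeeping.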
For things like “make money” there are going to be effects other than you having more money, e.g. some product was sold and others have less money. The hope here is that, since you have ontology mapping, you can (a) enumerate these effects and see if they seem good according to some scoring function (which need not be a utility function; conservatism may be appropriate here), and (b) check that there aren’t additional future consequences not explained by these effects (e.g. that are different from when you take a counterfactual on these effects).
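A rough sketch of steps (a) and (b), under strong simplifying assumptions: effects are plain strings produced by the ontology mapping, the “scoring function” is replaced by a conservative allow-list (anything not explicitly approved counts against the plan), and the world-model’s predictions are given as dictionaries. The function name, its arguments, and the example data are all hypothetical.

```python
def review_plan(effects, approved, predicted, explained):
    """Conservatively review a plan's effects.

    effects:   effects enumerated through the ontology mapping
    approved:  effects explicitly signed off on (an assumed allow-list)
    predicted: variable -> value, the future predicted for the actual plan
    explained: variable -> value, the future predicted from a counterfactual
               that imposes only the enumerated effects
    """
    # (a) Conservative scoring: an effect that is not approved blocks the plan.
    unapproved = sorted(e for e in effects if e not in approved)
    # (b) A variable whose predicted value differs from the counterfactual
    #     prediction is a consequence the enumerated effects do not explain.
    unexplained = sorted(v for v in predicted if explained.get(v) != predicted[v])
    return {
        "accept": not unapproved and not unexplained,
        "unapproved": unapproved,
        "unexplained": unexplained,
    }


# Hypothetical "make money" plan whose listed effects look fine,
# but whose predicted future differs from what those effects explain.
effects = {"product_sold", "customer_has_less_money"}
approved = {"product_sold", "customer_has_less_money"}
predicted = {"our_balance": "higher", "bank_ledger": "tampered"}
explained = {"our_balance": "higher", "bank_ledger": "intact"}

report = review_plan(effects, approved, predicted, explained)
print(report)
```

The example plan passes check (a) but fails check (b), because the ledger ends up in a state that the enumerated effects do not account for. All of the hard work is hidden in producing `effects`, `predicted`, and `explained`, which is the ontology mapping problem.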
I think “win this war” is going to be a pretty difficult goal to formalize (as a bunch of what is implied by “winning a war” is psychological/sociological); probably it is better to think about achieving specific military objectives.
I realize I’m shoving most of the problem into the ontology mapping / transparency problem; I think this is correct, and that this problem should be prioritized, with the understanding that avoiding unintended side effects will be one use of the ontology mapping system.
EDIT: also worth mentioning that things get weird when humans are involved. One effect of a robot building a house is that someone sees a robot building a house, but how does this effect get formalized? I am not sure whether the right approach will be to dodge the issue (by e.g. using only very simple models of humans) or to work out some ontology for theory of mind that could allow reasoning about these sorts of effects.
(a) enumerate these effects and see if they seem good according to some scoring function (which need not be a utility function; conservatism may be appropriate here), and (b) check that there aren’t additional future consequences not explained by these effects (e.g. that are different from when you take a counterfactual on these effects).
Are you aware of any previous discussion of this, in any papers or posts? I’m skeptical that there’s a good way to implement this scoring function. For example, we do want our AI to make money by inventing, manufacturing, and selling useful gadgets, and we don’t want our AI to make money by hacking into a bank, selling a biological weapon design to a terrorist, running a Ponzi scheme, or selling gadgets that may become fire hazards. I don’t see how to accomplish this without the scoring function being a utility function. Can you perhaps explain more about how “conservatism” might work here?
It should definitely take desiderata into account; I just mean it doesn’t have to be VNM. One reason why it might not be VNM is if it’s trying to produce a non-dangerous distribution over possible outcomes rather than an outcome that is not dangerous in expectation; see Quantilizers for an example of this.
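For readers who haven’t seen quantilizers, here is a minimal sketch over a finite action set: rank actions by a proxy utility, keep the top q of the base distribution’s probability mass, renormalize, and sample. The action names, base probabilities, and utilities are made up for illustration, and the finite-set simplification is an assumption of this sketch.

```python
import random


def quantilized_distribution(base, utility, q):
    """Condition `base` on its top-q probability mass, ranked by `utility`.

    base: dict mapping action -> base probability (assumed to sum to 1).
    """
    if not 0 < q <= 1:
        raise ValueError("q must be in (0, 1]")
    ranked = sorted(base, key=utility, reverse=True)
    result, remaining = {}, q
    for action in ranked:
        if remaining <= 0:
            break
        # The action on the boundary only contributes the mass that still fits.
        mass = min(base[action], remaining)
        if mass > 0:
            result[action] = mass / q
        remaining -= mass
    return result


def quantilize(base, utility, q, rng=random):
    """Draw one action from the quantilized distribution."""
    dist = quantilized_distribution(base, utility, q)
    actions = list(dist)
    return rng.choices(actions, weights=[dist[a] for a in actions], k=1)[0]


# Hypothetical base distribution (e.g. what a human might do) and proxy utility.
base = {"sell_gadgets": 0.5, "run_ads": 0.25, "hack_bank": 0.25}
money = {"sell_gadgets": 10, "run_ads": 5, "hack_bank": 100}

dist = quantilized_distribution(base, money.get, q=0.5)
choice = quantilize(base, money.get, q=0.5)
print(dist, choice)
```

A pure maximizer of the proxy would pick `hack_bank` with probability 1. The quantilizer never gives an action more than its base probability divided by q, so the dangerous action’s probability is bounded by how often the base distribution takes it. The price is that it does not maximize expected proxy utility.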
In general things like “don’t have side effects” are motivated by robustness desiderata, where we don’t trust the AI to make certain decisions so would rather it be conservative. We might not want the AI to cause X but also not want the AI to cause not-X. Things like this are likely to be non-VNM.