(a) enumerate these effects and see if they seem good according to some scoring function (which need not be a utility function; conservatism may be appropriate here), and (b) check that there aren’t additional future consequences not explained by these effects (e.g. consequences that differ from what you would get by taking a counterfactual on these effects).
Are you aware of any previous discussion of this, in any papers or posts? I’m skeptical that there’s a good way to implement this scoring function. For example, we do want our AI to make money by inventing, manufacturing, and selling useful gadgets, but we don’t want it to make money by hacking into a bank, selling a biological weapon design to a terrorist, running a Ponzi scheme, or selling gadgets that may become fire hazards. I don’t see how to accomplish this without the scoring function being a utility function. Can you perhaps explain more about how “conservatism” might work here?
It should definitely take desiderata into account; I just mean it doesn’t have to be VNM. One reason it might not be VNM is if it’s trying to produce a non-dangerous distribution over possible outcomes, rather than an outcome that is not dangerous in expectation; see Quantilizers for an example of this.
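As a concrete illustration, here is a minimal sketch of a quantilizer in Python (my own toy rendering, not code from the quantilizer paper), assuming a finite set of candidate actions, a trusted base distribution given as weights, and a score function; all the names are illustrative:

```python
import random

def quantilize(actions, base_weights, score, q=0.1, rng=random):
    """Sample an action from the top q fraction (ranked by score) of a
    base distribution over actions, instead of taking the argmax of score.

    The resulting distribution puts at most 1/q times as much probability
    on any action as the base distribution does, so the expected value of
    any cost function is at most 1/q times its expectation under the base
    distribution.
    """
    total = sum(base_weights)
    # Rank actions by score, best first, keeping their base-distribution mass.
    ranked = sorted(zip(actions, base_weights),
                    key=lambda aw: score(aw[0]), reverse=True)
    # Keep the highest-scoring actions until they cover a q fraction of the mass.
    kept, mass = [], 0.0
    for action, weight in ranked:
        kept.append((action, weight))
        mass += weight
        if mass >= q * total:
            break
    # Within the kept set, sample in proportion to the base distribution.
    top_actions, top_weights = zip(*kept)
    return rng.choices(top_actions, weights=top_weights, k=1)[0]
```

Note that the output is a sample from a whole distribution whose badness (under any cost function) is bounded relative to the base distribution, rather than an argmax of expected score; that is the sense in which it targets a non-dangerous distribution instead of maximizing anything.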
In general, things like “don’t have side effects” are motivated by robustness desiderata: we don’t trust the AI to make certain decisions, so we would rather it be conservative. We might not want the AI to cause X, but also not want it to cause not-X. Preferences like this are likely to be non-VNM.
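To spell out the non-VNM point (a standard expected-utility observation; $p$ is introduced here just for the illustration): suppose the agent had a VNM utility $u$ over the two outcomes, and leaving things alone gives X with probability $p$. Then

$$p\,u(X) + (1-p)\,u(\lnot X) \;\le\; \max\{u(X),\, u(\lnot X)\},$$

so a VNM agent over this outcome space always weakly prefers pushing the world toward whichever of X, not-X scores higher; it can never strictly prefer “cause neither” to both, which is exactly the conservative preference described above.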