Suppose we are considering an agent with a more “positive” mission than that of your pest control drone (whose purpose is best expressed negatively: get rid of small pests). For instance, perhaps the agent is working for a hedge fund and trying to increase the value of its holdings, or perhaps it’s trying to improve human health and give people longer, healthier lives.
How do you express that in terms of “disutility”?
I think what is doing the work here is not the use of “disutility” rather than “utility”, but the use of a utility function that’s (something like) bounded above and that can’t be driven sky-high by (what we would regard as) weird and counterproductive actions. (So, for the “positive” agents above, rather than forcing what they do into a “disutility” framework, one could give the hedge fund machine a utility function that stops increasing once the value of the fund reaches $100bn, and the health machine one that stops increasing once 95% of people are getting 70 QALYs or more, or something like that.) And then some counterbalancing negative term that is not artificially bounded (“number of humans harmed” in your example; maybe, more generally, some measure of “amount of change” would do, though I suspect that would be hard to express rigorously) should ensure that the machine never has reason to do anything too drastic.
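The bounded-utility-plus-unbounded-penalty idea above can be sketched in a few lines. Everything here is illustrative: the function name, the $100bn cap, and the linear harm penalty are all assumptions for the sake of the example, not a proposal for how such a term would actually be specified.

```python
def bounded_utility(fund_value_bn: float, humans_harmed: int,
                    cap_bn: float = 100.0, harm_weight: float = 1.0) -> float:
    """Toy sketch: utility saturates once the fund reaches the cap,
    while the harm penalty is unbounded, so drastic actions can only
    lose the agent utility once the cap is reached."""
    positive = min(fund_value_bn, cap_bn)  # bounded above: no reward past the cap
    penalty = harm_weight * humans_harmed  # counterbalancing term, not bounded
    return positive - penalty

# Past the cap, extra gains add nothing, but extra harm still costs:
print(bounded_utility(150.0, 0))   # same as bounded_utility(100.0, 0)
print(bounded_utility(150.0, 10))  # strictly worse, despite the larger fund
```

The design point is the asymmetry: the positive term is clipped, the negative term is not, so no “weird and counterproductive” action can buy its way past the cap.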
So: yeah, I think this is far from crazy, but I don’t think it’s going to solve the Friendly AI problem, for a few reasons:
A system of this kind can only ever do a limited amount of good. I suppose you could get around that by making a new one with a broadly similar utility function but larger bounds, once the first one has finished its job without destroying humanity. The overall effect is a kind of hill-climbing algorithm: improve the world as much as you like, but each step has to be not too large, and human beings step in and take stock after each step.
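That “hill-climbing with human oversight” loop might look something like the sketch below. Every name here (`propose_step`, `humans_approve`, the notion of a per-round `bound`) is a hypothetical placeholder for machinery the comment leaves unspecified.

```python
def bounded_improvement_loop(world, propose_step, humans_approve, max_rounds=10):
    """Toy sketch: each round, an agent with a larger bound proposes a
    limited improvement; humans take stock before it is applied, and
    the process halts as soon as they decline."""
    for bound in range(1, max_rounds + 1):
        step = propose_step(world, bound)    # improvement limited by this round's bound
        if not humans_approve(world, step):  # humans step in and take stock
            break
        world = step(world)                  # apply the approved, bounded change
    return world
```

The point of the structure is that progress is unbounded in aggregate but bounded per step, with a human checkpoint between steps.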
You are at risk of being overtaken by another system with fewer scruples about large changes—in particular, by one that doesn’t require repeated human intervention.
Relatedly, this doesn’t seem like the kind of restriction that’s stable under self-modification; we aren’t going to bootstrap our way to a quick positive singularity this way without serious risk of disaster.
To be sure that a system of this kind really is safe, that “don’t do too much harm” term in its (dis)utility function really wants to be quite general. (Caricature of the kind of failure you want to avoid: your bug-killer figures out a new insecticide and a means of delivering it widely; it doesn’t harm anyone now alive, but it does have reproductive effects, with the eventual consequence that people two generations from now will be 20 IQ points stupider or something. But no particular person is worse off.) But (1) this is going to be really hard to specify and (2) it’s likely that everything the system can think of has some long-range consequences that might be bad, so very likely it ends up never doing anything.
I agree on all points. “Bounded utility” might be a better term than “disutility”. The main point is that a halting condition triggered by success, that is, a system essentially trying to find the conditions under which it can shut itself off, seems less likely to go horribly wrong than an unbounded search for ever more utility.
This is not an attempt to solve Friendly AI. I just figure a simple hard-coded limit to how much of anything a learning machine could want chops off a couple of avenues for disaster.