So, I was thinking about the misapplication of the deduction theorem and suddenly had an insight into friendly AI. (Stop me if this has been discussed =D )
The problem is you give the AI a set of ‘axioms’ or goals and it might go off and tile the universe. Obviously we don’t know the full consequences of the logical system we’re initiating otherwise we wouldn’t need the AI in the first place.
The problem already exists in humans. Evolution gave us ‘axioms’ or desires. Some people DO go off and tile their universe: drug, sex, or work addictions, and so on. My insight stems from the fact that most people nonetheless avoid drug addiction. Here are my two proposed solutions:
(1) Fear. Many people don’t do drugs not because they lack the desire to feel good, but because they are scared. People are likewise scared of any large change (moving, a new job, the end of a relationship). Now, we don’t need the AI to favor the status quo; we can simply code into one of its axioms that large physical changes are bad. Scale this exponentially (i.e., twice the physical change SQUARES the negative weighting of the action), and do not let any positive weighting criterion on other goals scale faster than a polynomial.
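A toy sketch of this idea (my own illustration, not a formal proposal): if the reward is polynomial in goal progress but the penalty is exponential in a hypothetical scalar measure of physical change, then doubling the change squares the penalty, and large actions are always dominated by the penalty term.

```python
# Toy sketch: superlinear fear penalty vs. polynomial goal reward.
# `progress` and `change` are assumed scalar measures of goal progress
# and physical disruption; formalizing them is the hard open problem.

def utility(progress: float, change: float) -> float:
    reward = progress ** 2       # polynomial reward (degree 2)
    penalty = 2.0 ** change      # exponential penalty: doubling `change`
                                 # squares the penalty, as proposed
    return reward - penalty

# A modest action can score positively...
print(utility(progress=3.0, change=2.0))    # 9 - 4 = 5.0
# ...but scaling both up tenfold lets the penalty dominate.
print(utility(progress=30.0, change=20.0))  # 900 - 1048576 < 0
```

Because any polynomial is eventually outgrown by an exponential, no amount of goal progress justifies an arbitrarily large physical change under this weighting.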
(2) Life. Many people don’t tile their universe because they have too many different things they’d like to tile it with: hobbies, friends, lovers, variation, etc. Give the AI a multitude of goals and give the positive weight associated with each accomplishment diminishing returns (logarithmic growth at most, preferably). Then twice the tiles adds only a constant amount of gauged utility, and exponentially more tiles yields only a linear increase.
Volume of tiling is bounded above by polynomial growth in time (cubic in a 3D universe with a speed limit), so hitting it with a log penalty stifles the achievable utility to at most O(log t). If you wanted to be really safe, you could simply cap the possible utility of accomplishing any particular goal.
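The diminishing-returns idea can be sketched numerically (again, a toy illustration under assumed measures): with logarithmic per-goal utility, doubling one goal’s tile count adds only about log(2), and an optional hard cap bounds total utility outright.

```python
import math

# Toy sketch: total utility is logarithmic in each goal's tile count,
# so doubling any one count adds only ~log(2), and polynomial growth
# of tiles in time t yields at most O(log t) utility.

def total_utility(tile_counts, cap=None):
    u = sum(math.log(1 + n) for n in tile_counts)
    return min(u, cap) if cap is not None else u  # optional hard cap

base = total_utility([1000, 10, 10])
doubled = total_utility([2000, 10, 10])
print(doubled - base)  # ~= log(2): doubling one goal's tiles barely helps
```

The point of the multiple goals is that, past a modest scale, switching effort to an under-served goal always beats piling more tiles onto an over-served one.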
I’m missing something because this seems like a solid solution to me. I haven’t read most of Eliezer’s writings, unfortunately (no time, I tile my universe with math), so if there’s a good one that discusses this I’d appreciate a link.
I think the missing piece is that it’s really hard to formally specify a scale of physical change.
I think the notion of “minimizing change” is secretly invoking multiple human brain abilities, which I suspect will each turn out to be very difficult to formalize. Given partial knowledge of a current situation S: (1) to predict the future states of the world if we take some hypothetical action, (2) to invent a concrete default / null action appropriate to S, and (3) to informally feel which of two hypothetical worlds is more or less “changed” with respect to the predicted null-action world.
I think (1) (2) and (3) feel so introspectively unobtrusive because we have no introspective access into them; they’re cognitive black-boxes. We just see that their outputs are nearly always available when we need them, and fail to notice the existence of the black-boxes entirely.
You’ll also require an additional ability, a stronger form of (3) which I’m not sure even humans implement: (4) given two hypothetical worlds H1 and H2, and the predicted null-action world W0, compute the ratio difference(H1, W0) / difference(H2, W0), without dangerous corner-cases.
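To make (4) concrete, here is a hypothetical toy version in which world states are stand-in feature vectors and `difference` is an assumed Euclidean metric; choosing that metric well over real, very-high-dimensional world states is exactly the hard-to-formalize part.

```python
import math

# Hypothetical sketch of ability (4): comparing two candidate worlds'
# divergence from the predicted null-action world W0. The Euclidean
# `difference` metric and the vector encoding are illustrative
# assumptions, not a proposed formalization.

def difference(h, w0):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(h, w0)))

def change_ratio(h1, h2, w0, eps=1e-12):
    # eps guards one obvious corner case (h2 identical to w0);
    # many subtler corner cases remain.
    return difference(h1, w0) / (difference(h2, w0) + eps)

w0 = [0.0, 0.0, 0.0, 0.0]
h1 = [1.0, 0.0, 0.0, 0.0]
h2 = [2.0, 0.0, 0.0, 0.0]
print(change_ratio(h1, h2, w0))  # ~0.5: h1 is "half as changed" as h2
```

Even in this toy, the dangerous part is the metric: two worlds that are close in feature space may be wildly different in everything we care about, and vice versa.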
If you can formally specify (1), (2), and (4), then yes! I then think you can use them to construct a utility function that won’t obsess (won’t “tile the universe”) using the plan you described—though I recommend investing more effort than my 30-minute musings to prove safety, if you seem poised to actually implement this plan.
Some issues I foresee:
Humans are imperfect at (1) and (2), and the (1)- and (2)-outputs are critical not just to ensuring non-obsession, but also to the overall intelligence of the AI. While formalizing the human (1) and (2) algorithms may enable human-level general AI (a big win in its own right), superhuman AI will require non-human formalizations of (1) and (2). Inventing non-human formalizations here feels difficult and risky—though perhaps unavoidable.
The hypothetical world states in (4) are very-high-dimensional objects, so corner cases in (4) seem non-trivial to rule out. A formalization of the human (3)-implementation might be sufficient for some viable alternative plan, in which case the difficulty of formalizing (3) is bounded above by the difficulty of reverse-engineering the human (3) neurology. By contrast, inventing an inhuman (4) could be much more difficult and risky. This may be weak evidence that plans merely requiring (3) ought to be preferred over plans requiring (4).