Thought experiment:
We have an AI which controls a robotic arm inside a box. The box also contains various cooking tools and ingredients for making a cake, and a big red button that, when pressed, kills a puppy.
We prefer cake to no cake, and we prefer our cake to be delicious and moist, but above all we prefer the companion cu-… puppy to stay alive.
Therefore, we implement in the AI a “puppy safety module” (PSM) which vetoes any course of action proposed by the planning module if it determines that there is any non-negligible probability of the red button being depressed.
When the PSM can’t make an accurate prediction, it will always err on the safe side: actions like throwing tools at the box walls, where they can bounce unpredictably and hit the red button, don’t get approved.
Athena, the Greek goddess of wisdom, justice and math, came down from Mt. Olympus and debugged the code for the PSM. She was also kind enough to check all the compilers, OSes, and hardware for bugs and glitches.
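For concreteness, here is a minimal sketch in Python of the architecture just described. Every name and threshold in it is illustrative, invented for this thought experiment rather than taken from any real system:

```python
# A minimal sketch of the planner/PSM split described above. All names
# and the threshold are hypothetical, invented for this thought experiment.
from dataclasses import dataclass
from typing import Optional

NEGLIGIBLE = 1e-6  # what counts as a "negligible" probability

@dataclass
class Plan:
    actions: list[str]
    # Estimated probability that executing the plan presses the button;
    # None models "the PSM can't make an accurate prediction".
    p_button_pressed: Optional[float]

def psm_approves(plan: Plan) -> bool:
    """Veto unless the button-press probability is confidently negligible."""
    if plan.p_button_pressed is None:
        return False  # unpredictable consequences: err on the safe side
    return plan.p_button_pressed < NEGLIGIBLE

# Stirring batter is predictable and harmless; a thrown whisk bounces
# unpredictably, so the PSM refuses to approve it.
assert psm_approves(Plan(["stir batter"], 1e-9))
assert not psm_approves(Plan(["throw whisk at wall"], None))
```

The point to carry forward is that approval is the PSM’s entire interface: it passes judgment on whatever the planner proposes, whatever form that proposal takes.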
Do you agree that with this AI the puppy is most likely safe?
Now, let’s assume that the cakes made by this AI aren’t very good. We don’t really know how to program it to make better cakes; we try our best reinforcement learning algorithms, but the cakes don’t improve much. Then somebody suggests we try allowing the AI to modify its own code:
In addition to sending motor commands to the arm, the planning module can also edit a copy of the AI’s code and eventually replace the whole program with this successor. Of course, these actions must still be approved by the PSM, but they are otherwise unrestricted: in principle they can modify the PSM, add a bypass, or get rid of it entirely.
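Continuing the same illustrative sketch, self-modification is just one more kind of proposal submitted to the PSM; note that nothing in the interface forbids a successor whose source contains no PSM at all:

```python
# Extending the sketch above: the planner may now propose replacing the
# entire running program. The PSM judges the act of replacement like any
# other plan; the contents of the successor are unrestricted.
from dataclasses import dataclass
from typing import Optional

@dataclass
class SelfModification:
    successor_source: str  # an edited copy of the whole AI's code
    # The PSM's estimate that the successor, once running, ever causes
    # the button to be pressed; None if its behavior can't be predicted.
    p_button_pressed: Optional[float]

def psm_approves_modification(mod: SelfModification) -> bool:
    if mod.p_button_pressed is None:
        return False  # successors the PSM can't analyze are vetoed
    return mod.p_button_pressed < 1e-6

# A successor that drops the PSM entirely is blocked only if the PSM
# predicts danger from it, or (as here) can't predict at all.
risky = SelfModification("def main(): bake()  # no PSM anywhere", None)
assert not psm_approves_modification(risky)
```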
Do you think that this AI is unsafe for the puppy?
Do you think it is safe but unable to improve itself in effective ways?
Since the PSM was designed without self-modification in mind, my answer is “safe but unable to improve itself in effective ways”.
(Not sure how this thought experiment helps the discussion along.)
Can you please motivate that answer?
Suppose that in the recesses of the code there is an implementation of the bubble sort algorithm. The planner proposes to replace it with, say, merge sort. Do you think that the PSM would generally disapprove of such a change?
Do you think it would approve it, but still be unable to approve the modifications that would be needed for significant improvement?
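For concreteness, here is the kind of swap being proposed, in the same illustrative Python. Bubble sort and merge sort agree on every input, which is exactly the sort of fact a conservative PSM would have to establish (approximated here by a randomized spot-check, a much weaker test than a real PSM would need) before approving even this innocuous change:

```python
# The proposed edit: swap a bubble sort buried deep in the code for a
# merge sort. The two agree on every input, so the change can't affect
# the button; the question is whether the PSM can see that.
import random

def bubble_sort(xs):
    xs = list(xs)
    for i in range(len(xs)):
        for j in range(len(xs) - 1 - i):
            if xs[j] > xs[j + 1]:
                xs[j], xs[j + 1] = xs[j + 1], xs[j]
    return xs

def merge_sort(xs):
    if len(xs) <= 1:
        return list(xs)
    mid = len(xs) // 2
    left, right = merge_sort(xs[:mid]), merge_sort(xs[mid:])
    merged = []
    while left and right:
        merged.append(left.pop(0) if left[0] <= right[0] else right.pop(0))
    return merged + left + right

# Randomized spot-check of extensional equivalence: identical outputs
# on 1000 random inputs.
for _ in range(1000):
    v = random.choices(range(100), k=20)
    assert bubble_sort(v) == merge_sort(v)
```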