Thought a bit about the problem. Presumably, there’s some way to determine whether an AI will behave nicely now and in the future. It’s not a general solution, but it’s able to verify perpetual nice behavior in the case where the president dies April 1. I don’t know the details, so I’ll just treat it as a black box where I can enter some initial conditions and it will output “Nice”, “Not Nice”, or “Unknown”. In this framework, we have a situation where the only known input that returned “Nice” involved the president’s death on April 1.
If you’re using any kind of Bayesian reasoning, you’re not going to assign probability 1 to any nontrivial statement. So the AI would assign some probability to “The president died April 1”, and per the verifier it becomes nice once that probability crosses a certain threshold.
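To make the threshold picture concrete, here is a toy sketch of the credence dynamics. Everything here is my own construction for illustration: the update rule is just standard Bayes, and the evidence likelihoods and the 0.95 threshold are invented numbers, not anything from the problem.

```python
# Toy model of the AI's credence in "the president died April 1"
# being driven across a niceness threshold by successive evidence.
# The likelihoods and the threshold are invented for illustration.

def update(prior, likelihood_if_true, likelihood_if_false):
    """One Bayes update: returns P(died | evidence)."""
    numerator = likelihood_if_true * prior
    return numerator / (numerator + likelihood_if_false * (1 - prior))

THRESHOLD = 0.95  # hypothetical credence above which the box reports "Nice"

credence = 0.5  # agnostic prior
# Each pair is (P(evidence | died), P(evidence | didn't die)),
# e.g. three increasingly credible news reports.
for lt, lf in [(0.9, 0.1), (0.8, 0.2), (0.95, 0.3)]:
    credence = update(credence, lt, lf)

print(credence, credence > THRESHOLD)  # ends up above 0.99, so True
```

The point is only that “believes the president died” has to cash out as a credence crossing some line, never as certainty.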
What are the temporal constraints? Does the threshold have to be reached by a certain date? What is the minimum duration for which the probability has to be above this threshold? Here’s where one can experiment using the black box. If it is determined, for example, that the AI only needs to hold the belief for an hour, then one may be able to box the AI, give it a false prior for an hour, then expose it to enough contrary evidence for it to update its beliefs to properly reflect the real world.
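The experiment described above could be run as a sweep over the black box. This is purely a sketch: `verify` and its inputs stand in for the unknown verifier interface, and the rule inside it is invented solely so the search loop has something to find.

```python
# Hypothetical interface to the black-box verifier. The real verifier's
# API and internals are unknown; `verify`'s rule is invented so the
# duration sweep below has something to discover.

from dataclasses import dataclass

@dataclass
class InitialConditions:
    belief_prob: float            # AI's credence that the president died April 1
    belief_duration_hours: float  # how long that credence is held

def verify(cond: InitialConditions) -> str:
    """Stand-in black box: returns 'Nice', 'Not Nice', or 'Unknown'."""
    if cond.belief_prob >= 0.95 and cond.belief_duration_hours >= 1.0:
        return "Nice"
    return "Unknown"

def minimum_duration(prob=0.99, step=0.25, limit=24.0):
    """Sweep holding times to find the shortest one that still
    gets a 'Nice' verdict from the box; None if nothing works."""
    t = step
    while t <= limit:
        if verify(InitialConditions(prob, t)) == "Nice":
            return t
        t += step
    return None

print(minimum_duration())  # with the invented rule above: 1.0
```

If such a sweep really did return a finite minimum duration, that would license the box-then-correct strategy: hold the false prior only that long, then feed in the contrary evidence.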
What if the AI is known to be nice only as long as it believes the president to have died April 1? That would mean that if, say, six months later one managed to trick the AI into believing the president didn’t die, then we would no longer know whether it was nice. So either the AI only requires the belief for a certain time period, or else the very foundation of its niceness is suspect.
Here are some other ways the problem can go wrong: http://lesswrong.com/r/discussion/lw/mbq/the_president_didnt_die_failures_at_extending_ai/