I am fairly confident that I understand your intentions here. A quick summary, just to test myself:
HAL cares only about world states in which an extremely unlikely thermodynamic event occurs: namely, the worlds in which one hundred random bits are generated spontaneously during a specific time interval. HAL is perfectly aware that these events are unlikely, but cannot act in such a way as to make them more likely. HAL will therefore try to increase total utility over all possible worlds where the unlikely event occurs, and otherwise ignore the consequences of its choices.
This time interval corresponds by design with an actual signal being sent. HAL expects the signal to be sent, with a very small chance that it will be overwritten by spontaneously generated bits, in which case the world is one of those where HAL wants to maximize utility. Within the domain of world states that the machine cares about, the string of bits is random. There is a string among all these world states that corresponds to the signal, but it occurs in the world where that signal is generated randomly by the spontaneously generated bits. Thus, within the domain of interest to HAL, the signal is extremely unlikely, whereas within all domains known to HAL, the signal is extremely likely to occur by means of not being overwritten in the first place. Therefore, the machine’s behavior will treat the actual signal in a counterfactual way despite HAL’s object-level knowledge that the signal will occur with high probability.
If that’s correct, then it seems like a very interesting proposal!
I do see at least one difference between this setup and a legitimate counterfactual belief. In particular, you’ve got to worry about how the agent behaves when a fraction (1-epsilon) of all possible worlds have a constant utility. That may not be strictly equivalent to the simple counterfactual belief. Suppose, in a preposterous example, that there exists some device which marginally increases your ability to detect thermodynamic miracles (or otherwise increases your utility during such a miracle); unfortunately, if no thermodynamic miracle is detected, it explodes and destroys the Earth. If you simply believe in the usual way that a thermodynamic miracle is very likely to occur, you might not want to use the device, since it’s got catastrophic consequences for the world where your expectation is false. But if the non-miraculous world states are simply irrelevant, then you’d happily use the device.
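To make the device example concrete, here is a rough sketch in Python. Every number in it is made up for illustration (the probabilities, the utilities, and the names `EPSILON`, `UTILITY`, and `FLATTENED` are my assumptions, not part of the proposal). It compares an agent that merely believes the miracle is likely against a HAL-style agent whose utility is constant on all non-miracle worlds.

```python
# Illustrative only: all probabilities and utilities are invented.
EPSILON = 1e-30  # hypothetical true probability of the thermodynamic miracle

# Utility of each (action, world) pair. "use" means switching on the device.
UTILITY = {
    ("use", "miracle"): 11.0,      # device helps a little in miracle worlds
    ("use", "no_miracle"): -1e9,   # device explodes and destroys the Earth
    ("skip", "miracle"): 10.0,
    ("skip", "no_miracle"): 0.0,
}


def expected_utility(action, p_miracle, utility):
    """Expected utility of an action given a probability of the miracle."""
    return (p_miracle * utility[(action, "miracle")]
            + (1 - p_miracle) * utility[(action, "no_miracle")])


def best_action(p_miracle, utility):
    """Pick whichever of the two actions has higher expected utility."""
    return max(("use", "skip"),
               key=lambda a: expected_utility(a, p_miracle, utility))


# Agent A: an ordinary (false) belief that the miracle is very likely,
# combined with a utility function that still cares about every world.
believer_choice = best_action(0.99, UTILITY)

# Agent B: HAL-style. The probability estimate is accurate, but utility is
# flattened to a constant (here 0) on every non-miracle world.
FLATTENED = {k: (v if k[1] == "miracle" else 0.0) for k, v in UTILITY.items()}
hal_choice = best_action(EPSILON, FLATTENED)

print(believer_choice, hal_choice)
```

With these particular numbers the believer declines the device, because the remaining 1% chance of destroying the Earth dominates, while the HAL-style agent uses it, because only the miracle worlds contribute to its comparison. Different numbers could change the believer's choice, but not HAL's.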
As I think about it, maybe the real weirdness comes from the fact that your AI doesn’t have to worry about the possibility of being wrong about whether a thermodynamic miracle has occurred. If it acts on the false belief that a thermodynamic miracle has occurred, there can be no negative consequences from its point of view.
It can account for the ‘minimal’ probability that the signal itself occurs, of course; that’s included in the ‘epsilon’ domain of worlds that it cares about. But when the signal does go through, the AI would not necessarily act in a reasonable way on the probability that this was a non-miraculous event.
Yep, that’s pretty much it.