Here, so far as I can understand it, is UDT vs. ordinary DT for paper clips:
Ordinary DT (“ODT”) says: at all times t, act so as to maximize the number of paper clips that will be observed at time (t + 1), where “1” is a long time and we don’t have to worry about discount rates.
UDT says: in each situation s, take the action that returns the highest value on an internal lookup table that has been incorporated into me as part of my programming, which, incidentally, was programmed by people who loved paper clips.
Suppose ODT and UDT are fairly dumb, say, as smart as a cocker spaniel.
Suppose we put both agents on the set of the movie Office Space. ODT will scan the area, evaluate the situation, and simulate several different courses of action, one of which is bending staples into paper clips. Other models might include hiding, talking to accountants, and attempting to program a paper clip screensaver using Microsoft Office. The model that involves bending staples shows the highest number of paper clips in the future compared to other models, so the ODT will start bending staples. If the ODT is later surprised to discover that the boss has walked in and confiscated the staples, it will be “sad” because it did not get as much paper-clip utility as it expected to, and it will mentally adjust the utility of the “bend staples” model downward, especially when it detects boss-like objects. In the future, this may lead ODT to adopt different courses of behavior, such as “bend staples until you see boss, then hide.” The reason for changing course and adopting these other behaviors is that they would have relatively higher utility in its modeling scheme.
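A minimal sketch of that loop, under the (entirely made-up) assumption that the agent keeps a running paper-clip estimate for each course of action; none of the numbers or names are anyone's actual implementation:

```python
# Illustrative ODT agent for the Office Space example: simulate candidate
# actions, act on the one that predicts the most paper clips, then revise
# that prediction when reality disappoints. All numbers and names are invented.

predicted_clips = {
    "bend staples": 100.0,
    "hide": 5.0,
    "talk to accountants": 2.0,
    "program a paper clip screensaver": 1.0,
}

def odt_step(observe_clips, learning_rate=0.5):
    action = max(predicted_clips, key=predicted_clips.get)   # highest-utility model wins
    expected = predicted_clips[action]
    observed = observe_clips(action)   # e.g. 0 if the boss confiscates the staples
    # The "sadness" update: pull the estimate toward what actually happened,
    # so "bend staples" looks worse next time a boss-like object is detected.
    predicted_clips[action] += learning_rate * (observed - expected)
    return action

# The boss walks in and the agent gets zero clips out of bending staples:
odt_step(lambda action: 0.0)
print(predicted_clips["bend staples"])   # 50.0 -- the estimate has dropped
```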
UDT will scan the area, evaluate the situation, and categorize the situation as situation #7, which roughly corresponds to “metal available, no obvious threats, no other obvious resources,” and look up the correct action for situation #7, which its programmers have specified is “bend staples into paper clips.” Accordingly, UDT will bend staples. If UDT is later surprised to discover that the boss has wandered in and confiscated the staples, it will not care. The UDT will continue to be confident that it did the “right” thing by following its instructions for the given situation, and would behave exactly the same way if it encountered a similar situation.
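And a matching sketch of the lookup-table agent; the table contents are invented, but the point is that nothing in it changes after the boss shows up:

```python
# Illustrative UDT-style agent: classify the situation, read off the action its
# programmers wrote into the table, and never revise the table afterwards.

policy_table = {
    "situation 7: metal available, no obvious threats, no other resources":
        "bend staples into paper clips",
    # ... entries for the other situations the programmers anticipated
}

def udt_step(classify, observation):
    situation = classify(observation)   # the Office Space set maps to situation #7
    return policy_table[situation]      # same answer before and after the boss appears
```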
UDT sounds stupider, and, at cocker-spaniel levels of intelligence, it undoubtedly is. That’s why evolution designed cocker spaniels to run on ODT, which is much more Pavlovian. However, UDT has the neat little virtue that it is immune to a Counterfactual Mugging. If we could somehow design a UDT that was arbitrarily intelligent, it would both achieve great results and win in situations where ODT fails.
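Using the usual Counterfactual Mugging payoffs (give up 100 if the coin lands tails; receive 10,000 if it lands heads, but only if the predictor expects you to pay on tails), recast here in paper clips, the contrast is easy to compute:

```python
# Why the fixed-policy agent comes out ahead in a Counterfactual Mugging,
# using the standard 100 / 10,000 payoffs recast as paper clips.

P_HEADS = 0.5

def policy_value(pays_when_asked):
    # Value of a policy chosen before the coin flip, averaged over both branches.
    heads_payoff = 10_000 if pays_when_asked else 0   # Omega rewards predicted payers
    tails_payoff = -100 if pays_when_asked else 0     # the clips you hand over
    return P_HEADS * heads_payoff + (1 - P_HEADS) * tails_payoff

print(policy_value(True))    # 4950.0 -> the programmers write "pay" into UDT's table
print(policy_value(False))   # 0.0

# ODT, deciding only after the coin has already landed tails, compares
# paying (-100 clips) with refusing (0 clips) and refuses -- so it never
# collects the 10,000-clip heads branch.
```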
Here, so far as I can understand it, is Tyrell’s UDT vs. ordinary DT for paper clips:
For god’s sake, don’t call it my UDT :D. My post already seems to be giving some people the impression that I was suggesting some amendment or improvement to Wei Dai’s UDT.
Edited. [grin]