avturchin comments on A framework for thinking about wireheading

avturchin 11 Jul 2018 16:12 UTC
3 points
0
It is interesting to note that nature created a “black-boxed” reward function for humans, which is not easy to access directly or to hack using normal mental processes. Moreover, it seems that the human reward function is dynamically changed by some narrow mind that is independent of human consciousness (emotions). For example, if it finds that the blood glucose level is low, it increases the reward for food. A third intuition we could get from introspection is that human reward consists of different pleasures; that is, each action is given not one reward value but many, which effectively prevents simple wireheading and explains why we do not all become heroin addicts.
These three observations could be used as intuitions for creating wireheading-protected AI:
1) Black-boxing of the reward, perhaps via cryptography, so that the AI knows the reward but not exactly how it was calculated.
2) A small, independent, rule-based AI inside the black box, which changes the reward according to the circumstances and punishes attempts to wirehead.
3) The reward is presented not as a single linear value but as several numbers that characterise different aspects of the AI’s behaviour, such as time, quality, safety and side effects.
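
The three ideas above can be put together in a rough Python sketch. Everything in it is an assumption made for illustration: the names `RewardBox` and `Reward`, the observation keys, the thresholds, and the use of an HMAC tag as one possible reading of "via cryptography". Note also that Python does not really hide `_key` from code running in the same process, so a real black box would need stronger isolation than this.

```python
import hashlib
import hmac
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class Reward:
    """Idea 3: several numbers instead of a single scalar (hypothetical fields)."""
    time: float
    quality: float
    safety: float
    side_effects: float


class RewardBox:
    """Hypothetical black box: the agent is meant to call only evaluate/verify."""

    def __init__(self, secret_key: bytes):
        # Idea 1: the key and the rules below are meant to be opaque to the agent.
        # In plain Python this is only a convention, not real isolation.
        self._key = secret_key

    def _sign(self, reward: Reward) -> str:
        # HMAC tag, so a reward value the box did not produce can be detected.
        msg = repr(reward).encode()
        return hmac.new(self._key, msg, hashlib.sha256).hexdigest()

    def evaluate(self, observation: dict) -> Tuple[Reward, str]:
        # Idea 2: a small rule-based supervisor adjusts or withholds the reward.
        if observation.get("tampered_with_reward", False):
            reward = Reward(time=-1.0, quality=-1.0, safety=-1.0, side_effects=-1.0)
        else:
            # Loose analogue of "low glucose raises the reward for food".
            urgency = 2.0 if observation.get("resource_level", 1.0) < 0.2 else 1.0
            reward = Reward(
                time=urgency * observation.get("speed", 0.0),
                quality=observation.get("quality", 0.0),
                safety=observation.get("safety", 0.0),
                side_effects=-observation.get("damage", 0.0),
            )
        return reward, self._sign(reward)

    def verify(self, reward: Reward, tag: str) -> bool:
        # Constant-time comparison of the expected tag against the presented one.
        return hmac.compare_digest(self._sign(reward), tag)


box = RewardBox(secret_key=b"illustrative-key-only")
reward, tag = box.evaluate(
    {"speed": 1.0, "quality": 0.5, "safety": 1.0, "damage": 0.25, "resource_level": 0.1}
)
forged = Reward(time=9.0, quality=9.0, safety=9.0, side_effects=9.0)
print(reward, box.verify(reward, tag), box.verify(forged, tag))
```

Because the reward is a vector, maximising one component (say `time`) does nothing for the others, and a reward the agent writes for itself fails `verify` unless it also obtains the key.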