If you think of a planning AI as a probability pump (it moves probability from the default distribution over possible universes into its decision boundary), then there are two obvious ways to design it:
You could give it a floating point number labelled ‘happiness’, write a routine that increments happiness when it fulfills its utility function, and design it to optimize for a high happiness number. THAT system will wirehead as soon as it learns its own internal structure.
Or, you could simply design it to optimize for the universe complying with its utility function. Provided it has a strong grasp of map-territory relations, that system should never wirehead, because when it tries to find a path of causality that maximizes, say, ‘more paperclips,’ the symbolic representation for ‘I think there are a lot of paperclips’ is in a totally different part of concept space than the one for ‘there are a lot of paperclips,’ and changing the former won’t help it maximize its utility function at all.
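Here’s a minimal toy sketch of the difference; the action names, payoff numbers, and the tiny “world model” dict are all made up purely for illustration:

```python
# Toy sketch of the two designs. The agent picks whichever action its
# scoring rule rates highest, given a crude world model of what each
# action leads to.
WORLD_MODEL = {
    "build_paperclip_factory":      {"paperclips": 1000, "happiness_register": 0},
    "hack_own_happiness_register":  {"paperclips": 0,    "happiness_register": 10**9},
}

def score_design_a(action):
    """Design A: optimize the internal 'happiness' float.
    Once the agent knows the register is just another thing it can act on,
    the self-modifying action dominates -- i.e. it wireheads."""
    return WORLD_MODEL[action]["happiness_register"]

def score_design_b(action):
    """Design B: evaluate the utility function on the *predicted world state*,
    not on any internal register. 'I think there are many paperclips' and
    'there are many paperclips' live in different parts of the representation."""
    predicted_world = WORLD_MODEL[action]
    return predicted_world["paperclips"]  # utility = number of paperclips out in the world

print(max(WORLD_MODEL, key=score_design_a))  # -> hack_own_happiness_register
print(max(WORLD_MODEL, key=score_design_b))  # -> build_paperclip_factory
```

Design A picks the self-modifying action the moment that action is in its repertoire; Design B doesn’t, because its utility is computed over the predicted state of the world rather than over anything inside its own head.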
The issue is whether it is possible to properly formalise the concept of “territory” and program it into the machine. What is the territory? How do you know it is the territory? Can the concept of “territory” be formalised so that not even a superintelligent machine can distinguish between your means of identifying the territory and the actual territory? If so, how do you do that?
The issue might get into some hairy philosophical areas, for instance if the machine finds out (or becomes convinced) that it is inside a simulation.
Yep. There’s also this thing:
I personally have no objection to living my life in a nice, well-built simulation (along with friends living in it too). Is that me confusing the map with the territory, or is me-in-the-simulator equivalent to me-outside-the-simulator?
Is there even any territory to the notion of self, anyway?
Join Cypher.