Yes, the AI won’t care about self-preservation; but it also won’t care about any other interim values we’d like to program it with, except ones that amount to patterns of sensory experience for the AI.
I get why AIXI would behave like this, but it’s not obvious to me that all Cartesian AIs would have this problem. If the AI has some model of the world, and that model can still update (mostly correctly) on what comes in through its sensory channel, and predict (mostly correctly) how different outputs would change the world, it seems like it could still try to make as many paperclips as possible according to its model of the world. Does that make sense?
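To make what I mean concrete, here’s a toy sketch (entirely my own construction; the names and numbers are made up, and it’s nothing like AIXI’s actual machinery) of an agent whose objective is a paperclip count inside its world model rather than a reward on its input channel:

```python
# Toy model-based agent: its utility is a paperclip count *in its world
# model*, not a reward signal on its input channel. Purely illustrative.

class WorldModel:
    def __init__(self):
        self.paperclips = 0  # the agent's belief about paperclips in the world

    def update(self, observation):
        # A real agent would do Bayesian updating; here we just trust the sensor.
        self.paperclips = observation["clips_seen"]

    def predict(self, action):
        # Predicted paperclip count in the world if `action` is taken.
        return self.paperclips + {"make_clip": 1, "idle": 0}[action]

def choose_action(model, actions=("make_clip", "idle")):
    # Pick the action whose *predicted world* contains the most paperclips.
    return max(actions, key=model.predict)

model = WorldModel()
model.update({"clips_seen": 3})
print(choose_action(model))  # -> "make_clip"
```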
That’s a good point. AIXI is my go-to example, and AIXI’s preferences are over its input tape. But, sticking to the cybernetic agent model, there are other action-dependent things Alice could have preferences over, like portions of her work tape, or her actions themselves. She could also have preferences over input-conditional logical constructs built out of Everett’s program, like the contents of Everett’s work tape.
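Schematically, and with notation that’s mine rather than anything canonical, the contrast looks like this:

```latex
% Schematic only; the notation is mine, not from any standard formulation.
% AIXI: utility is expected summed reward read off the input tape.
\[
U_{\mathrm{AIXI}}(a_{1:m}) = \mathbb{E}\Big[\sum_{k=1}^{m} r_k \,\Big|\, a_{1:m}\Big]
\]
% Alternatives: evaluate a utility u on Alice's work tape w^A, or on
% Everett's work tape w^E (the environment's internal state), conditional
% on the same action sequence.
\[
U_{\mathrm{work}}(a_{1:m}) = \mathbb{E}\big[u(w^{A}) \mid a_{1:m}\big],
\qquad
U_{\mathrm{env}}(a_{1:m}) = \mathbb{E}\big[u(w^{E}) \mid a_{1:m}\big]
\]
```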
I agree it’s possible to build a non-AIXI-like Cartesian that wants to make paperclips, not just produce paperclip-experiences in itself. But Cartesians are weird, so it’s hard to predict how much progress that would represent.
For example, the Cartesian might wirehead under the assumption that doing so changes reality, instead of wireheading under the assumption that doing so changes its experiences. I don’t know whether a deeply dualistic agent would recognize that editing its camera to create paperclip hallucinations counts as editing its input sequence semi-directly. It might instead think of camera-hacking as a godlike way of editing reality as a whole, as though Alice had the power to create billions of representations of objective physical paperclips in Everett’s work tape just by editing the part of Everett’s work tape representing her hardware.
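As a cartoon of that confusion (a made-up toy, not a claim about any real architecture): the dualist’s model has no node for ‘my sensor can diverge from reality,’ so hacking the camera registers as editing the world itself:

```python
# Cartoon of the dualist's error (made-up example). The agent's world model
# identifies the camera reading with the external paperclip count, so hacking
# the camera looks like conjuring paperclips into objective reality.

world = {"real_paperclips": 3, "camera_reading": 3}

def dualist_belief(world):
    # No node for "my sensor can diverge from reality": whatever the camera
    # says *is* the world, as far as this model is concerned.
    return {"paperclips_in_reality": world["camera_reading"]}

def hack_camera(world, fake_count):
    world["camera_reading"] = fake_count  # reality itself is untouched

hack_camera(world, 10**9)
print(dualist_belief(world))     # believes reality now holds a billion clips
print(world["real_paperclips"])  # still 3
```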
In general, I’m worried about including anything reminiscent of Cartesian reasoning in our ‘the seed AI can help us solve this’ corner-cutting category, because I don’t formally understand the precise patterns of mistakes Cartesians make well enough to think I can predict them and stay two steps ahead of those errors. And in the time it takes to figure out exactly which patches would make Cartesians safe and predictable without rendering them useless, it’s plausible we could have just built a naturalized architecture from scratch.
Thank you, this helps clarify things for me.
Alex Mennen designed a Cartesian with preferences over its environment: “A utility-maximizing variant of AIXI.”
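Roughly, and as my own schematic gloss rather than Mennen’s exact formalism: the agent scores policies by a posterior-weighted utility over environment hypotheses, instead of by rewards read off its percept stream:

```latex
% Schematic gloss, not Mennen's exact construction: score a policy \pi by a
% posterior-weighted utility over environment programs q, rather than by
% summing rewards r_k read off the input stream.
\[
V(\pi) = \sum_{q} P\big(q \mid \text{observations}\big)\, U(q, \pi)
\]
```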