For deterministic computation: What I was trying to get at is that a traditional RL agent does some computation, gets a new input based on its actions and environment, does some more computation, and so on. (I admit that I didn’t describe this well. I edited a bit.)
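The loop I mean can be sketched in a few lines of Python. Everything here (the environment dynamics, the placeholder policy, the state space) is made up purely for illustration, not a claim about any particular RL system:

```python
import random

random.seed(0)  # fixed seed so the toy run is reproducible

def environment_step(state, action):
    # Hypothetical toy dynamics: the next observation depends on both
    # the current state and the agent's action.
    next_state = (state + action) % 10
    reward = 1.0 if next_state == 0 else 0.0
    return next_state, reward

def agent_policy(observation):
    # Placeholder for the "some computation" the agent does each step.
    return random.choice([0, 1])

state = 0
total_reward = 0.0
for _ in range(100):
    # compute -> act -> get a new input based on action and environment -> repeat
    action = agent_policy(state)
    state, reward = environment_step(state, action)
    total_reward += reward
```

The point is only the shape of the interaction: the agent's inputs are a function of its own past outputs, which is what makes the causal connection between "its computation" and "its world" so hard to hide.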
Your argument about Solomonoff induction is clever but I feel like it’s missing the point. Systems with some sense of self and self-understanding don’t generally simulate themselves or form perfect models of themselves; I know I don’t! Here’s a better statement: “I am a predictive world-model; I guess I’m probably implemented on some physical hardware somewhere.” This is a true statement, and the system can believe it without knowing what the physical hardware is (and from there it can start reasoning about what that hardware might be, e.g. looking for news stories about AI projects). I’m proposing that we can and should build world-models that don’t contain this type of belief.
What I really have in mind is: There’s a large but finite space of computable predictive models (given a bit-stream, predict the next bit). We run a known algorithm that searches through this space to find the model that best fits the internet. This model is full of insightful, semantic information about the world, as this helps it make predictions. Maybe if we do it right, the best model would not be self-reflective, not knowing what it was doing as it did its predictive thing, and thus unable to reason about its internal processes or recognize causal connections between that and the world it sees (even if such connections are blatant).
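A minimal sketch of the setup I have in mind, with a toy stand-in for the real thing: the "space of computable predictive models" here is just fixed-order Markov bit-predictors, the "known algorithm" is exhaustive search, and the score is prediction errors on the stream. All the names and the scoring rule are illustrative assumptions, not a serious proposal:

```python
from collections import defaultdict

def make_markov_predictor(order):
    """One candidate model: predict the next bit as the majority bit
    previously seen after the same length-`order` context."""
    def count_errors(bits):
        counts = defaultdict(lambda: [0, 0])  # context -> [count of 0s, count of 1s]
        errors = 0
        for i in range(len(bits)):
            ctx = tuple(bits[max(0, i - order):i])
            c0, c1 = counts[ctx]
            guess = 1 if c1 > c0 else 0
            if guess != bits[i]:
                errors += 1
            counts[ctx][bits[i]] += 1  # update the model online
        return errors
    return count_errors

def best_model(bits, max_order=4):
    # The "known algorithm that searches through this space": score every
    # candidate model on the stream and keep the best-fitting one.
    scores = {k: make_markov_predictor(k)(bits) for k in range(max_order + 1)}
    return min(scores, key=scores.get)

stream = [0, 1] * 50  # a perfectly periodic bit-stream
```

On this stream the search picks an order-1 model, since remembering one bit of context is enough to predict the alternation. Note that nothing in the search procedure tells the winning model anything about the search procedure itself, which is the property I’m hoping scales up.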
One intuition is: An oracle is supposed to just answer questions. It’s not supposed to think through how its outputs will ultimately affect the world. So, one way of ensuring that it does what it’s supposed to do, is to design the oracle to not know that it is a thing that can affect the world.
Your argument about Solomonoff induction is clever but I feel like it’s missing the point.
I agree it’s missing the point. I do get the point, and I disagree with it. I wanted to say “all three cases will build self-models,” but I couldn’t, because that may not be true for Solomonoff induction, for an unrelated reason which, as you note, misses the point. I did claim that the other two cases would be self-aware as you define it.
(I agree that Solomonoff induction might build an approximate model of itself, idk.)
Maybe if we do it right, the best model would not be self-reflective, not knowing what it was doing as it did its predictive thing, and thus unable to reason about its internal processes or recognize causal connections between that and the world it sees (even if such connections are blatant).
My claim is that we have no idea how to do this, and I think the examples in your post would not do this.
One intuition is: An oracle is supposed to just answer questions. It’s not supposed to think through how its outputs will ultimately affect the world. So, one way of ensuring that it does what it’s supposed to do, is to design the oracle to not know that it is a thing that can affect the world.
I’m not disagreeing that if we could build a self-unaware oracle, then we would be safe. That seems reasonably likely to fix agency issues (though I’d want to think about it more). My disagreement is with the premise of the argument, i.e., whether we can build self-unaware oracles at all.
I think we’re on the same page! As I noted at the top, this is a brainstorming post, and I don’t think my definitions are quite right, or that my arguments are airtight. The feedback from you and others has been super-helpful, and I’m taking that forward as I search for a more rigorous version of this, if it exists!! :-)
On further reflection, you’re right, the Solomonoff induction example is not obvious. I put a correction in my post, thanks again.