Oh, my point wasn’t against Solomonoff in general. Maybe more crisply: my claim is that different decision theories will find different “pathologies” in the Solomonoff prior. In particular, for causal and evidential decision theorists, I could totally buy the misaligned-prior bit, and I could totally buy that, if formalized, the whole thing rests on the interaction between bad decision theory and Solomonoff.
But why would you ever be able to solve the problem with a different decision theory? If the beliefs are manipulating it, it doesn’t matter what the decision theory is.
My world model would have a loose model of myself in it, and this would change which worlds I’m more or less likely to be found in. For example, a logical decision theorist, trying to model Omega, will assign very low probability to Omega having predicted that it will two-box.
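Here’s a minimal toy sketch of what I mean (my own illustration, not anything formal; the `self_model` function and the 0.99 accuracy figure are made up for the example):

```python
# Toy illustration: a world model that contains a crude model of the agent itself,
# and uses that self-model to weight hypotheses about what Omega predicted.

def self_model(decision_theory: str) -> str:
    """Crude model of the agent's own policy on Newcomb's problem."""
    # Assumed behaviour: a logical decision theorist one-boxes, a causal one two-boxes.
    return "one-box" if decision_theory == "LDT" else "two-box"

def p_omega_predicted(prediction: str, decision_theory: str,
                      omega_accuracy: float = 0.99) -> float:
    """Probability the world model assigns to Omega having made `prediction`."""
    my_predicted_action = self_model(decision_theory)
    # Omega is modelled as an accurate predictor of whatever the self-model outputs.
    return omega_accuracy if prediction == my_predicted_action else 1.0 - omega_accuracy

print(p_omega_predicted("two-box", "LDT"))  # ~0.01 -- very low, as claimed above
print(p_omega_predicted("two-box", "CDT"))  # 0.99
```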
How does this connect to malign prior problems?

No, I am not going to do what the evil super-simple-simulators want me to do, because they will try to invade my prior iff (I would act like they have invaded my prior iff they invade my prior).
In order for a decision theory to choose actions, it has to have a model of the decision problem. The way it gets a model of this decision problem is...?
But I’ll expand: An agent doing that kind of game-theory reasoning needs to model the situation it’s in. And to do that modelling it needs a prior. Which might be malign.
Malign agents in the prior don’t feel like malign agents in the prior, from the perspective of the agent with the prior. They’re just beliefs about the way the world is. You need beliefs in order to choose actions. You can’t decide to act in a way that is independent of your beliefs just because you’ve decided your beliefs are out to get you.
On top of this, how would you even decide that your beliefs are out to get you? Isn’t this also a belief?
Let M be an agent which can be instantiated in a much simpler world and has different goals from our limited Bayesian agent A. We say M is malign with respect to A if p(q|O) < p(q_{M,A}|O), where q is the “real” world and q_{M,A} is the world in which M has decided to simulate all of A’s observations for the purpose of trying to invade A’s prior.
Now what influences p(q_{M,A}|O)? Well, M will only simulate all of A’s observations if it expects this will give it some influence over A. Let L_A be an unformalized logical counterfactual operation that A could make.
Then p(q_{M,A}|O, L_A) is maximal when L_A takes M’s simulation into account, and 0 when L_A doesn’t take M’s simulation into account. In particular, if L_{A,¬M} is a logical counterfactual which doesn’t take M’s simulation into account, then
p(q_{M,A}|O, L_{A,¬M}) = 0 < p(q|O, L_{A,¬M})
So the way in which the agent “gets its beliefs” about the structure of the decision theory problem is via these logical-counterfactual-conditional operations, same as in causal decision theory, and same as in evidential decision theory.
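To make the comparison concrete, here’s a minimal toy sketch (my own construction; the numbers are arbitrary, and nothing here formalizes the logical counterfactual itself — it just hard-codes its assumed effect as a likelihood):

```python
# Two hypotheses about the source of A's observations O:
#   q     -- the "real" world
#   q_MA  -- a simpler world in which M simulates all of A's observations
# Conditioning on the counterfactual L_{A,¬M} ("A's decisions ignore M's simulation")
# zeroes out q_MA, because M only runs the simulation if it buys influence over A.

prior = {"q": 2 ** -1000, "q_MA": 2 ** -500}    # q_MA lives in a simpler world, so it gets more prior weight
likelihood_O = {"q": 1.0, "q_MA": 1.0}          # both hypotheses reproduce A's observations O

def posterior(prior, likelihood):
    weights = {h: prior[h] * likelihood[h] for h in prior}
    total = sum(weights.values())
    return {h: w / total for h, w in weights.items()}

print(posterior(prior, likelihood_O))           # q_MA dominates: p(q|O) < p(q_MA|O)

# Under L_{A,¬M}, simulating A gains M nothing, so the simulation never happens
# and q_MA can no longer explain O.
likelihood_O_given_L = {"q": 1.0, "q_MA": 0.0}
print(posterior(prior, likelihood_O_given_L))   # all posterior mass back on q
```

Obviously the real weights would come from program length under the universal prior rather than hand-picked numbers; the point is just the shape of the comparison.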
I’m not sure what the type signature of L_A is, or what it means to “not take into account M’s simulation”. When A makes decisions about which actions to take, it doesn’t have the option of ignoring the predictions of its own world model. It has to trust its own world model, right? So what does it mean to “not take it into account”?
So the way in which the agent “gets its beliefs” about the structure of the decision theory problem is via these logical-counterfactual-conditional operations
I think you’ve misunderstood me entirely. Usually in a decision problem, we assume the agent has a perfectly true world model, and we assume that it’s in a particular situation (e.g. with Omega and knowing how Omega will react to different actions). But in reality, an agent has to learn which kind of world it’s in using an inductor. That’s all I meant by “get its beliefs”.
I’m not sure what the type signature of L_A is, or what it means to “not take into account M’s simulation”
I know you know about logical decision theory, and I know you know it’s not formalized, and I’m not going to be able to formalize it in a LessWrong comment, so I’m not sure what you want me to say here. Do you reject the idea of logical counterfactuals? Do you not see how they could be used here?
I think you’ve misunderstood me entirely. Usually in a decision problem, we assume the agent has a perfectly true world model, and we assume that it’s in a particular situation (e.g. with Omega and knowing how Omega will react to different actions). But in reality, an agent has to learn which kind of world it’s in using an inductor. That’s all I meant by “get its beliefs”.
Because we’re talking about priors and their influence, all of this is happening inside the agent’s brain. The agent is going about daily life, and thinks “hm, maybe there is an evil demon simulating me who will give me −10^10 utility if I don’t do what they want for my next action”. I don’t see why this is obviously ill-defined without further specification of the training setup.
This one. I’m confused about what the intuitive intended meaning of the symbol is. Sorry, I see why “type signature” was the wrong way to express that confusion. In my mind a logical counterfactual is a model of the world, with some fact changed, and the consequences of that fact propagated to the rest of the model. Maybe L_A is a boolean fact that is edited? But if so, I don’t know which fact it is, and I’m confused by the way you described it.
Because we’re talking about priors and their influence, all of this is happening inside the agent’s brain. The agent is going about daily life, and thinks “hm, maybe there is an evil demon simulating me who will give me −10^10 utility if I don’t do what they want for my next action”. I don’t see why this is obviously ill-defined without further specification of the training setup.
Can we replace this with: “The agent is going about daily life, and its (black box) world model suddenly starts predicting that most available actions lead to −10^10 utility.”? This is what it’s like to be an agent with malign hypotheses in the world model. I think we can remove the additional complication of believing it’s in a simulation.
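A toy version of that picture, just to be concrete (my own construction; the hypothesis names, weights, and utilities are arbitrary):

```python
# The agent's black-box world model is a mixture over hypotheses, one of which is a
# malign simulator hypothesis that predicts huge negative utility for every action
# except the one the simulator wants.

hypotheses = {
    # name: (posterior weight, predicted utility for each available action)
    "ordinary world": (0.4, {"a1": 5.0, "a2": 3.0, "a3": 4.0}),
    "malign simulator": (0.6, {"a1": -1e10, "a2": -1e10, "a3": 0.0}),
}

def expected_utility(action):
    return sum(weight * utils[action] for weight, utils in hypotheses.values())

for a in ("a1", "a2", "a3"):
    print(a, expected_utility(a))
# a1 and a2 come out around -6e9; only a3 (the simulator's preferred action) looks safe.
# From the inside this is just what the world model predicts -- there is no separate
# channel telling the agent "these predictions come from a malign hypothesis".
```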