Oh, my point wasn’t against Solomonoff in general. Maybe more crisply: my claim is that different decision theories will find different “pathologies” in the Solomonoff prior. In particular, for causal and evidential decision theorists, I could totally buy the misaligned-prior bit, and I could totally buy that, if formalized, the whole thing rests on the interaction between bad decision theory and Solomonoff.
But why would you ever be able to solve the problem with a different decision theory? If the beliefs are manipulating it, it doesn’t matter what the decision theory is.
My world model would have a loose model of myself in it, and this would change which worlds I’m more or less likely to be found in. For example, a logical decision theorist, trying to model Omega, will assign very low probability to Omega having predicted that it will two-box.
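Here’s a minimal toy sketch of what I mean (my own illustration, not anything formal; the `self_model` function and the 0.99 accuracy figure are made up for the example):

```python
# Toy illustration: a world model that contains a crude model of the agent itself,
# and uses that self-model to weight hypotheses about what Omega predicted.

def self_model(decision_theory: str) -> str:
    """Crude model of the agent's own policy on Newcomb's problem."""
    # Assumed behaviour: a logical decision theorist one-boxes, a causal one two-boxes.
    return "one-box" if decision_theory == "LDT" else "two-box"

def p_omega_predicted(prediction: str, decision_theory: str,
                      omega_accuracy: float = 0.99) -> float:
    """Probability the world model assigns to Omega having made `prediction`."""
    my_predicted_action = self_model(decision_theory)
    # Omega is modelled as an accurate predictor of whatever the self-model outputs.
    return omega_accuracy if prediction == my_predicted_action else 1.0 - omega_accuracy

print(p_omega_predicted("two-box", "LDT"))  # ~0.01 -- very low, as claimed above
print(p_omega_predicted("two-box", "CDT"))  # 0.99
```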
How does this connect to malign prior problems?

No, I am not going to do what the evil super-simple-simulators want me to do, because they will try to invade my prior iff (I would act like they have invaded my prior iff they invade my prior).
In order for a decision theory to choose actions, it has to have a model of the decision problem. The way it gets a model of this decision problem is...?
But I’ll expand: An agent doing that kind of game-theory reasoning needs to model the situation it’s in. And to do that modelling it needs a prior. Which might be malign.
Malign agents in the prior don’t feel like malign agents in the prior, from the perspective of the agent with the prior. They’re just beliefs about the way the world is. You need beliefs in order to choose actions. You can’t decide to act in a way that is independent of your beliefs just because you’ve decided your beliefs are out to get you.
On top of this, how would you even decide that your beliefs are out to get you? Isn’t this also a belief?
Let M be an agent which can be instantiated in a much simpler world and has different goals from our limited Bayesian agent A. We say M is malign with respect to A if p(q|O) < p(q_{M,A}|O), where q is the “real” world and q_{M,A} is the world in which M has decided to simulate all of A’s observations for the purpose of trying to invade A’s prior.
Now what influences p(q_{M,A}|O)? Well, M will only simulate all of A’s observations if it expects this will give it some influence over A. Let L_A be an unformalized logical counterfactual operation that A could make.
Then p(q_{M,A}|O, L_A) is maximal when L_A takes M’s simulation into account, and 0 when L_A doesn’t take M’s simulation into account. In particular, if L_{A,¬M} is a logical counterfactual which doesn’t take M’s simulation into account, then
p(q_{M,A}|O, L_{A,¬M}) = 0 < p(q|O, L_{A,¬M})
So the way in which the agent “gets its beliefs” about the structure of the decision theory problem is via these logical-counterfactual-conditional operations, same as in causal decision theory, and same as in evidential decision theory.
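To make the comparison concrete, here’s a minimal toy sketch (my own construction; the numbers are arbitrary, and nothing here formalizes the logical counterfactual itself — it just hard-codes its assumed effect as a likelihood):

```python
# Two hypotheses about the source of A's observations O:
#   q     -- the "real" world
#   q_MA  -- a simpler world in which M simulates all of A's observations
# Conditioning on the counterfactual L_{A,¬M} ("A's decisions ignore M's simulation")
# zeroes out q_MA, because M only runs the simulation if it buys influence over A.

prior = {"q": 2 ** -1000, "q_MA": 2 ** -500}    # q_MA lives in a simpler world, so it gets more prior weight
likelihood_O = {"q": 1.0, "q_MA": 1.0}          # both hypotheses reproduce A's observations O

def posterior(prior, likelihood):
    weights = {h: prior[h] * likelihood[h] for h in prior}
    total = sum(weights.values())
    return {h: w / total for h, w in weights.items()}

print(posterior(prior, likelihood_O))           # q_MA dominates: p(q|O) < p(q_MA|O)

# Under L_{A,¬M}, simulating A gains M nothing, so the simulation never happens
# and q_MA can no longer explain O.
likelihood_O_given_L = {"q": 1.0, "q_MA": 0.0}
print(posterior(prior, likelihood_O_given_L))   # all posterior mass back on q
```

Obviously the real weights would come from program length under the universal prior rather than hand-picked numbers; the point is just the shape of the comparison.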
I’m not sure what the type signature of L_A is, or what it means to “not take into account M’s simulation”. When A makes decisions about which actions to take, it doesn’t have the option of ignoring the predictions of its own world model. It has to trust its own world model, right? So what does it mean to “not take it into account”?
So the way in which the agent “gets its beliefs” about the structure of the decision theory problem is via these logical-counterfactual-conditional operations
I think you’ve misunderstood me entirely. Usually in a decision problem, we assume the agent has a perfectly true world model, and we assume that it’s in a particular situation (e.g. with Omega and knowing how Omega will react to different actions). But in reality, an agent has to learn which kind of world it’s in using an inductor. That’s all I meant by “get its beliefs”.
I’m not sure what the type signature of L_A is, or what it means to “not take into account M’s simulation”
I know you know about logical decision theory, and I know you know it’s not formalized, and I’m not going to be able to formalize it in a LessWrong comment, so I’m not sure what you want me to say here. Do you reject the idea of logical counterfactuals? Do you not see how they could be used here?
I think you’ve misunderstood me entirely. Usually in a decision problem, we assume the agent has a perfectly true world model, and we assume that it’s in a particular situation (e.g. with Omega and knowing how Omega will react to different actions). But in reality, an agent has to learn which kind of world it’s in using an inductor. That’s all I meant by “get its beliefs”.
Because we’re talking about priors and their influence, all of this is happening inside the agent’s brain. The agent is going about daily life, and thinks “hm, maybe there is an evil demon simulating me who will give me −10^10 utility if I don’t do what they want for my next action”. I don’t see why this is obviously ill-defined without further specification of the training setup.
This one. I’m confused about what the intuitive intended meaning of the symbol is. Sorry, I see why “type signature” was the wrong way to express that confusion. In my mind a logical counterfactual is a model of the world, with some fact changed, and the consequences of that fact propagated to the rest of the model. Maybe L_A is a boolean fact that is edited? But if so, I don’t know which fact it is, and I’m confused by the way you described it.
Because we’re talking about priors and their influence, all of this is happening inside the agent’s brain. The agent is going about daily life, and thinks “hm, maybe there is an evil demon simulating me who will give me −10^10 utility if I don’t do what they want for my next action”. I don’t see why this is obviously ill-defined without further specification of the training setup.
Can we replace this with: “The agent is going about daily life, and its (black box) world model suddenly starts predicting that most available actions lead to −10^10 utility.”? This is what it’s like to be an agent with malign hypotheses in the world model. I think we can remove the additional complication of believing it’s in a simulation.
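A toy version of that picture, just to be concrete (my own construction; the hypothesis names, weights, and utilities are arbitrary):

```python
# The agent's black-box world model is a mixture over hypotheses, one of which is a
# malign simulator hypothesis that predicts huge negative utility for every action
# except the one the simulator wants.

hypotheses = {
    # name: (posterior weight, predicted utility for each available action)
    "ordinary world": (0.4, {"a1": 5.0, "a2": 3.0, "a3": 4.0}),
    "malign simulator": (0.6, {"a1": -1e10, "a2": -1e10, "a3": 0.0}),
}

def expected_utility(action):
    return sum(weight * utils[action] for weight, utils in hypotheses.values())

for a in ("a1", "a2", "a3"):
    print(a, expected_utility(a))
# a1 and a2 come out around -6e9; only a3 (the simulator's preferred action) looks safe.
# From the inside this is just what the world model predicts -- there is no separate
# channel telling the agent "these predictions come from a malign hypothesis".
```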