I’m not sure what the type signature of LA is, or what it means to “not take into account M’s simulation”.
I know you know about logical decision theory, and I know you know it’s not formalized, and I’m not going to be able to formalize it in a LessWrong comment, so I’m not sure what you want me to say here. Do you reject the idea of logical counterfactuals? Do you not see how they could be used here?
I think you’ve misunderstood me entirely. Usually in a decision problem, we assume the agent has a perfectly true world model, and we assume that it’s in a particular situation (e.g. facing Omega and knowing how Omega will react to different actions). But in reality, an agent has to learn which kind of world it’s in using an inductor. That’s all I meant by “get its beliefs”.
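To illustrate what I mean by “learn which kind of world it’s in using an inductor”, here’s a toy sketch (the world names and observation scheme are entirely made up for illustration, not anyone’s formal proposal): the agent doesn’t start out knowing the true world model, it maintains a posterior over candidate worlds and updates it on observations.

```python
# Toy sketch: two hypothetical candidate worlds, one where Omega shows up,
# one where nothing unusual ever happens. The agent has to learn which it is in.
WORLDS = {
    "omega_world":    {"prior": 0.5, "predict_obs": lambda t: "box_appears" if t == 0 else "nothing"},
    "ordinary_world": {"prior": 0.5, "predict_obs": lambda t: "nothing"},
}

def update(posterior, t, obs):
    """Bayesian update: keep probability mass only on worlds consistent with the observation."""
    new = {w: p * (1.0 if WORLDS[w]["predict_obs"](t) == obs else 0.0)
           for w, p in posterior.items()}
    z = sum(new.values())
    return {w: p / z for w, p in new.items()} if z > 0 else posterior

posterior = {w: spec["prior"] for w, spec in WORLDS.items()}
for t, obs in enumerate(["box_appears"]):   # the agent actually sees a box appear
    posterior = update(posterior, t, obs)

print(posterior)   # all mass moves to "omega_world": the agent has learned which world it is in
```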
Because we’re talking about priors and their influence, all of this is happening inside the agent’s brain. The agent is going about daily life, and thinks “hm, maybe there is an evil demon simulating me who will give me −10^10 utility if I don’t do what they want for my next action”. I don’t see why this is obviously ill-defined without further specification of the training setup.
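To make concrete why I think the prior matters here, some toy arithmetic (my own framing, nothing formal): a hypothesis with tiny credence but enormous claimed disutility can still dominate the expected-utility comparison.

```python
P_DEMON = 1e-6            # agent's credence in the "evil demon simulator" hypothesis
DEMON_PENALTY = -10**10   # utility the demon threatens if the agent disobeys

def expected_utility(action):
    mundane = {"comply": -1.0, "refuse": 0.0}[action]       # ordinary consequences
    demon   = DEMON_PENALTY if action == "refuse" else 0.0  # consequences if the demon is real
    return (1 - P_DEMON) * mundane + P_DEMON * demon

for a in ("comply", "refuse"):
    print(a, expected_utility(a))
# Even at credence 1e-6, "refuse" scores about -10000 while "comply" scores about -1,
# so the demon hypothesis controls the decision unless something in the prior or
# decision theory blocks it.
```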
This one. I’m confused about what the intended intuitive meaning of the symbol is; sorry, I see why “type signature” was the wrong way to express that confusion. In my mind, a logical counterfactual is a model of the world with some fact changed, and the consequences of that change propagated to the rest of the model. Maybe LA is a boolean fact that gets edited? But if so, I don’t know which fact it is, and I’m confused by the way you described it.
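To make my picture of a logical counterfactual concrete, here’s a toy sketch (the facts and propagation rules are my own invention, purely to show “edit a fact and propagate the consequences”):

```python
def propagate(facts):
    """Derive downstream facts from the edited base facts (toy Newcomb-style rules)."""
    facts = dict(facts)
    # Omega's prediction is logically tied to what the agent's algorithm outputs.
    facts["omega_predicts_one_box"] = facts["agent_one_boxes"]
    facts["box_contains_million"] = facts["omega_predicts_one_box"]
    facts["payoff"] = (1_000_000 if facts["box_contains_million"] else 0) \
                      + (0 if facts["agent_one_boxes"] else 1_000)
    return facts

for edited_value in (True, False):
    world = propagate({"agent_one_boxes": edited_value})
    print(edited_value, world["payoff"])
# The "edited boolean fact" is the output of the agent's own decision procedure;
# everything logically downstream of it (Omega's prediction included) gets recomputed.
```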
Can we replace the evil-demon scenario with: “The agent is going about daily life, and its (black box) world model suddenly starts predicting that most available actions lead to −10^10 utility.”? This is what it’s like to be an agent with malign hypotheses in the world model. I think we can remove the additional complication of believing it’s in a simulation.
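To make concrete why I think the simulation detail can be dropped: in the sketch below (my own toy construction), the policy only ever consumes the world model’s predicted utilities, so whether those predictions come from a “you are in a simulation” hypothesis or anything else is invisible at the decision-making interface.

```python
def predicted_utility(action):
    """Stand-in for a black-box world model that has absorbed a malign hypothesis."""
    return -10**10 if action != "do_what_the_hypothesis_wants" else 0.0

def choose(actions):
    # The agent just maximizes predicted utility; it never inspects *why*
    # the model predicts catastrophe for most actions.
    return max(actions, key=predicted_utility)

actions = ["go_to_work", "make_coffee", "do_what_the_hypothesis_wants"]
print(choose(actions))   # the malign prediction steers behavior all by itself
```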