Jeremy Gillen comments on Changing my mind about Christiano’s malign prior argument

Jeremy Gillen 4 Apr 2025 20:44 UTC
2 points
0
Do you not see how they could be used here?
This one. I’m confused about what the intuitive intended meaning of the symbol is. Sorry, I see why “type signature” was the wrong way to express that confusion. In my mind a logical counterfactual is a model of the world, with some fact changed, and the consequences of that fact propagated to the rest of the model. Maybe $L_{A}$ is a boolean fact that is edited? But if so I don’t know which fact it is, and I’m confused by the way you described it.
Because we’re talking about priors and their influence, all of this is happening inside the agent’s brain. The agent is going about daily life, and thinks “hm, maybe there is an evil demon simulating me who will give me −10¹⁰10^10 utility if I don’t do what they want for my next action”. I don’t see why this is obviously ill-defined without further specification of the training setup.
Can we replace this with: “The agent is going about daily life, and its (black box) world model suddenly starts predicting that most available actions actions lead to −10¹⁰ utility.”? This is what it’s like to be an agent with malign hypotheses in the world model. I think we can remove the additional complication of believing its in a simulation.