This one. I’m confused about what the intuitive intended meaning of the symbol is. Sorry, I see why “type signature” was the wrong way to express that confusion. In my mind a logical counterfactual is a model of the world, with some fact changed, and the consequences of that fact propagated to the rest of the model. Maybe LA is a boolean fact that is edited? But if so I don’t know which fact it is, and I’m confused by the way you described it.
Because we’re talking about priors and their influence, all of this is happening inside the agent’s brain. The agent is going about daily life, and thinks “hm, maybe there is an evil demon simulating me who will give me $-10^{10}$ utility if I don’t do what they want for my next action”. I don’t see why this is obviously ill-defined without further specification of the training setup.
Can we replace this with: “The agent is going about daily life, and its (black box) world model suddenly starts predicting that most available actions lead to $-10^{10}$ utility.”? This is what it’s like to be an agent with malign hypotheses in the world model. I think we can remove the additional complication of believing it’s in a simulation.