Patrick and I discussed something like this at a previous MIRIx. I think the big problem is that (if I understand what you’re suggesting) it basically just implements CDT.
For example, in Newcomb’s problem: if X=1 implies Omega’s prediction is correct, and X=0 implies the agent won’t necessarily act as predicted, then an agent that chooses conditional on X=0 will two-box.
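To make the worry concrete, here’s a minimal sketch of that calculation, assuming the standard Newcomb payoffs (the names and the probability below are illustrative, not from the discussion). Conditioned on X=0 the prediction is held fixed, independent of the action, so two-boxing dominates for every prior — which is exactly the CDT answer:

```python
BOX_A = 1_000         # transparent box: always contains $1,000
BOX_B = 1_000_000     # opaque box: filled iff Omega predicted one-boxing

def payoff(action, prediction):
    """Newcomb payoffs: 'one' = take only box B, 'two' = take both."""
    b = BOX_B if prediction == "one" else 0
    return b + (BOX_A if action == "two" else 0)

def choice_conditional_on_x0(p_filled):
    """In the X=0 branch the prediction no longer tracks the action,
    so it is held fixed at a prior probability p_filled of box B
    being full -- the CDT expected-utility calculation."""
    eu = {a: p_filled * payoff(a, "one") + (1 - p_filled) * payoff(a, "two")
          for a in ("one", "two")}
    return max(eu, key=eu.get)

print(choice_conditional_on_x0(p_filled=0.99))  # -> 'two', for any p_filled
```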
I’m not sure I understand this.
The example I was thinking of was that, instead of e.g. conditioning on “the button wasn’t pressed” in corrigibility, you have corrigibility implemented only if the button is pressed AND X=1. Then the counterfactual is just X=0 (sketched below).
Is there a CDT angle to that?
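A minimal sketch of that setup, with placeholder names (u_normal, u_shutdown, world) that are mine rather than the poster’s; the point is only the conditional structure:

```python
def u_normal(world):
    """The agent's ordinary objective (placeholder)."""
    return world.get("paperclips", 0)

def u_shutdown(world):
    """Rewards having shut down (placeholder)."""
    return 1.0 if world.get("shut_down") else 0.0

def utility(world, button_pressed, x):
    """Corrigible (shutdown) behaviour applies only when the button
    is pressed AND the stochastic event X came out 1."""
    if button_pressed and x == 1:
        return u_shutdown(world)
    return u_normal(world)

def counterfactual_utility(world, button_pressed):
    """The counterfactual is 'just X=0': rather than imagining the
    button un-pressed, condition on the X=0 branch, in which the
    button has no effect on the utility."""
    return utility(world, button_pressed, x=0)
```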
We might be talking about different things when we talk about counterfactuals. Let me be more explicit:
Say an agent is playing against a copy of itself in the prisoner’s dilemma. It must evaluate what happens if it cooperates and what happens if it defects. To do so, it needs to be able to predict what the world would look like “if it took action A”. That prediction is what I call a “counterfactual”, and it’s not always obvious how to construct one. (In the counterfactual world where the agent defects, is the action of its copy also set to ‘defect’, or is it held constant?)
In this scenario, how do you use a stochastic event to “construct a counterfactual”? (I can think of some easy ways of doing this, some of which are essentially equivalent to using CDT, but I’m not quite sure which one you want to discuss.)
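For concreteness, a hypothetical sketch of the two constructions in the parenthetical above (the payoff matrix and function names are mine); they recommend opposite actions:

```python
# (my_action, copy_action) -> my payoff, standard prisoner's dilemma
PAYOFF = {("C", "C"): 3, ("C", "D"): 0,
          ("D", "C"): 5, ("D", "D"): 1}

def eu_copy_mirrors(action):
    """Counterfactual in which the copy's action is set along with
    mine (it runs the same computation): compare (C,C) vs (D,D)."""
    return PAYOFF[(action, action)]

def eu_copy_held_fixed(action, copy_action):
    """CDT-style counterfactual: surgically set my action while
    holding the copy's action constant."""
    return PAYOFF[(action, copy_action)]

print(max("CD", key=eu_copy_mirrors))                       # -> 'C'
print(max("CD", key=lambda a: eu_copy_held_fixed(a, "C")))  # -> 'D'
```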
I see now why you think this gives CDT! I wasn’t meaning for this to be used for counterfactuals about the agent’s own decision, but about an event (possibly a past event) that “could have” turned out some other way.
The example was meant to replace the “press” with something more unhackable.