Counterfactuals in decision theory are about variation of environment and state of knowledge considered by an agent with fixed goals. I’m recently thinking that maybe it’s variation of preference that needs to be considered to capture corrigibility in decision theory. Something similar happens with uncertainty about (fixed) preference, but that conflates state of preference with state of knowledge, and the two pieces of data determining an agent in a given state might want to stay separate (as they vary across possible worlds).
In this setting, the counterfactuals are worlds/models where the facts can be different or determined to different extents (the latter point is often neglected), which can be thought of as worlds/states of a Kripke frame, or as points of a topological space (possibly a domain) ordered by specialization.
Incidentally, might be useful to call the whole thing an agent, across all counterfactuals, instead of separating its parts that exist in different possible worlds when using this term. This gives an unusual meaning in the case with variation of preference (model of corrigibility), so that an agent is internally incoherent, has different preference across different parts of itself, with preference in each part talking about the whole, and acausal trade coordinating across disagreements/variation of both fact and preference, in these terms within a single agent. (This reframes something intended as a model of corrigibility into something with inner alignment tension. The hope is that externally this behaves like soft optimization.)
Counterfactuals in decision theory are about variation of environment and state of knowledge considered by an agent with fixed goals. I’m recently thinking that maybe it’s variation of preference that needs to be considered to capture corrigibility in decision theory. Something similar happens with uncertainty about (fixed) preference, but that conflates state of preference with state of knowledge, and the two pieces of data determining an agent in a given state might want to stay separate (as they vary across possible worlds).
In this setting, the counterfactuals are worlds/models where the facts can be different or determined to different extents (the latter point is often neglected), which can be thought of as worlds/states of a Kripke frame, or as points of a topological space (possibly a domain) ordered by specialization.
Incidentally, might be useful to call the whole thing an agent, across all counterfactuals, instead of separating its parts that exist in different possible worlds when using this term. This gives an unusual meaning in the case with variation of preference (model of corrigibility), so that an agent is internally incoherent, has different preference across different parts of itself, with preference in each part talking about the whole, and acausal trade coordinating across disagreements/variation of both fact and preference, in these terms within a single agent. (This reframes something intended as a model of corrigibility into something with inner alignment tension. The hope is that externally this behaves like soft optimization.)