You’ve explicitly said we’re ignoring path dependencies and time
I wasn’t very clear, but I meant this in a less restrictive sense than what you’re imagining.
I meant only that if you know the diagram, and you know the current state, you have everything you need to work out what the agent ought to do (according to its preferences) in its next action.
I’m trying to rule out cases where the optimal action from a state A depends on some extra info beyond the simple fact that we’re in A, such as the trajectory we took to get there, or some number like “money” that hasn’t been included in the states but is still supposed to follow the agent around somehow.
But I still allow that the agent may be doing time discounting, which would make A->B->C less desirable than A->C.
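To make the discounting point concrete (the value $u$ and discount factor $\gamma$ below are just my own illustration, not part of the setup): if the agent only cares about reaching C, and each step of delay multiplies value by some $\gamma$ with $0 < \gamma < 1$, then

$$ V(A \to C) \;=\; \gamma\, u(C) \;>\; \gamma^2\, u(C) \;=\; V(A \to B \to C), $$

so the direct route is strictly preferred whenever $u(C) > 0$, even though both trajectories end in the same state.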
The setup is meant to be fairly similar to an MDP, although it’s deterministic (mainly for presentational simplicity), and we are given pairwise preferences rather than a reward function.
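Here is a minimal sketch of how I’m picturing it; every name in it (successors, prefers, choose) is my own illustration rather than anything from the original setup. The only point is that the policy is a function of the current state alone, with a deterministic transition diagram and a pairwise preference relation standing in for a reward function.

```python
# Illustrative sketch only: a deterministic "diagram" plus pairwise preferences.
from typing import Dict, List, Set, Tuple

State = str
Action = str

# The diagram: from each state, each available action leads to exactly one next state.
successors: Dict[Tuple[State, Action], State] = {
    ("A", "go_B"): "B",
    ("A", "go_C"): "C",
    ("B", "go_C"): "C",
}

# Instead of a reward function, pairwise preferences over states:
# (x, y) means "x is preferred to y".
prefers: Set[Tuple[State, State]] = {("C", "B"), ("C", "A"), ("B", "A")}

def available_actions(s: State) -> List[Action]:
    return [a for (s0, a) in successors if s0 == s]

def choose(s: State) -> Action:
    """No path dependence: the chosen action depends only on the current state,
    not on the history or any side information like 'money'."""
    acts = available_actions(s)
    def beats_all(a: Action) -> bool:
        target = successors[(s, a)]
        return all(
            target == successors[(s, b)] or (target, successors[(s, b)]) in prefers
            for b in acts
        )
    for a in acts:
        if beats_all(a):
            return a
    return acts[0]  # fall back arbitrarily if the preferences don't single one out

print(choose("A"))  # -> "go_C" under the illustrative preferences above
```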