It might be relevant to note that the meaningfulness of this coherence definition depends on the chosen environment. For instance, take a deterministic forest MDP: one where an agent at a state $s$ can never return to $s$, and there is only one path between any two states. Suppose we have a deterministic policy $\pi$ starting at $s_0$, and let $s_1 = \pi(s_0)$, $s_2 = \pi(s_1)$, etc. Then for the zero-current-payoff Bellman equations, we only need that $V(s_1) \geq V(s')$ for any successor $s'$ of $s_0$, $V(s_2) \geq V(s')$ for any successor $s'$ of $s_1$, etc. We can achieve this easily by, for example, letting all values except $V(s_1), V(s_2), \ldots$ be near-zero; since $s_i$ is a successor of $s_j$ iff $i = j + 1$ (as otherwise there would be a cycle), this fits our criterion. Thus, every policy is coherent in this environment. (I haven't done the explicit math here, but I suspect that this also works for non-deterministic policies and stochastic MDPs.)
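To make the construction concrete, here is a minimal sketch that builds a random deterministic forest MDP, assigns value $1$ along an arbitrary policy's trajectory and near-zero everywhere else, and checks that the policy is greedy with respect to those values. All names (`build_random_tree`, `coherence_values`, `is_greedy`) are my own illustration, not anything from the post:

```python
import random

def build_random_tree(n, seed=0):
    """Random directed tree on states 0..n-1, rooted at 0 (a deterministic forest MDP)."""
    rng = random.Random(seed)
    children = {s: [] for s in range(n)}
    for s in range(1, n):
        children[rng.randrange(s)].append(s)  # attach each new state under an earlier one: no cycles
    return children

def coherence_values(children, policy, s0=0, on_path=1.0, off_path=1e-9):
    """Assign V = 1 along the policy's trajectory from s0, near-zero elsewhere."""
    V = {s: off_path for s in children}
    V[s0] = on_path
    s = s0
    while children[s]:           # follow the policy until it reaches a leaf
        s = policy(s)
        V[s] = on_path
    return V

def is_greedy(children, policy, V, s0=0):
    """Check the zero-current-payoff condition: the policy always moves to a max-value successor."""
    s = s0
    while children[s]:
        if V[policy(s)] < max(V[c] for c in children[s]):
            return False
        s = policy(s)
    return True

children = build_random_tree(50)
policy = lambda s: random.Random(s).choice(children[s])  # an arbitrary deterministic policy
V = coherence_values(children, policy)
print(is_greedy(children, policy, V))  # True: the arbitrary policy comes out "coherent"
```

Because each off-trajectory sibling gets a near-zero value, the chosen successor is always maximal, so this check passes for any deterministic policy on the tree.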
Importantly, using the common definition of language models in an RL setting, where each state represents a sequence of tokens and each action appends a token to a sequence of length $n$ to produce a sequence of length $n+1$, the environment is a deterministic forest, as there is only one way to "go between" two sequences (if one is a prefix of the other, choose the remaining tokens in order). Thus, any language model is coherent, which seems unsatisfying. We could try using a different environment, but this risks losing stochasticity (as the output logits of an LM are determined by its input sequence) and gets complicated pretty quickly (use natural abstractions/world models as states?).
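The forest claim is easy to verify by brute force on a toy vocabulary: treating states as token tuples and actions as single-token appends, there is at most one directed path between any two states. This is just an illustrative check with made-up names (`VOCAB`, `paths_between`), not code from any experiment:

```python
from itertools import product

VOCAB = ("a", "b")
MAX_LEN = 4

def successors(state):
    """Each action appends one token, so a length-n state has |VOCAB| successors of length n+1."""
    if len(state) >= MAX_LEN:
        return []
    return [state + (t,) for t in VOCAB]

def paths_between(src, dst):
    """Count directed paths from src to dst; in a forest this is at most 1."""
    if src == dst:
        return 1
    return sum(paths_between(nxt, dst) for nxt in successors(src))

states = [tuple(p) for n in range(MAX_LEN + 1) for p in product(VOCAB, repeat=n)]
counts = {paths_between(s, t) for s in states for t in states}
print(counts <= {0, 1})  # True: never more than one path between any two sequences
```

A path from one sequence to another exists exactly when the first is a prefix of the second, and then it is unique, since the only available move at each step is appending the next required token.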
Right, I think this roughly corresponds to how long it takes a policy to reach a stable loop (the "distance to loop" metric), which we used in our experiments.
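I don't know the exact definition used in those experiments, but a minimal sketch of one natural "distance to loop" metric for a deterministic policy on a finite state space might look like the following (`distance_to_loop` and the toy chain are my own hypothetical example):

```python
def distance_to_loop(policy, s0, max_steps=10_000):
    """Number of steps a deterministic policy takes from s0 before it first
    revisits a state, i.e. before entering its stable loop."""
    seen = {s0: 0}
    s, step = s0, 0
    while step < max_steps:
        s = policy(s)
        step += 1
        if s in seen:
            return seen[s]  # distance from s0 to the point where the loop begins
        seen[s] = step
    raise RuntimeError("no loop reached within max_steps")

# Toy chain: 0 -> 1 -> 2 -> 3 -> 2 -> ... (a tail of length 2, then a 2-cycle)
step_policy = {0: 1, 1: 2, 2: 3, 3: 2}.get
print(distance_to_loop(step_policy, 0))  # 2
```

Since the state space is finite and the policy is deterministic, the trajectory must eventually revisit a state, so the metric is well defined for any starting state.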
What did you use your coherence definition for?