As a quick summary (read the paper and sequence if you want more details), they show that for any distribution over reward functions, if there are more “options” available after action 1 than after action 2, then most of the orbit of the distribution (the set of distributions induced by applying any permutation on the MDP, which thus permutes the initial distribution) has optimal policies that do action 1.
Also, this claim omits the “disjoint requirement”, so it is incorrect even without the “they show that” part. That is, the problem is not just that the theorems in the paper fail to show what is being claimed; the claim itself is false. Consider the following example, where action 1 leads to more “options” but most optimal policies choose action 2: