Also, suppose we have two hypotheses, H1 and H2, and a policy π whose expected value is much lower than the BATNA under both of them, so that both terms in the product are negative. Then the total product is positive (and large), and the argmax is going to choose this policy, even though it is strictly worse than the BATNA.
But I guess both of those issues can be easily assumed away.
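
To make the sign problem concrete, here is a minimal sketch. It assumes the objective is the product, over hypotheses, of (expected value of π under that hypothesis minus the BATNA value under it); that exact form, the numbers, and the names `nash_product` and `guarded_choice` are made up for illustration. The guard at the end is one possible way to "assume the issue away" (only consider policies that are at least as good as the BATNA under every hypothesis), not necessarily the intended one.

```python
from math import prod

def nash_product(values, batna):
    """Product over hypotheses of (expected value - BATNA value)."""
    return prod(v - b for v, b in zip(values, batna))

# Hypothetical numbers: BATNA is worth 0 under both H1 and H2.
batna = [0.0, 0.0]
policies = {
    "modest_gain": [1.0, 2.0],      # gains +1 and +2, product = 2
    "much_worse": [-10.0, -10.0],   # gains -10 and -10, product = 100
}

# Naive argmax over the raw product picks the policy that loses
# under both hypotheses, because the two negative terms multiply
# to a large positive number.
scores = {name: nash_product(v, batna) for name, v in policies.items()}
naive_choice = max(scores, key=scores.get)

def guarded_choice(policies, batna):
    """Argmax restricted to policies no worse than BATNA under every hypothesis."""
    feasible = {
        name: v for name, v in policies.items()
        if all(x >= b for x, b in zip(v, batna))
    }
    return max(feasible, key=lambda name: nash_product(feasible[name], batna))

print(naive_choice)
print(guarded_choice(policies, batna))
```

With an odd number of hypotheses the same all-negative policy would get a negative product instead, so the failure as shown is specific to an even number of negative terms.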