I thought about these things in writing this, but I’ll have to think about them again before making a full reply.
We could modify the epsilon exploration assumption so that the agent also chooses between a and a′ even while its top choice is b′. That is, there’s a lower bound on the probability with which the agent takes an action in {a,a′}, but even if that bound is achieved, the agent still has some flexibility in distributing probability between a and a′.
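To make the modified assumption concrete, here is a minimal sketch of what such a policy might look like in the case where the bound is exactly achieved. The function name, the three-action setup, and the parameter `frac_a` are my own illustrative assumptions, not part of the original setup; `a_prime` and `b_prime` stand in for a′ and b′.

```python
def policy_with_split(eps, frac_a):
    """Hypothetical policy: b' is the top choice, and exactly eps of
    probability mass is forced onto {a, a'}. The agent still chooses
    frac_a, the share of that forced mass that goes to a.
    """
    if not 0.0 <= eps <= 1.0:
        raise ValueError("eps must be in [0, 1]")
    if not 0.0 <= frac_a <= 1.0:
        raise ValueError("frac_a must be in [0, 1]")
    return {
        "a": eps * frac_a,                # agent-chosen share of the forced mass
        "a_prime": eps * (1.0 - frac_a),  # remainder of the forced mass
        "b_prime": 1.0 - eps,             # top choice keeps everything else
    }


policy = policy_with_split(eps=0.1, frac_a=0.25)
```

The point of the sketch is only that `frac_a` stays a free choice for the agent even though the total mass on {a, a′} is pinned at its lower bound.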
Another similar scenario would be: we assume that a sub-optimal action gets only a small probability, and that this probability is smaller the worse the action is.
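One way to instantiate this second scenario is a Boltzmann-style (softmax) rule, where an action’s probability decays exponentially in how far its utility falls below the best one. This is a sketch under my own assumptions rather than the definition intended above: the function name, the `temperature` parameter, and the example utilities are all hypothetical.

```python
import math


def graded_exploration(utilities, temperature=0.1):
    """Hypothetical graded-exploration rule: probability is proportional to
    exp(-(gap to best utility) / temperature), so worse actions get
    strictly less probability, and all sub-optimal actions get little
    when temperature is small.
    """
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    best = max(utilities.values())
    # Subtracting the best utility keeps every exponent <= 0 (no overflow).
    weights = {
        action: math.exp((u - best) / temperature)
        for action, u in utilities.items()
    }
    total = sum(weights.values())
    return {action: w / total for action, w in weights.items()}


probs = graded_exploration({"x": 1.0, "y": 0.5, "z": 0.0})
```

With a small `temperature`, nearly all of the mass sits on the best action, while the ordering among the sub-optimal actions still tracks how bad each one is.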