Recall that a multi-armed bandit is an MDP with one state. Similarly, supra-bandits are a special case of supra-MDPs. In particular, a supra-bandit can be described by a supra-MDP whose state encodes the loss obtained from pulling the most recent arm. In this case the transition kernel, which then takes the form $T(s' \mid a)$, depends only on the action. The loss also depends only on the action.
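In case it helps to see the classical fact spelled out, here is a minimal sketch in standard MDP notation (the symbols $\mathcal{S}$, $\mathcal{A}$, $T$, and $\ell$ are generic placeholders, not notation fixed by the post):

```latex
% Sketch: a one-state MDP is exactly a multi-armed bandit. With a single
% state s_0, the transition kernel is forced to be trivial and the loss
% collapses to a function of the action alone:
\mathcal{S} = \{s_0\}, \qquad
T(s_0 \mid s_0, a) = 1 \ \ \forall a \in \mathcal{A}, \qquad
\ell(s_0, a) = \ell(a).
```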
I'm a bit confused about how the loss is incurred. For example, in the classical case, suppose we have a lever that gives a 50/50 chance of +1/-1 loss. Then our loss doesn't just depend on the action. It seems like you'd have to do the same trick here, where you encode the lever's result as the state, and then have pulling the lever stochastically transition you between states.
For both that case and the supra-bandit case, it seems like the loss that you incur on a transition should really be the loss from the previous lever's result. That is, say you pull the lever and get the good outcome. Then you pull again and transition to the bad outcome, but you're charged the loss of the good outcome from the previous turn. Then, when you pull the lever again, you'll be charged the loss of the bad outcome, and so on.
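Here's a tiny simulation of that construction (a sketch under my reading, not code from the post; the lever and the name `LEVER_OUTCOMES` are just the hypothetical 50/50 example from above):

```python
import random

# The state records the result of the most recent pull, and the loss
# charged at each step is read off the *previous* state, so losses
# arrive one step late.

LEVER_OUTCOMES = [+1, -1]  # the 50/50 lever: +1 loss or -1 loss

def step(state):
    """One pull of the lever.

    `state` is the loss encoded by the previous pull. We charge that
    loss now, then transition stochastically to a fresh outcome.
    """
    incurred_loss = state                        # loss from the previous pull
    next_state = random.choice(LEVER_OUTCOMES)   # stochastic transition
    return next_state, incurred_loss

state = 0  # initial state: no pull yet, so the first charged loss is 0
total = 0
for t in range(10):
    state, loss = step(state)
    total += loss
print("total loss over 10 pulls:", total)
```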
This does seem to make your time discounting run one step slow, which you'd either want to correct for or just leave as is.
Shifting the losses by one time step doesn't really matter, since we're mostly interested in the shape of the regret bound, which (up to mild changes in the constants) is not affected by this.
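To spell out why (a quick check, assuming a discount factor $\gamma \in (0,1)$ and bounded per-step losses $\ell_t$, neither of which is fixed in the thread):

```latex
% Delaying every loss by one step only rescales the discounted sum:
\sum_{t \ge 0} \gamma^{t}\,\ell_{t+1}
  \;=\; \gamma^{-1}\sum_{t \ge 1} \gamma^{t}\,\ell_{t}
  \;=\; \gamma^{-1}\Big(\sum_{t \ge 0} \gamma^{t}\,\ell_{t} \;-\; \ell_{0}\Big),
% i.e. the shifted objective differs from the original by a factor of
% 1/gamma and an additive \ell_0, so regret bounds change only in
% their constants.
```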