If exploration is a hack, then why do pretty much all multi-armed bandit algorithms rely on exploration into suboptimal outcomes to prevent spurious underestimates of the value associated with a lever?
The multi-armed bandit problem is a many-round problem in which actions in early rounds provide information that is useful for later rounds, so it makes sense to explore to gain this information. That’s different from using exploration in one-shot problems to make the counterfactuals well-defined, which is a hack.
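To make the many-round point concrete, here is a minimal sketch of plain epsilon-greedy on a two-armed Bernoulli bandit. It is an illustration under stated assumptions, not a reference implementation: the function name `run_bandit`, the arm means 0.5 and 0.9, epsilon = 0.1, and the starting state in which the better arm has already produced one unlucky reward of 0 are all hypothetical choices made for the example.

```python
import random


def run_bandit(true_means, epsilon, rounds, rng, initial_estimates):
    """Epsilon-greedy on Bernoulli arms (illustrative sketch).

    Each initial estimate is treated as a single prior observation.
    Returns (estimates, pull_counts).
    """
    estimates = list(initial_estimates)
    counts = [1] * len(true_means)
    for _ in range(rounds):
        if rng.random() < epsilon:
            # Explore: pick an arm uniformly at random.
            arm = rng.randrange(len(true_means))
        else:
            # Exploit: pick the arm with the highest current estimate.
            arm = max(range(len(true_means)), key=lambda a: estimates[a])
        reward = 1.0 if rng.random() < true_means[arm] else 0.0
        counts[arm] += 1
        # Incremental update of the running mean for the pulled arm.
        estimates[arm] += (reward - estimates[arm]) / counts[arm]
    return estimates, counts


# Assumed setup: arm 1 is truly better, but its one observation so far was 0.
true_means = [0.5, 0.9]
unlucky_start = [0.5, 0.0]

greedy_estimates, greedy_counts = run_bandit(
    true_means, 0.0, 5000, random.Random(0), unlucky_start
)
explore_estimates, explore_counts = run_bandit(
    true_means, 0.1, 5000, random.Random(0), unlucky_start
)

print("greedy:  estimates", greedy_estimates, "pulls", greedy_counts)
print("explore: estimates", explore_estimates, "pulls", explore_counts)
```

With epsilon = 0, arm 0's estimate always stays above zero (it is the average of 0.5 and nonnegative rewards), so arm 1 is never pulled again and its spurious estimate of 0 is never corrected. With epsilon = 0.1, the occasional random pulls give arm 1 a chance to recover, and that information pays off in the later rounds. In a one-shot problem there are no later rounds for the information to pay off in, which is the distinction being drawn above.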