The most prominent undesirable feature of this is that it’s restricted to a finite set of embedders. Optimistic choice fails very badly on an infinite set of embedders, because we can consider an infinite sequence of embedders of the form “pressing the button dispenses 1 utility forever”, “pressing the button delivers an electric shock the first time, then dispenses 1 utility forever”, … “pressing the button delivers an electric shock for the first n presses, then dispenses 1 utility forever”, … all the way to “pressing the button always delivers an electric shock”. Optimistic choice among these embedders keeps pressing the button: although the agent keeps getting shocked, there is always some embedder promising that its fortunes will turn around on the very next press.
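A toy simulation of the failure mode described above, with hypothetical embedders indexed by n (embedder n predicts shocks on the first n presses, then 1 utility forever). The reward values, horizon, and embedder parameterization here are illustrative assumptions, not part of any particular formalism:

```python
# Toy demo: optimistic choice over infinitely many embedders keeps
# pressing the button despite being shocked every single time.
# Embedder n hypothesizes: shocks for the first n presses, then +1 forever.

SHOCK = -1.0
REWARD = 1.0
HORIZON = 50  # how far ahead the agent evaluates each hypothesis

def promised_value(n, presses_so_far, horizon=HORIZON):
    """Value embedder n promises for continuing to press, given the press count."""
    return sum(SHOCK if t < n else REWARD
               for t in range(presses_so_far, presses_so_far + horizon))

def optimistic_press_value(presses_so_far):
    # The supremum over all embedders is attained by any n <= presses_so_far:
    # some embedder always claims the shocks are already over.
    return promised_value(presses_so_far, presses_so_far)

presses = 0
shocks = 0
for step in range(20):
    if optimistic_press_value(presses) > 0:  # optimism says: press
        presses += 1
        shocks += 1  # reality: the "always shock" embedder is the true one
print(presses, shocks)  # → 20 20: pressed every step, shocked every time
```

No matter how many shocks accumulate, the optimistic value of pressing never drops, because a fresh embedder always stands ready to explain the past shocks away.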
Seems like infinite bandit algorithms should work here? Basically just give more complex embedders some regularization penalty.
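A minimal sketch of the regularization idea, under illustrative assumptions: embedder n is as in the shock example above, its complexity penalty is taken (arbitrarily) to be linear in n, and embedders falsified by the shock history are discarded. None of these choices come from a specific bandit algorithm; they just show the intended effect:

```python
# Penalized optimism: each embedder's promised value is discounted by a
# complexity penalty, so ever-more-complex "it'll turn around next press"
# embedders eventually cost more than they promise.

SHOCK, REWARD = -1.0, 1.0
HORIZON = 10

def predicted_value(n, presses_so_far):
    # Embedder n: shocks on the first n presses, then +1 forever.
    return sum(SHOCK if t < n else REWARD
               for t in range(presses_so_far, presses_so_far + HORIZON))

def consistent(n, history):
    # Embedder n predicts a shock on press t iff t < n.
    return all((t < n) == shocked for t, shocked in enumerate(history))

def penalized_optimism(history, penalty=lambda n: 0.5 * n, max_n=1000):
    presses = len(history)
    return max(predicted_value(n, presses) - penalty(n)
               for n in range(max_n + 1) if consistent(n, history))

history = []        # reality: every press is a shock
steps_pressed = 0
for step in range(100):
    if penalized_optimism(history) > 0:
        history.append(True)
        steps_pressed += 1
    else:
        break
print(steps_pressed)  # → 20: after 20 shocks the agent gives up
```

After k shocks, every surviving embedder has n ≥ k, so its penalty is at least penalty(k); once that exceeds the horizon's promised reward, the optimistic value of pressing goes nonpositive and the agent stops. This is the sense in which a complexity penalty tames the infinite sequence of increasingly convenient embedders.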