Predictable Exploration

An idea I’ve been kicking around for a while without getting very far:

The big problem with epsilon-exploration as a way of getting reasonable counterfactuals is that it only tells the agent what would happen if it took an action unpredictably, rather than predictably; as a result, it doesn’t tend to do the best thing in problems where what you are predicted to do is quite important, such as Newcomb’s problem and game-theoretic problems.

A very direct-seeming approach to this problem is to try and explore, but in a predictable manner. This allows you to find out what the environment does if you take some action reliably.

The obvious way to go about this is to do epsilon-exploration, but with a very predictable pseudorandom source rather than the usual hard-to-predict one (a rough sketch of this follows the list below). There are two problems:

  1. If the logical inductor itself can easily predict the exploration, then it doesn’t actually help get good counterfactuals: the agent can see what action will be taken, so conditional expectations of other actions aren’t guaranteed to be well-defined or high-quality.

  2. Even if this did create well-behaved counterfactuals, it doesn’t seem like it does the right thing in game-theoretic situations. If the other players can see what action you’ll take, then they may simply exploit you. You could do the right thing in Newcomblike situations, where you just have to realize that predictably doing a certain thing is good; but in Prisoner’s Dilemma, predictably cooperating seems to just leave you open to defection, so you’d likely learn to defect.
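To make the proposal concrete, here is a minimal sketch of epsilon-exploration whose “randomness” is a fixed, easily computable function of the round number. The hash-based source and all of the names here are my own illustration, not part of the original idea; any deterministic source the inductor can learn to compute would do.

```python
import hashlib

def predictable_epsilon_exploration(round_number, actions, best_action, epsilon=0.05):
    """Epsilon-exploration driven by a deterministic, easily computable source.

    Because the exploration bits are a simple function of the round number,
    the environment (and the logical inductor itself) can come to predict
    exactly when and how this agent explores.
    """
    # A deterministic "random" number in [0, 1) derived from the round number.
    digest = hashlib.sha256(str(round_number).encode()).digest()
    u = int.from_bytes(digest[:8], "big") / 2 ** 64

    if u < epsilon:
        # Exploration round: the explored action is also a fixed function of the round.
        index = int.from_bytes(digest[8:16], "big") % len(actions)
        return actions[index]
    # Otherwise, take whatever action the agent currently estimates to be best.
    return best_action
```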

Policy Exploration

My proposed fix for #2 is to explore over some notion of policy, rather than directly over actions. In general, a policy is a function from observations to actions—we might interpret this as a function from the deductive state to actions. That’s pretty big and intractable for a logical inductor to consider, though. In simultaneous-move games, a reasonable notion of policy is a function from your prediction of the other player’s actions to your own action probabilities. For example, in Prisoner’s Dilemma, NicerBot is the policy of cooperating with probability epsilon greater than your estimate of the other player’s probability of cooperating. Copying the other player’s probability of cooperation incentivises them to cooperate when playing with you. Adding epsilon makes you slightly more exploitable than you would otherwise be, but it encourages cooperative equilibria over noncooperative ones; for example, it ensures that you cooperate with other NicerBots rather than defect.
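As a concrete sketch, NicerBot can be written as a one-line function from the estimated probability that the opponent cooperates to our own cooperation probability (the particular epsilon value is illustrative):

```python
def nicer_bot(estimated_opponent_cooperation, epsilon=0.01):
    """Cooperate with probability epsilon greater than the estimated
    probability that the other player cooperates (capped at 1)."""
    return min(1.0, estimated_opponent_cooperation + epsilon)
```

If two NicerBots play each other with accurate estimates, the only consistent pair of cooperation probabilities is (1, 1): each tries to be epsilon nicer than the other, which pushes both estimates all the way up to full cooperation.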

If we could predictably explore into policies, we would start out occasionally being NicerBot. Whenever exploration picks NicerBot, the other player would be able to see this; the other player should then cooperate, since that’s the best thing you can do when playing with NicerBot. We then learn that things turn out pretty well for us if we’re NicerBot. We then start being NicerBot on non-exploration rounds as well.

There are a couple of holes in that story. For one thing, LIEDT and LICDT will defect against NicerBot. So, it’s not true that a wide variety of opponents would cooperate with NicerBot. Perhaps we’re OK with that because we only expect to be able to cooperate with more game-theoretically savvy players. But, a really savvy player will also defect against NicerBot, and hold out for NicerBot++ who cooperates with much higher probability than the player. The savvy player recognizes that it can teach us which policies to use by defecting and cooperating selectively, rather than myopically doing what gets the most utility on a single round.

But, perhaps we have to set those problems aside and settle for incremental progress. It still seems like this approach would solve some things we don’t currently know how to solve, if we could solve problem #1.

Unupdated Exploration

I can solve problem #1 in a somewhat unprincipled way which relates to the problem of logical updatelessness.

Suppose that we make decisions from the perspective of a logical inductor which has been run for less time. We could make our decisions at time $n$ from the perspective of an $f(n)$ market, where $f(n)$ is much smaller than $n$ (say $\sqrt{n}$, or even $\log(n)$). We still use epsilon-exploration, as usual, to make the counterfactuals well-behaved. However, from the perspective of other agents running for time $n$, the exploration generated by an $f(n)$ market will be predictable.

By itself, this isn’t very interesting, because it just says to run the market for less time. Running the market for less time may help make the agent more predictable, but it also makes the agent less intelligent.

When we combine it with the suggestion of exploring on policies, though, it suggests a two-stage approach to decisions. First, we run the market for time $f(n)$ ($\sqrt{n}$, or $\log(n)$, etc.), and make a decision about policy. The policy determines the action based on the market state at time $n$. Then, we compute the market at time $n$, and use the policy determined previously to choose the action.
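Here is a minimal sketch of that two-stage procedure, assuming a hypothetical logical-inductor interface whose run(steps) method returns the market state after the given number of steps; all of the names are illustrative, not an established API.

```python
def two_stage_decision(inductor, n, f, choose_policy):
    """Choose a policy from an early, predictable market state, then let that
    policy act on the later, smarter market state.

    inductor      -- assumed interface: inductor.run(steps) returns the market
                     state after the given number of steps
    n             -- how long the full market is run
    f             -- maps n to a much smaller run time, e.g. f = lambda n: int(n ** 0.5)
    choose_policy -- maps the early market state to a policy, where a policy maps
                     a market state to an action; this is the stage at which the
                     (predictable) epsilon-exploration over policies would happen
    """
    early_market = inductor.run(f(n))      # stage 1: weak but predictable beliefs
    policy = choose_policy(early_market)   # commit to a policy at time f(n)

    late_market = inductor.run(n)          # stage 2: the full market state at time n
    return policy(late_market)             # the committed policy picks the action
```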

This is essentially a hackish version of logical updatelessness. We run a logical inductor for a short time, and use it to decide how to respond to a logical inductor which has been run for much longer. We hope that the first inductor has been run for long enough that its policy choice is sensible.

It’s not difficult to imagine generalizing this to a multi-stage approach, in which you decide how to decide how to [...] how to decide. This would, perhaps, help deal with the problem of savvy agents mentioned earlier. But, such a hierarchy would still have to deal with the question: what’s the earliest stage to which we are OK giving power? We have to defer some judgement to early stages without losing everything to the madness of the early stages. We want to make decisions at the earliest stage where they can be sensibly made.
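The multi-stage version might look like the following sketch, in which each stage’s output is the rule the next stage will use and only the final stage actually picks an action. The interface is assumed as before, and nothing in the code answers the question of how early is too early.

```python
def multi_stage_decision(inductor, run_times, earliest_decider):
    """run_times: increasing run lengths t_1 < t_2 < ... < t_k.

    earliest_decider maps the t_1 market state to the rule used at t_2; that
    rule maps the t_2 market state to the rule used at t_3; and so on, until
    the final rule maps the t_k market state to an action.
    """
    decider = earliest_decider
    for t in run_times[:-1]:
        market = inductor.run(t)
        decider = decider(market)        # each stage decides how the next stage decides
    final_market = inductor.run(run_times[-1])
    return decider(final_market)         # the last stage finally chooses an action
```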

It’s not clear what “sensible” should mean.

Policy Representation

I mentioned earlier that the market state can’t properly reason about the full space of functions from market states to actions. This is even more true of earlier market states trying to use later ones, and even more true if we have any kind of iterated hierarchy of meta-policy decisions.

It seems, however, like there can be a lot to gain just from choosing a small number of key beliefs to make policies about. So, there should be some sensibly small policy representation which is still useful.
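For instance (purely as illustration), the policy space an early market explores over might consist of simple thresholds on the prices of a few key sentences, rather than arbitrary functions of the whole deductive state:

```python
def threshold_policy(key_sentence, threshold, action_if_high, action_if_low):
    """A tiny policy family: condition only on the market price of one key sentence.

    Assumes market.price(sentence) returns the current price in [0, 1];
    the interface and the sentence names are hypothetical.
    """
    def policy(market):
        price = market.price(key_sentence)
        return action_if_high if price >= threshold else action_if_low
    return policy

# A small, explicitly enumerable policy space for the Prisoner's Dilemma might be:
# policies = [threshold_policy("opponent cooperates", t, "C", "D")
#             for t in (0.25, 0.5, 0.75)]
```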

Again, it’s not clear what “sensible” should mean.

(I think Jessica had some thoughts on this a while ago, but I don’t remember what they were.)