# Predictable Exploration

An idea I’ve been kicking around for a while without getting very far:

The big problem with epsilon-exploration as a way of getting reasonable counterfactuals is that it only tells the agent what would happen if it took an action unpredictably, rather than predictably; as a result, it doesn’t tend to do the best thing in problems where what you are predicted to do is quite important, such as Newcomb’s problem and game-theoretic problems.

A very direct-seeming approach to this problem is to try and explore, but in a predictable manner. This allows you to find out what the environment does if you take some action reliably.

The obvious way to go about this is to do epsilon-exploration, but with a very predictable pseudorandom source rather than the usual. There are two problems:

1. If the logical inductor itself can easily predict the exploration, then it doesn’t actually help get good counterfactuals: the agent can see what action will be taken, so conditional expectations of other actions aren’t guaranteed to be well-defined or high-quality.

2. Even if this did create well-behaved counterfactuals, it doesn’t seem like it does the right thing in game-theoretic situations. If the other players can see what action you’ll take, then they may simply exploit you. You could do the right thing in Newcomblike situations, where you just have to realize that predictably doing a certain thing is good; but in Prisoner’s Dilemma, predictably cooperating seems to just leave you open to defection, so you’d likely learn to defect.

## Policy Exploration

My proposed fix for #2 is to explore over some notion of policy, rather than directly over actions. In general, a policy is a function from observations to actions—we might interpret this as a function from the deductive state to actions. That’s pretty big and intractable for a logical inductor to consider, though. In simultaneous-move games, a reasonable notion of policy is a function from your prediction of the other player’s actions to your action probability. For example, in Prisoner’s Dilemma, NicerBot is the policy of cooperating with probability epsilon greater than your estimate of the other player’s probability of cooperating. Copying the other player’s probability of cooperation incentivises them to cooperate when playing with you. Adding epsilon makes you slightly more exploitable than you would otherwise be, but it encourages cooperative equilibria over noncooperative; for example, it ensures that you cooperate with other NicerBots rather than defect.

If we could predictably explore into policies, we would start out occasionally being NicerBot. Whenever exploration picks NicerBot, the other player would be able to see this; the other player should then cooperate, since that’s the best thing you can do when playing with NicerBot. We then learn that things turn out pretty well for us if we’re NicerBot. We then start being NicerBot on non-exploration rounds as well.

There are a couple of holes in that story. For one thing, LIEDT and LICDT will defect against NicerBot. So, it’s not true that a wide variety of opponents would cooperate with NicerBot. Perhaps we’re OK with that because we only expect to be able to cooperate with more game-theoretically savvy players. But, a really savvy player will also defect against NicerBot, and hold out for NicerBot++ who cooperates with much higher probability than the player. The savvy player recognizes that it can teach us which policies to use by defecting and cooperating selectively, rather than myopically doing what gets the most utility on a single round.

But, perhaps we have to set those problems aside and settle for incremental progress. It still seems like this approach would solve some things we don’t currently know how to solve, if we could solve problem #1.

## Unupdated Exploration

I can solve problem #1 in a somewhat unprincipled way which relates to the problem of logical updatelessness.

Suppose that we make decisions from the perspective of a logical inductor which has been run for less time. We could make our decisions at from the perspective of an or even a market. We still use epsilon-exploration, as usual, to make the counterfactuals well-behaved. However, from the perspective of other agents running for time, the exploration generated by a market will be predictable.

By itself, this isn’t very interesting, because it just says to run the market for less time. Running the market for less time may help make the agent more predictable, but it also makes the agent less intelligent.

When we combine it with the suggestion of exploring on policies, though, it suggests a two-stage approach to decisions. First, we run the market for (or or ect), and make a decision about policy. The policy determines the action based on market state at . Then, we compute market , and use the policy determined previously to choose the action.

This is essentially a hackish version of logical updatelessness. We run a logical inductor for a short time, and use it to decide how to respond to a logical inductor which has been run for much longer. We hope that the first inductor has been run for long enough that its policy choice is sensible.

It’s not difficult to imagine generalizing this to a multi-stage approach, in which you decide how to decide how to [...] how to decide. This would, perhaps, help deal with the problem of savvy agents mentioned earlier. But, such a hierarchy would still have to deal with the question: what’s the earliest stage which we are OK giving power? We have to defer some judgement to early stages without losing everything to the madness of the early stages. We want to make decisions at the earliest stage where they can be sensibly made.

It’s not clear what “sensible” should mean.

## Policy Representation

I mentioned earlier that the market state can’t properly reason about the full space of functions from market states to actions. This is even more true of earlier market states trying to use later ones, and even more true if we have any kind of iterated hierarchy of meta-policy decisions.

It seems, however, like there can be a lot to gain just from choosing a small number of key beliefs to make policies about. So, there should be some sensibly small policy representation which is still useful.

Again, it’s not clear what “sensible” should mean.

(I think Jessica had some thoughts on this a while ago, but I don’t remember what they were.)

• Hm, I got the same result from a different direction.

It’s possible to view a policy of the form “I’ll compute X and respond based on what X outputs” as… tying your output to X, in a sense. Logical link formation, if you will.

And policies of the form “I’ll compute X and respond in a way that makes that output of X impossible/​improbable” (can’t always do this) correspond to logical link cutting.

And with this, we see what the chicken rule in MUDT/​exploration in LIDT is doing. It’s systematically cutting all the logical links it can, and going “well, if the statement remains correlated with me despite me trying my best to shake off anything that predicts me too well, I guess I “cause” it.”

But some potentially-useful links were cut by this process, such as “having short abstract reasoning available that lets others predict what you will do” (a partner in a prisoner’s dilemma, the troll in troll bridge, etc..)

At the same time, some links should be cut by a policy that diagonalizes against predictions/​calls upon an unpredictable process (anything that can be used to predict your behavior in matching pennies, evading Death when Death can’t crack your random number generator, etc...)

So I wound up with “predictable policy selection that forms links to stuff that would be useful to correlate with yourself, and cuts links to stuff that would be detrimental to have correlated with yourself”.

Predictably choosing an easy-to-predict policy is easy-to-predict, predictably choosing a hard-to-predict policy is hard-to-predict.

This runs directly into problem 1 of “how do you make sure you have good counterfactuals of what would happen if you had a certain pattern of logical links, if you aren’t acting unpredictably”, and maybe some other problems as well, but it feels philosophically appealing.

• Thinking about this more, I think there’s an important disanalogy between trying to make policy decisions with earlier market states vs smaller proof-searches.

In Agent Simulates Predictor, we can use an earlier market state to decide our policy, because the earlier market state can trust the predictor to make the right predictions, even if the predictor is using a more powerful logic (since logical inductors can learn to boundedly trust more powerful logics).

However, with proof-based DTs, no analogous move is possible.

Consider a version of Agent Simulates Predictor in which Omega searches for a proof that you one-box in PA+Con(PA); if one is found, Omega fills the $1m box. Otherwise, not. Omega has time to think. The agent has time to think, . The agent reasons in PA. If the agent refused to use all its time, and only ran for time, but still had enough time to find interesting proofs, then it could reason as follows: “If I one-box, then there is a short proof that I one-box which Omega can find. So I get$1M.” It may not know if PA+Con(PA) is sound, but that doesn’t matter; the agent just has to ensure that there is a proof which Omega will find. It wouldn’t find any proofs leading to higher utility that this, so it would one-box and get $1M. Unfortunately, I don’t see any way to harness the shorter proof-search to choose a policy which would get the$1M in this case but choose to think longer in other cases where that’s beneficial.

We might want the agent to reason: “If I stop and one-box right now, Omega will be able to prove that I one-box, and I’ll get $1M. If I wait longer, Omega won’t be able to prove what I do, so I’ll at most be able to get$100. So, I’ll stop now and one-box.” However, this reasoning would have to take place at a proof-length in which several things hold at once:

• The agent can prove that it’s still “early” enough that its action would be provable to Omega if it acted now.

• It’s “late” enough that the agent can see that Omega’s predictions are sound (IE, it can check that Omega doesn’t reach false results in the limited time it has). This allows the agent to see that it’ll never get money from both boxes.

It seems very unlikely that there is a proof length where these can both be true, due to bounded Löb.

For logical induction, on the other hand, there’s quite likely to be a window with analogous properties.

• So I wound up with “predictable policy selection that forms links to stuff that would be useful to correlate with yourself, and cuts links to stuff that would be detrimental to have correlated with yourself”.

Agreed!

I’m reading this as “You want to make decisions as early as you can, because when you decide one of the things you can do is decide to put the decision off for later; but when you make a decision later, you can’t decide to put it earlier.”

And “logical time” here determines whether others can see your move when they decide to make theirs. You place yourself upstream of more things if you think less before deciding.

This runs directly into problem 1 of “how do you make sure you have good counterfactuals of what would happen if you had a certain pattern of logical links, if you aren’t acting unpredictably”, and maybe some other problems as well, but it feels philosophically appealing.

Here’s where I’m saying “just use the chicken rule again, in this stepped-back reasoning”. It likely re-introduces versions the same problems at the higher level, but perhaps iterating this process as many times as we can afford is in some sense the best we can do.

• If the other players can see what action you’ll take, then they may simply exploit you.

Isn’t this a variant of the “agent simulates predictor” problem (with you playing the role of the predictor)? Thus any agent capable of exploiting you has to prove to you that it won’t, in order to get anything from you. That’s kind of what happens with your Nicerbots; even if perfectly predictable, they’re not really exploitable in any strong sense (they won’t cooperate with a defector).

• I think the point I was making here was a bit less clear than I wanted it to be. I was saying that, if you use predictable exploration on actions rather than policies, then you only get to see what happens when you predictably take a certain action. This is good for learning pure equilibria in games, but doesn’t give information which would help the agent reach the right mixed equilibria when randomized actions should be preferred; and indeed, it doesn’t seem like such an agent would reach the right mixed equilibria.

I believe the “predictable exploration on policies” approach solves agent-simulates-predictor just fine, along with other problems (including counterfactual mugging) which require “some degree of updatelessness” without requiring the full reflective stability which we want from updatelessness.