Evidential Correlations are Subjective, and it might be a problem

I explain (in layman’s terms) a realization that might make acausal trade hard or impossible in practice.

Summary: We know that if players believe different Evidential Correlations, they might miscoordinate. But clearly they will eventually learn to have the correct Evidential Correlations, right? Not necessarily, because there is no objective notion of correct here (in the way that there is for math or physics). Thus, selection pressures might be much weaker, and different agents might systematically converge on different ways of assigning Evidential Correlations.

Epistemic status: Confident that this realization is true, but the quantitative question of exactly how weak the selection pressures are remains open.

What are Evidential Correlations, really?

Skippable if you know the answer to the question.

Alice and Bob are playing a Prisoner’s Dilemma, and they know each other’s algorithms: Alice.source and Bob.source.[1] Since their algorithms are approximately equally complex, neither of them can easily assess what the other will output. Alice might notice something like “hmm, Bob.source seems to default to Defection when it throws an exception, so this should update me slightly in the direction of Bob Defecting”. But she doesn’t know exactly how often Bob.source throws an exception, or what it does when that doesn’t happen.

Imagine, though, Alice notices Alice.source and Bob.source are pretty similar in some relevant ways (maybe the overall logical structure seems very close, or the depth of the for loops is the same, or she learns the training algorithm that shaped them is the same one). She’s still uncertain about what either of these two algorithms outputs[2], but this updates her in the direction of “both algorithms outputting the same action”.

If Alice implements/​endorses Evidential Decision Theory, she will reason as follows:

Conditional on Alice.source outputting Defect, it seems very likely Bob.source also outputs Defect, thus my payoff will be low.
But conditional on Alice.source outputting Cooperate, it seems very likely Bob.source also outputs Cooperate, thus my payoff will be high.
So I (Alice) should output Cooperate, thus (very probably) obtain a high payoff.

To the extent Alice’s belief about similarity was justified, it seems like she will perform pretty well in these situations (obtaining high payoffs). When you take this reasoning to the extreme, maybe both Alice and Bob are aware that they both know this kind of cooperation bootstrapping is possible (if they both believe they are similar enough), and thus (even if they are causally disconnected, and just simulating each other’s code) they can coordinate on some pretty complex trades. This is Evidential Cooperation in Large worlds.

But wait a second: How could this happen, without them being causally connected? What was this mysterious similarity, this spooky correlation at a distance, that allowed them to create cooperation from thin air?

Well, in the words of Daniel Kokotajlo: it’s just your credences, bro!

The bit required for this to work is that they believe that “it is very likely we both output the same thing”. Said another way, they have high probability on the possible worlds “Alice.source = C, Bob.source = C” and “Alice.source = D, Bob.source = D”, but low probability on the possible worlds “Alice.source = C, Bob.source = D” and “Alice.source = D, Bob.source = C”.
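As a minimal sketch of the Evidentialist reasoning above: the payoff numbers and credences below are made up for illustration (they are not from any real analysis), and the function names are hypothetical. The only thing the sketch relies on is that Alice puts most of her probability on the two “same action” worlds.

```python
# Hypothetical Prisoner's Dilemma payoffs for Alice:
# (alice_action, bob_action) -> Alice's payoff.
PAYOFF = {("C", "C"): 3, ("C", "D"): 0, ("D", "C"): 5, ("D", "D"): 1}

# Alice's (assumed) credences over the four logically possible worlds.
# Most of the mass sits on the worlds where both output the same action.
CREDENCE = {("C", "C"): 0.45, ("D", "D"): 0.45, ("C", "D"): 0.05, ("D", "C"): 0.05}


def conditional_expected_payoff(action, credence, payoff):
    """E[payoff | Alice.source outputs `action`], under Alice's own credences."""
    worlds = {w: p for w, p in credence.items() if w[0] == action}
    total = sum(worlds.values())
    return sum(p * payoff[w] for w, p in worlds.items()) / total


def edt_choice(credence, payoff):
    """Pick the action with the highest conditional expected payoff."""
    return max("CD", key=lambda a: conditional_expected_payoff(a, credence, payoff))


# Same payoffs, but no believed correlation between the two outputs.
UNIFORM = {world: 0.25 for world in PAYOFF}
```

With these numbers, conditioning on Cooperate gives an expected payoff of 2.7 and conditioning on Defect gives 1.4, so `edt_choice(CREDENCE, PAYOFF)` picks Cooperate. With `UNIFORM` credences the same procedure picks Defect. Nothing changed except the credences.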

This can also be phrased in terms of logical counterfactuals: if Alice.source = C, then it is very likely that Bob.source = C.[3] This is a logical counterfactual: there is, ultimately, a logical fact of the matter about what Alice.source outputs, but since she doesn’t know it yet, she entertains what seems likely to happen in both mathematically possible worlds (the one with Alice.source = C and the one with Alice.source = D). After the fact, it will turn out that one of these worlds was mathematically impossible all along. But at the time of decision, she doesn’t know which.

Logical counterfactuals and learning

What’s up with these logical counterfactuals, though? What part of reality are they really tracking? Why exactly is Alice justified in believing Alice.source = C makes it more likely Bob.source = C?

Notice first, there is a mathematically well-defined sense about what we mean by (some) empirical counterfactuals. For example, “What would have happened if the fundamental low-level laws of physics were X, instead of Y?” Well, you just run the mathematical calculations! “What would have happened if, at this exact point in time, the quantum coin had landed the other way?” That’s just like asking about what happens in a different quantum branch! You can just run the calculations![4] Put another way, when assuming mathematical omniscience, there are some well-defined interpretations of empirical uncertainty and empirical counterfactuals.

There doesn’t seem to be a similar fact of the matter about “whether Fermat’s Last Theorem would be true, were the second digit of π a 9”. The digit is not a 9! What am I supposed to do to compute the likely truths of that other world, which I know leads to contradiction and logical explosion?
You can’t run any calculation: the very math you would use for that is exactly what is being called into question, what we are uncertain about. Put another way, now we cannot assume mathematical omniscience.
If you are uncertain about the digit, but have seen a proof in the past that the digit being 9 implies Fermat’s Last Theorem is false, then you will say false. If you haven’t seen such a proof, you might fall back on 50% for its truth.

There have been some attempts at obtaining a static, realist picture of logical counterfactuals, which tells you “the real, indisputable probability you should put on that statement for decision-making purposes, always valid, bestowed upon us by God”. Unfortunately, depending on how much mathematical structure you fix (that is, what your current mathematical beliefs are) you will see the landscape of possible mathematical worlds differently. And nothing is pointing in the direction of there existing a privileged notion of “exactly how much structure to fix”.

It seems like the best thing we can hope for is a dynamic picture of logical learning, like logical induction. That is, a bunch of half-guessing heuristics getting selected for on the basis of how well they perform at predicting real mathematics (whether Fermat’s Last Theorem is true), which does have a ground truth to check against, unlike mathematical counterfactuals.
Indeed, it should not be surprising that this is the best we can do, since it’s what real agents (like mathematicians) do: they don’t have a universal procedure to easily check whether a counterfactual dependency might hold, or whether a research direction is promising, etc. Instead, they have a bag of mostly illegible pattern-finding heuristics that they’ve learned over the years, due to training them against how real theorems are proved.[5]

Non-counterfactual mathematical statements are the ground truth we’re trying to approximate. By contrast, counterfactual statements are nothing but a temporary byproduct of your heuristics churning away at that approximation problem. The fumes coming out of the machine.

The important difference is the following:
For any prediction/disagreement about mathematical statements, you can just check it. But for any prediction/disagreement about counterfactuals, if you look at it closely enough (try to check it), either

  • it dissolves (the counterfactual wasn’t true, so your prediction was meaningless), or

  • it becomes a question about mathematical statements (the counterfactual was true, so now let’s check how accurate you were at predicting mathematical statements in this world)

The problem can also be phrased in terms of conditional bets:
At some point, you are considering several counterfactuals, and making various guesses about what might happen if each of them turns out true. But you know that, eventually, only one of them will be true (even if you don’t yet know which). For that one, you’ll actually get feedback from reality, and be able to assess your calibration. But for all the others, there is no ground truth to check, and your predictions go forever un-assessed.

If there is a kind of counterfactual that almost never gets realized, one might worry we’ll have poor calibration on it, since reality hasn’t given us many data points to learn from (indeed, this happens in logical inductors). And even more: what would it even mean to be well-calibrated on counterfactuals that don’t happen? It seems like the bettor has a degree of freedom here that won’t affect its performance. And the problem will come up when these degrees of freedom are made decision-relevant.
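The conditional-bets framing can be sketched in a few lines. This is a toy illustration under my own assumptions: the bettors, their numbers, and the choice of a Brier score are all hypothetical. The point is only that a bet whose antecedent never comes true is called off, so it cannot move anyone’s score.

```python
def brier_score(bets, realized_antecedent, outcome):
    """Score only the conditional bet whose antecedent actually came true.

    bets: dict mapping an antecedent to the probability assigned to the
    consequent (here, "Bob.source = C") given that antecedent.
    Bets on unrealized antecedents are called off and contribute nothing.
    """
    p = bets[realized_antecedent]
    return (p - (1.0 if outcome else 0.0)) ** 2


# Two hypothetical bettors: identical on the antecedent that gets realized,
# far apart on the one that does not.
bettor_1 = {"Alice=D": 0.1, "Alice=C": 0.9}  # P(Bob=C | antecedent)
bettor_2 = {"Alice=D": 0.1, "Alice=C": 0.2}

# Reality: Alice defects and Bob defects, so only "Alice=D" resolves.
score_1 = brier_score(bettor_1, "Alice=D", outcome=False)
score_2 = brier_score(bettor_2, "Alice=D", outcome=False)
```

Both bettors end up with exactly the same score, even though they disagree wildly about what Bob would have done had Alice Cooperated. That disagreement is the degree of freedom reality never gets to grade.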

What do I mean by Subjective?

Here’s a more formal argument, framed in terms of latent variables:

Say there are some facts about the real world, X, which talk about real logic and math. Stuff like whether Fermat’s Last Theorem is true.

Two agents A and A’ want to predict them efficiently.

To do this, they have some latent variables L and L’, which are basically their world-models, and help them predict X. The latents would include things like “this is what I think the landscape of logical correlations (or counterfactuals) broadly looks like”.

A weak worry

One first worry is that A and A’ have different beliefs about X: What if Bob, wrongly, is very confident that Alice.source = D is the case, regardless of his action?

This is plainly analogous to an agent not knowing enough about empirical facts to successfully coordinate.

These shortcomings might be realistic: indeed, logical inductors can be arbitrarily wrong about some fact for an arbitrarily long time. And similarly for the messy approximations to logical induction that we all implement in our brains.
Here, we would be using Subjectiveness to mean something very general and vacuous: “my current predictions depend on my past observation history”. Which is true of everything, including empirical predictions.

But on average, we probably expect A and A’ to have almost the same probability distribution over X. At least for powerful enough agents. Otherwise, one of them would be literally wrong, and reality would hit them in the face.
That is, probably selection pressures are strong enough to avoid this failure, and ensure their beliefs converge. This does tell us something about L and L’: that their carried information about X is somewhat isomorphic.

A stronger worry

But this still leaves a free parameter on how this information is represented, that is, the exact shape of L and L’ (even if they both are very good methods to approximate X). If you ask them something that is not in X (for example, about a logical counterfactual), they might give completely different answers.

Another way to say this is: Yes, the selection pressures for getting real math right will probably be strong enough. But the pressures shaping the heuristics they use to think about counterfactuals seem much weaker. Sure, one out of every bunch of counterfactuals does eventually get checked against reality, because the counterfactual turns out to be true. But this only gives you one data point, and your heuristic needs to have opinions about a lot of different counterfactuals.[6]
So what I really mean by Subjectiveness is more narrow: “there is no ground truth against which to check all predictions in this class, and as a result these predictions depend on some internal degrees of freedom that might not get strongly shaped by selection pressures”.[7] For example, math is not subjective under this definition.

So this becomes a question of whether selection pressures will be strong enough to make all powerful agents converge on using the same heuristics (mental tools, mental models) (L) to approximate ground truth (X). Which is stronger than ensuring they have the same beliefs about X. It is conceivable for this to happen, but it’s certainly an additional assumption we’re uncertain about.


How, exactly, could miscoordination come about because of this?

Say Alice is considering whether to implement a certain commitment (Com). She’s considering this because it looks pretty cooperation-inducing to her, so she thinks maybe Bob would also like it. She tries to think about how Bob would react if he learned[8] that she implemented Com. To do this, of course, she uses all of her available heuristics (that have been selected for by her past experiences with math) to reason through a counterfactual question: “conditional on Alice.source implementing Com, how likely is Bob.source to do the same?”. To her partial surprise, her heuristics have noticed a few regularities in Bob.source (that they have observed in past experiences with other algorithms) that point towards him reacting worse, were she to implement it. Thus, she doesn’t.
Meanwhile, Bob is also thinking. He notices Alice might want to implement Com. He reasons about what he might do in that situation (again, invoking his trained heuristics to assess a logical counterfactual), and arrives at the conclusion that he’d reciprocate (contrary to what Alice found out). To his surprise, it turns out Alice doesn’t implement Com, and as a result they end up not Cooperating.

The important, worrisome part about the above interaction is that neither of them was ever proven wrong. Neither of them got feedback from reality on their predictions, and as a result improved their heuristics to ensure this stops happening in the future. They both made a conditional prediction, but the antecedent turned out to not come about (Alice.source didn’t implement Com after all), so the predictions never resolved. It is thus a priori possible for agents to keep having these disastrous interactions without changing the heuristics they use to assess counterfactuals, always losing out on Pareto-optimality.
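Here is a toy sketch of how this can repeat forever. Everything in it is an assumption made for illustration: Alice’s belief is collapsed into a single number, she implements Com only above a fixed threshold, and her update rule is a simple move toward whatever she observes. It is not a logical inductor, just the smallest model I can think of with the same failure.

```python
def play_round(alice_belief_reciprocate, bob_would_reciprocate, threshold=0.5):
    """One interaction. Alice implements Com only if she thinks Bob would reciprocate.

    Returns (alice_implements, observation). The observation is None when the
    antecedent "Alice implements Com" never came true, so nothing resolves.
    """
    alice_implements = alice_belief_reciprocate > threshold
    observation = bob_would_reciprocate if alice_implements else None
    return alice_implements, observation


def update(belief, observation, rate=0.5):
    """Move the belief toward the observed outcome; no observation, no update."""
    if observation is None:
        return belief
    return belief + rate * ((1.0 if observation else 0.0) - belief)


def run(initial_belief, bob_would_reciprocate=True, rounds=100):
    """Repeat the interaction and return (final belief, number of cooperations)."""
    belief, cooperations = initial_belief, 0
    for _ in range(rounds):
        implemented, obs = play_round(belief, bob_would_reciprocate)
        cooperations += int(implemented and bob_would_reciprocate)
        belief = update(belief, obs)
    return belief, cooperations
```

Starting from a pessimistic belief, `run(0.3)` ends with the belief still at 0.3 and zero cooperations, even though Bob would have reciprocated every single time. Starting from `run(0.6)`, the same Bob is met with cooperation in all 100 rounds. The only difference is a prior that reality never gets the chance to correct.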

(One might argue that the gains from coordination are a strong enough selection pressure to ensure this doesn’t happen, see the last section.)

Another useful example, which illustrates how even communication between them might get hard:
Alice has already implemented Com, which happens to be of the shape “for any agent who does algorithmic-maneuver-Y, I Cooperate with them”. Bob looks at Alice.source, with Com implemented and all, and ponders what this commitment might entail, and how he should act in response. Thus, he’s considering counterfactuals of the form “if I do algorithmic-maneuver-Y, what will Alice.source output?”.
Alice tried to choose Com to make it as legible as possible to others, so that they can safely Cooperate. Legible here means “it’s easy to predict what Alice.source will do as a response to different types of agents (for example, those who do and don’t implement algorithmic-maneuver-Y)”. But of course, what Alice understands by “easy to predict” is relative to her own learned heuristics! Maybe Bob has evolved quite different heuristics to assess these same counterfactuals. Especially if the antecedents to these counterfactuals have not been coming true in the past, so that they haven’t gotten feedback from reality. As a result, from Bob’s beliefs on counterfactuals, it seems like executing algorithmic-maneuver-Y would make Alice.source more likely to Defect (contrary to what Alice thought). Thus Bob doesn’t perform algorithmic-maneuver-Y, and as above, neither of the two is proven wrong.

Both these examples can be easily formalized using logical inductors. Indeed, logical inductors with the right starting priors (distributions over traders) can get forever stuck on these non-cooperating equilibria, even while their beliefs about real math get arbitrarily good. This is pretty related to FixDT problems, and it’s looking like the only way to solve it is by forcing ε-exploration.

Uncertain solutions

1. Just think about Bob

Alice could also try not just to use her own heuristics to assess “how legible this commitment seems”, but also reason about Bob’s heuristics, and how they will react. First off, that might be computationally harder, since Bob.source is as complex as Alice.source. But the real problem is that this is still a logical counterfactual (“if Alice.source implements Com, how does Bob.source react?”), about which they might disagree. That said, it is conceivable that doing this does incentivize better fix-points.

2. Just talk it out

Alice and Bob are losing out on Cooperation due to having different counterfactual beliefs. Why don’t they just talk it out, try to make their beliefs converge (aggregate them in some way), or at least find some commitment mechanism that they both clearly see almost certainly ensures Cooperation?

A first problem is that they might be purposefully avoiding reasoning about some of each other’s algorithmic behaviors, due to Commitment Race dynamics. But that’s a separate problem (and its extreme versions don’t seem that likely), so let’s just assume they both are completely okay thinking about each other.

A second, opposite problem is they might have private information they don’t want to reveal. But again, different problem, plus unclear how likely in purely acausal scenarios.

The real problem is that their disagreements about counterfactuals might themselves hinder communications, or make them impossible.
Imagine Alice says something as simple as “If you say coconut, I will Cooperate”. This utterance is meant to be evidence for Bob that Cooperation is easily achievable.
But put yourself in Bob’s shoes. You have just learned a new, indisputable mathematical fact about Alice.source (that in this situation, it outputs the string “If you say coconut, I will Cooperate”). But what does that really tell you about Alice.source, and especially about how it will act in different logical counterfactuals? (The one where you say coconut, the one where you say banana, etc.) Again, to Alice’s heuristics it might look clear that her algorithm is wired in such a way that speaking that sentence is very strong evidence in favor of the prediction “if Bob.source outputs coconut, then Alice.source = C”. But Bob’s different heuristics might assert otherwise. Similar things can happen for more complex coordination-minded sentences like “let’s start a deliberation process through which we try to find a cooperation mechanism we both like”.[9]

That said, it is again conceivable that trying procedures like these does incentivize better fix-points.

I’ll also note it’s unclear how they could even aggregate their different opinions about correlations/​counterfactuals, were they to try, since they can’t check those counterfactuals against ground truth. (Except if they already know which counterfactual they’re in, that is, what they both play, which defeats the purpose.)
This is a problem that not only superintelligences like Alice and Bob experience, but also acausal-trade researchers with different intuitions as to how strong the correlation with other kinds of agents might be (for example).
A natural proposal is re-running both their mathematical learnings, but now as a single pile of learning heuristics (a single logical inductor) rather than two. It’s even conceivable this is a good mathematical Schelling point from which to bootstrap coordination (post coming soon).

3. Just implement a Löbian handshake

In a similar spirit, we already know proof-based agents (and maybe even probabilistically proof-based) can coordinate in some cool ways, so why don’t they just do that?

The problem is again that, to the extent a heuristics-based agent is considering whether to partially rewrite itself to implement a proof-based protocol, it will need to ponder how likely this is to lead to good outcomes. And these are opinions about logical counterfactuals.
Also, unless the agents rewrite into completely proof-based agents, they will still have some heuristics-based parts, and it might be hard to assess (using heuristics) how these parts will interact with the proof-based module.

Although again, Löbian handshakes might be enough of a mathematical Schelling point to incentivize better fix-points.

4. Just avoid counterfactuals

Here’s why you can’t:
If Alice weren’t uncertain of what Bob will end up doing, there would by definition be no evidential coordination possible.
And Alice being uncertain means exactly her pondering different logical counterfactuals as possible.

Since they can’t just simulate each other (infinite recursion), at least one of them will have to make a decision (either to an action, or a policy) at some point, before being completely certain of what the other will do. That is, they can’t both be the more meta. And this agent will have to do so by vaguely predicting the other’s response, using heuristics based on past experience, but without certainty about it. This “without certainty” means exactly that different counterfactuals still seem possible.

5. Just force ε-exploration

As mentioned above, this is the natural fix for logical inductors. If Alice sometimes ε-explores into Cooperating, she will find out that Bob actually reciprocates with Cooperation, and from there they’ll notice this is an equilibrium they can reach (finally providing feedback from reality against beliefs to the contrary), and bootstrap cooperation.

There are many reasons to dislike ε-exploration. The obvious one is that we’d prefer not to need to take random actions sometimes (which might be bad, even when we know better). And of course, the frequency with which you explore trades off against how fast you get out of bad equilibria.

A more important one is that the exploration can also kick you out of good equilibria (for example, inducing Defection).

Although again, it’s not clear a version of ε-exploration can’t pragmatically help quite a bit.
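A minimal sketch of the upside, under toy assumptions of my own: Alice’s belief that Bob reciprocates is a single number, she commits only above a threshold, and with probability `epsilon` she commits regardless. The function name, the threshold, and the update rule are all hypothetical. This only models exploring into Cooperation; it does not model the downside of exploration kicking you out of a good equilibrium.

```python
import random


def run_with_exploration(initial_belief, epsilon, rounds=1000, seed=0,
                         bob_would_reciprocate=True, threshold=0.5, rate=0.5):
    """Alice commits if her belief exceeds `threshold`, or with probability
    `epsilon` regardless of her belief (forced exploration)."""
    rng = random.Random(seed)
    belief = initial_belief
    first_escape = None  # first round at which the belief crosses the threshold
    for t in range(rounds):
        explore = rng.random() < epsilon
        implements = explore or belief > threshold
        if implements:
            # The antecedent came true, so the conditional prediction resolves
            # and Alice finally gets feedback from reality.
            target = 1.0 if bob_would_reciprocate else 0.0
            belief += rate * (target - belief)
        if first_escape is None and belief > threshold:
            first_escape = t
    return belief, first_escape
```

With `epsilon=0.0` and a starting belief of 0.3, the belief never moves and `first_escape` stays `None`. With `epsilon=0.05`, a single exploratory commitment is enough in this toy model to push the belief past the threshold, after which cooperation sustains itself. A smaller `epsilon` means fewer wasted random actions but, on average, a longer wait before the first escape.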


Getting practical: How likely does it seem that this is a real problem for superintelligences?

To start with, how likely does it seem that selection pressures on real-math prediction are not strong enough to ensure agents end up with close enough tools to think about counterfactuals? How hard does it seem like these tools get shaped, given they only receive feedback on one out of every many predictions?

It’s conceivable to believe that a few finite core heuristics will be the most efficient tool to predict any corner of the logical landscape. And so, agents intelligent enough to have found these will have the same beliefs about logical counterfactuals.

As mentioned above, the strong version of this claim (a static, realist picture of literally all logical counterfactuals) doesn’t seem likely. It seems more likely that more complex heuristics always give you better predictions of real math (and that these changes in heuristics change their opinions on infinitely many counterfactuals). More concretely, that when you run a logical inductor, new successful traders keep appearing forever (and have different opinions on some conditional bets).

But some more modest and pragmatic claim seems more likely: the existence of such an efficient core of heuristics to predict the logical landscape up to a certain complexity.[10] If the “complexity of predictions”/​”size of core” ratio is high enough, the “size of core” might be small enough for all powerful agents to have it, and the “complexity of predictions” might be high enough to encompass the relevant properties about these agents’ behaviors. Thus, their counterfactuals would agree on all practical matters.

My unjustified intuition is that the returns from more heuristics will be much more gradual (no clear “core”). But at the same time, that enough convergence will exist about the “tricks that on average work more” (whether you represent them in this or that way) that faraway superintelligences (in the same mathematical universe) will, at least, be able to bootstrap successful coordination if that’s their sole objective.

Digits of π being uniformly pseudo-random, or my near-copy doing the same thing as me, are clearly very selected-for beliefs. And when thinking about the more complex mathematical statements involved in reasoning about the algorithms of superintelligences, I still have the mildly optimistic intuition that some high-dimensional enough abstractions will be “clearly the efficient thing”. But I might be understating the path-dependence in abstraction formation.

As a counterpoint, as algorithms become more complex (the space of heuristics grows), their possible interactions (the counterfactuals you need to have opinions on) grow exponentially. It is conceivable that the total amount of observations doesn’t grow exponentially in the complexity of the algorithm (this is the case for logical inductors). So with time you get fewer observations per counterfactual, which means a smaller fraction of your beliefs on counterfactuals are justified (shaped by reality).
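To get a feel for the sizes involved, here is a back-of-the-envelope sketch. Both growth rates are assumptions chosen purely for illustration: 2**n counterfactuals at complexity n, and n**2 observations. Any sub-exponential observation count gives the same qualitative picture.

```python
def observed_fraction(n, observations=lambda n: n ** 2):
    """Fraction of counterfactuals that ever receive feedback from reality,
    assuming (for illustration only) 2**n counterfactuals at complexity n
    and a sub-exponential number of observations."""
    return min(1.0, observations(n) / 2 ** n)


# Fraction of counterfactual beliefs shaped by reality at a few complexities.
fractions = {n: observed_fraction(n) for n in (10, 20, 40)}
```

Under these assumptions, roughly a tenth of the counterfactuals get checked at n = 10, and the fraction keeps shrinking toward zero as n grows. The rest of the agent’s counterfactual beliefs are left to whatever its heuristics happen to say.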

We can also consider other selection pressures, different from efficient prediction of real math.
For example, maybe coordination or a shared counterfactual language is directly selected for. That is, group selection ensures that agents with heuristics shaped so as to incentivize better fix-points are more likely to survive. The problem here is that this seems to mainly incentivize coordination in causal interactions, and the heuristics needed to incentivize coordination in acausal interactions might be very different. Put another way, the heuristics didn’t generalize well. Although it is also conceivable for some acausal interactions to make the participants lose resources, and so selection also exists in these.
(More speculation about the epistemic states and coordination mechanisms of superintelligences here and here.)


Beliefs about evidential correlations (= logical counterfactuals) might not get as strongly shaped by selection pressures as beliefs about ground-truth mathematical facts (or empirical facts), even when we limit ourselves to this mathematical universe. If so, faraway superintelligences might have irresolvably different beliefs about correlations, and this can hinder coordination (like acausal trade) even when both “would have wanted it”, or “tried hard to achieve it”.

It is the quantitative strength of selection pressure (plus other considerations like whether the superintelligences can even bootstrap a conversation) that determines whether this failure happens in reality. It is conceivable that all superintelligences converge on the same mental tools to deal with correlations (thus easily coordinate), but also that this doesn’t happen.

While there doesn’t seem to be any sure-fire solution, I am mildly optimistic about the strength of convergence. And so, I predict that superintelligences failing to coordinate due to these considerations (assuming that is their sole objective, thus ignoring other worries like Commitment Races) is relatively unlikely (20%).

  1. ^

    In realistic situations, they will also be empirically uncertain about which are their exact algorithms. This doesn’t prevent Evidential Cooperation from happening, given they have appropriate empirical beliefs.

  2. ^

    Indeed, from her perspective, she still has mostly free choice over what Alice.source outputs, since she is exactly that algorithm, and is right now deliberating about which action to output.

  3. ^

    Here the counterfactual is taken by conditioning on the antecedent (since we are being Evidentialist), but this would also work for a Logi-Causalist view, in which some logical statements (like Alice.source = C) literally cause others to be true (like Bob.source = C).

  4. ^

    Not really, because we don’t have a Theory of Everything, but you get the point.

  5. ^

    For more on the philosophical consequences of agents being piles of heuristics, see this philosophy article I wrote for a class.

  6. ^

    Speculatively, the situation might be even harder for heuristics learned in the past to navigate when the content of the counterfactual under scrutiny talks about how other agents think about counterfactuals. This happens when Alice thinks about what Bob might do who thinks about what Alice might do… This might make most such interactions have too chaotic ramifications for past heuristics to be well-calibrated on them, or behave in a way that reliably leads to coordination.

  7. ^

    Notice that I am not saying “opinions on counterfactuals are subjective because all probabilities are”. That’s not useful, because the world shapes your probabilities (or better said, selects for agents with the right probabilities).

    I’m also not claiming “math is relative, because in other (im)possible mathematical universes it is different”. I’m not discussing the possibility of terminally caring for mathematically counterfactual worlds here. I’m assuming math is fixed throughout this argument, and I’m just describing something that might happen to algorithms (inside real math).

  8. ^

    This learning can happen as purely logical updates (noticing Alice.source does this), without needing empirical observations.

  9. ^

    Then how come humans can coordinate using language?
    In fact, the situation for humans is even harder, since we can’t transparently see each other’s code.
    Even if humans were close to perfect expected value maximizers, the answer would be that, both due to socially systematized language learning, and other common properties of our brains, we happen to share enough of a common understanding about logical counterfactuals (and even more, about how our unseen algorithms might behave) to coordinate.
    But it’s maybe even more important that humans have been jointly optimized to achieve coordination.

  10. ^

    This bears resemblance to the objective of heuristic arguments: finding a finite heuristic estimator that yields a good enough approximation of the mathematical landscape, at least up to the complexity level of the ML model behaviors we care about. My impression is an incredibly strong solution to their objective would provide evidence that such “cores of shared heuristics” exist. As complete speculation, a heuristic estimator might even be transformable into a literal “counterfactual assessor”, by letting it take as heuristic arguments an unjustified statement that “a proof of X exists” (where X is the counterfactual’s antecedent).