## Believing others’ priors

## Meet the Bayesians

In one way of looking at Bayesian reasoners, there are a bunch of possible worlds and a bunch of people, who start out with some guesses about what possible world we’re in. Everyone knows everyone else’s initial guesses. As evidence comes in, agents change their guesses about which world they’re in via Bayesian updating.

The Bayesians can share information just by sharing how their beliefs have changed.

“Bob initially thought that last Monday would be sunny with probability 0.8, but now he thinks it was sunny with probability 0.9, so he must have seen evidence that he judges to be 4/9ths as likely if it wasn’t sunny as if it was.”
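The arithmetic behind Bob's example can be checked directly. The sketch below is only an illustration: the helper names `odds` and `implied_likelihood_ratio` are made up here, and it assumes Bob updated by Bayes' rule on a single piece of evidence.

```python
from fractions import Fraction

def odds(prob):
    """Convert a probability into odds in favour."""
    return prob / (1 - prob)

def implied_likelihood_ratio(prior, posterior):
    """P(evidence | sunny) / P(evidence | not sunny) implied by a Bayesian update."""
    return odds(posterior) / odds(prior)

prior_sunny = Fraction(8, 10)
posterior_sunny = Fraction(9, 10)

ratio = implied_likelihood_ratio(prior_sunny, posterior_sunny)
print(ratio)      # 9/4: the evidence is 9/4 as likely if it was sunny
print(1 / ratio)  # 4/9: equivalently, 4/9 as likely if it wasn't sunny
```

Posterior odds are prior odds times the likelihood ratio, so going from odds of 4 to odds of 9 pins the ratio down without Bob having to describe the evidence itself.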

If they have the same priors, they’ll converge to the same beliefs. But if they don’t, it seems they can agree to disagree. This is a bit frustrating, because we don’t want people to ignore our very convincing evidence just because they’ve gotten away with having a stupid weird prior.

What can we say about which priors are permissible? Robin Hanson offers an argument that either (a) we must believe our prior was created by a special process that correlated it with the truth more than everyone else’s, or (b) our prior must be the same as everyone else’s.

## Meet the pre-Bayesians

How does that argument go? Roughly, Hanson describes a slightly more nuanced set of reasoners: the pre-Bayesians. The pre-Bayesians are not only uncertain about what world they’re in, but also about what everyone’s priors are.

These uncertainties can be tangled together (the joint distribution doesn’t have to factorise into their beliefs about everyone’s priors and their beliefs about worlds). Facts about the world can change their opinions about what prior assignments people have.

Hanson then imposes a pre-rationality condition: if you find out what priors everyone has, you should agree with your prior about how likely different worlds are. In other words, you should trust your prior in the future. Once you have this condition, it seems that it’s impossible to both (a) believe that some other people’s priors were generated in a way that makes them as likely to be good as yours and (b) have different priors from those people.

Let’s dig into the sort of things this pre-rationality condition commits you to.

Consider the class of worlds where you are generated by a machine that randomly generates a prior and sticks it in your head. The pre-rationality rule says that worlds where this randomly-generated prior describes the world well are more likely than worlds where it is a poor description.

So if I pop out with a very certain belief that I have eleven toes, such that no amount of visual evidence that I have ten toes can shake my faith, the pre-prior should indeed place more weight on those worlds where I have eleven toes and various optical trickery conspires to make it look like I have ten.

If this seems worrying to you, consider that you may be asking too much of this pre-rationality condition. After all, if you have a weird prior, you have a weird prior. In the machine-generating-random-priors world, you already believe that your prior is a good fit for the world. That’s what it is to have a prior. Yes, according to our actual posteriors it seems like there should be no correlation between these random priors and the world they’re in, but asking the pre-rationality condition to make our actual beliefs win out seems like a pretty illicit move.

Another worry is that it seems there’s some spooky action-at-a-distance going on between the pre-rationality condition and the assignment of priors. Once everyone has their priors, the pre-rationality condition is powerless to change them. So how is the pre-rationality condition making it so that everyone has the same prior?

I claim that actually, this presentation of the pre-Bayesian proof is not quite right. According to me, if I’m a Bayesian and believe our priors are equally good, then we must have the same priors. If I’m a pre-Bayesian and believe our priors are equally good, then I must believe that your prior averages out to mine. This latter move is open to the pre-Bayesian (who has uncertainty about priors) but not to the Bayesian (who knows the priors).

I’ll make an argument purely within Bayesianism that goes from believing in equally good priors to having the same prior, and then we’ll see how belief in priors comes in for a pre-Bayesian.

## Bayesian prior equality

To get this off the ground, I want to make precise the claim of believing someone’s priors are as good as yours. I’m going to look at 3 ways of doing this. Note that Hanson doesn’t suggest a particular one, so he doesn’t have to accept any of these as what he means, and that might change how well my argument works.

Let’s suppose my prior is p and yours is q. Note, these are fixed functions, not references pointing at my prior and your prior. In the Bayesian framework, we just have our priors, end of story. We don’t reason about cases where our priors were different.

Let’s suppose score is a strictly proper scoring rule (if you don’t know what that means, I’ll explain in a moment). score takes in a probability distribution over a random variable and an actual value for that random variable. It gives more points the more of the probability distribution’s mass is near the actual value. For it to be strictly proper, I uniquely maximise my expected score by reporting my true probability distribution. That is, $E_p[\mathrm{score}(f, X)]$ is uniquely maximised when $f = p$.
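To make strict propriety concrete, here is a minimal sketch using the logarithmic score, which is one standard strictly proper rule (the post does not commit to a particular one). The beliefs and the grid of candidate reports are made-up numbers.

```python
import math

def log_score(report, outcome):
    """Logarithmic score: log of the probability the report gave to what happened."""
    return math.log(report[outcome])

def expected_score(belief, report, score=log_score):
    """Expected score of `report` when outcomes are drawn from `belief`."""
    return sum(prob * score(report, outcome)
               for outcome, prob in belief.items() if prob > 0)

p = {"sunny": 0.8, "rainy": 0.2}  # my true beliefs (made-up numbers)

# Try a grid of alternative reports f and keep the one with the best expected score.
candidates = [{"sunny": k / 100, "rainy": 1 - k / 100} for k in range(1, 100)]
best = max(candidates, key=lambda f: expected_score(p, f))
print(best["sunny"])  # 0.8: reporting my true beliefs wins
```

Any report other than `p` itself gets a strictly lower expected score under `p`, which is the property the argument below leans on.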

Let’s also suppose my posterior is p|B, that is (using notation a bit loosely) my prior probability conditioned on some background information B.

Here are some attempts to precisely claim someone’s prior is as good as mine:

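1. For all $X$, $E_p[\mathrm{score}(p, X)] = E_p[\mathrm{score}(q, X)]$.

2. For all $X$, $E_{p \mid B}[\mathrm{score}(p \mid B, X)] = E_{p \mid B}[\mathrm{score}(q \mid B, X)]$.

3. For all $X$, $E_{p \mid B}[\mathrm{score}(p, X)] = E_{p \mid B}[\mathrm{score}(q, X)]$.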

(1) says that, according to my prior, your prior is as good as mine. Because the scoring rule is strictly proper, my prior is the unique maximiser of my expected score, so this means that your prior is the same as mine.
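For the logarithmic scoring rule, the step from (1) to equal priors can be spelled out. This is a sketch for one particular strictly proper rule, $\mathrm{score}(f, x) = \log f(x)$, which the post itself does not single out:

$$
E_p[\log p(X)] - E_p[\log q(X)] = \sum_x p(x) \log \frac{p(x)}{q(x)} = D_{\mathrm{KL}}(p \,\|\, q) \ge 0,
$$

with equality exactly when $q(x) = p(x)$ for every $x$ (Gibbs' inequality). So if the two expected scores agree for every $X$, then $q = p$.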

(2) says that, according to my posterior, the posterior you’d have with my current information is as good as the posterior I have. By the definition of the proper scoring rule, this means that your posterior is equal to my posterior. This is a bit broader than (1), and allows your prior to have already “priced in” some information that I now have.
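A small sketch of the "priced in" case: if your prior is just my prior already conditioned on B, our priors differ but our posteriors given B coincide, so (2) holds. The worlds and probabilities are invented for illustration, and `condition` is a hypothetical helper, not something from the post.

```python
from fractions import Fraction

def condition(dist, event):
    """Condition a distribution over worlds on an event (a set of worlds)."""
    total = sum(prob for world, prob in dist.items() if world in event)
    return {world: (prob / total if world in event else Fraction(0))
            for world, prob in dist.items()}

# Worlds are (weather, apple_fell) pairs; the numbers are made up.
p = {
    ("sunny", True): Fraction(3, 10),
    ("sunny", False): Fraction(3, 10),
    ("rainy", True): Fraction(1, 10),
    ("rainy", False): Fraction(3, 10),
}
B = {world for world in p if world[1]}  # "the apple fell"

q = condition(p, B)  # a prior that already takes B for granted

print(q == p)                              # False: the priors differ
print(condition(q, B) == condition(p, B))  # True: the posteriors agree
```

Conditioning on B a second time changes nothing, which is why (2) is strictly more permissive than (1).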

(3) says that given what we know now, your prior was as good as mine.

That rules out q = p|B. That would be a prior that’s better than mine: it’s just what you get from mine when you’re already certain you’ll observe some evidence (like an apple falling in 1663). Observing that evidence doesn’t change your beliefs.

In general, it can’t be the case that you predicted B as more likely than I did, which can be seen by taking X = B.

On future events, your prior can match my prior, or it can diverge from my posterior as far as my prior does, but in the opposite direction.
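The "equally far, opposite direction" picture is exact for the Brier (quadratic) score on a single binary event, so that case makes a convenient sketch. The probabilities are made up, and other strictly proper rules would put the second solution somewhere else on the far side of my posterior.

```python
from fractions import Fraction

def brier_score(report, outcome):
    """Negative squared error of a probability report (higher is better)."""
    return -(report - outcome) ** 2

def expected_brier(belief, report):
    """Expected Brier score of `report` when the event has probability `belief`."""
    return belief * brier_score(report, 1) + (1 - belief) * brier_score(report, 0)

my_prior = Fraction(1, 2)      # p(X), made-up
my_posterior = Fraction(3, 5)  # (p|B)(X), made-up
mirror = 2 * my_posterior - my_prior  # 7/10: as far from 3/5 as 1/2, on the other side

print(expected_brier(my_posterior, my_prior))      # -1/4
print(expected_brier(my_posterior, mirror))        # -1/4, so q = 7/10 satisfies (3)
print(expected_brier(my_posterior, my_posterior))  # -6/25, so q = p|B does strictly better
```

The last line is the reason (3) rules out q = p|B: judged by my posterior, that prior beats mine rather than tying with it.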

I don’t really like (3), because while it accepts that your prior was as good as mine in the past, it can think that after you update your prior you’ll still be worse than me.

That leaves us with (1) and (2). If (1) or (2) is our precise notion, then it follows quickly that we have common priors.

This is just a notion of logical consistency though; I don’t have room for believing that our prior-generating processes make yours as likely to be true as mine. It’s just that if the probability distribution that happens to be your prior appears to me as good as the probability distribution that happens to be my prior, they are the same probability distribution.

## Pre-Bayesian prior equality

How do we make the pre-Bayesian claim that your prior is as good as mine?

Here, let pᵢ be my prior as a reference, rather than as a concrete probability distribution. Claims about pᵢ are claims about my prior, no matter what function that actually ends up being. So for example, claiming that pᵢ scores well is claiming that as we look at different worlds, we see it is likely that my prior is a well-adapted prior for that specific world. In contrast, a claim that p scores well would be a claim that the actual world looks a lot like p.

Similarly, pⱼ is your prior as a reference. Let **p** (in bold) be a vector assigning a prior to each agent.

Let f be my pre-prior. That is, my initial beliefs over combinations of worlds and prior assignments. Similarly to above, let f|B be my pre-posterior (a bit of an awkward term, I admit).

For ease of exposition (and I don’t think entirely unreasonably), I’m going to imagine that I know my prior precisely. That is, f(w, **p**) = 0 if pᵢ ≠ p.

Here are some ways of making the belief that your prior is as good as mine precise in the pre-Bayesian framework.

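1. For all $X$, $E_p[\mathrm{score}(p, X)] = E_f[\mathrm{score}(p_j, X)]$.

2. For all $X$, $E_{p \mid B}[\mathrm{score}(p \mid B, X)] = E_{f \mid B}[\mathrm{score}(p_j \mid B, X)]$.

3. For all $X$, $E_{p \mid B}[\mathrm{score}(p, X)] = E_{f \mid B}[\mathrm{score}(p_j, X)]$.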

On the LHS, the expectation uses p rather than f, because of the pre-rationality condition. Knowing my prior, my updated pre-prior agrees with it about the probability of the ground events. But I still don’t know your prior, so I have to use f on the RHS to “expect” over the event and your prior itself.
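One way to write the step being used here, as a sketch in the notation above (the conditioning event $p_i = p$ is shorthand for "my prior turns out to be the function $p$"):

$$
f(w \mid p_i = p) = p(w) \quad \text{for every world } w,
$$

so for any quantity $g(X)$ that depends only on the ground events, $E_f[g(X) \mid p_i = p] = E_p[g(X)]$. A score of your prior, $\mathrm{score}(p_j, X)$, depends on $p_j$ as well, which is why the right-hand side keeps the expectation under $f$.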

(1) says that, according to my pre-prior, your prior is as good as mine in expectation. Strict propriety says that my prior is the unique maximum only when the competitor is a fixed function. Here I could, in principle, believe that your prior is better adapted to each world than mine, yet because I’m not certain which world we’re in (or what your prior is), I can’t update my beliefs to match it.

Given the equality, I can’t expect to gain by switching priors with you in general, but I could think your prior is more correlated with the truth than mine in some cases and less so in others.
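A toy pre-prior shows how the equality in (1) can hold even though your prior is never equal to mine. Everything here is an assumption made for illustration: a single binary event X, the Brier score, and a hand-picked joint distribution `f` over (outcome, your prior) whose marginal over outcomes matches my prior.

```python
from fractions import Fraction

def brier_score(report, outcome):
    """Negative squared error of a probability report (higher is better)."""
    return -(report - outcome) ** 2

my_prior = Fraction(1, 2)  # p(X = 1)

# Hypothetical pre-prior f over (outcome of X, probability your prior gives to X = 1).
f = {
    (1, Fraction(4, 5)): Fraction(13, 40),  # X happens, your prior leaned towards it
    (1, Fraction(1, 5)): Fraction(7, 40),   # X happens, your prior leaned against it
    (0, Fraction(1, 5)): Fraction(13, 40),  # X fails, your prior leaned against it
    (0, Fraction(4, 5)): Fraction(7, 40),   # X fails, your prior leaned towards it
}

# My beliefs about X under f agree with my prior.
marginal_x = sum(weight for (x, _), weight in f.items() if x == 1)

my_expected = my_prior * brier_score(my_prior, 1) + (1 - my_prior) * brier_score(my_prior, 0)
your_expected = sum(weight * brier_score(q, x) for (x, q), weight in f.items())

print(marginal_x)     # 1/2
print(my_expected)    # -1/4
print(your_expected)  # -1/4
```

In this toy case your prior points the right way 65% of the time and the wrong way 35% of the time. The gain in the good cases is cancelled by the larger loss in the bad cases, so the expected scores tie although the two priors never coincide.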

(2) says that, according to my pre-posterior, your prior conditioned on my info is, in expectation, as good as my prior conditioned on my info.

I like this better than (1). Evidence in the real world leads me to beliefs about the prior production mechanisms (like genes, nurture and so on). These don’t seem to give a good reason for my innate beliefs to be better than anyone else’s. Therefore, I believe your prior is probably as good as mine on average.

But note, I don’t actually know what your prior is. It’s just that I believe we probably share similar priors. The spooky action-at-a-distance is eliminated. This is just (again) a claim about consistent beliefs: if I believe that your prior got generated in a way that made it as good as mine, then I must believe it’s not too divergent from mine.

(3) says that, given what we now know, I think your prior is no better or worse than mine in expectation. This is about as unpalatable in the pre-Bayesian as the Bayesian case.

So, on either (1) or (2), I believe that your prior will, on average, do as well as mine. I may not be sure what your prior is, but cases where it’s far better will be matched by cases where it’s far worse. Even knowing that your prior performs exactly as well as mine, I might not know exactly which prior you have. I know that all the places it does worse will be matched by an equal weight of places where it does better, so I can’t appeal to my prior as a good reason for us to diverge.

A bit of a side point:

> So if I pop out with a very certain belief that I have eleven toes, such that no amount of visual evidence that I have ten toes can shake my faith, the pre-prior should indeed place more weight on those worlds where I have eleven toes and various optical trickery conspires to make it look like I have ten.

We all have these blind spots. The one about free will is very common. Physics says there is no such thing, all your decisions are either predetermined, or random or chaotic, yet most of this site is about making rational decisions. We all believe in 11 toes.

Suppose I realized the truth of what you say, and came to believe that I have no free will. How would I go about acting on this belief?

Isn’t the answer to that question “this question is incoherent; you might do some things as a deterministic consequence of coming to hold this new belief, but you have no ability to choose any actions differently on that basis”?

And if there is no way for me to act on a belief, in what sense can that belief be said to have content?

I tried to express my thoughts on the topic in one of the posts. Instead of thinking of ourselves as agents with free will, we can think of agents as observers of the world unfolding before them, continuously adjusting their internal models of it. In this approach there is no such thing as a logical counterfactual; all seeming counterfactuals are artifacts of the observer’s map not reflecting the territory faithfully enough, so two different agents look like the same agent able to make separate decisions. I am acutely aware of the irony and the ultimate futility (or, at least, inconsistency) of one collection of quantum fields (me) trying to convince (= change the state of) another collection of quantum fields (you) as if there was anything other than physical processes involved, but it’s not like I have a choice in the matter.

> Physics says there is no such thing, all your decisions are either predetermined, or random or chaotic

No, free will is not a topic covered in physics textbooks or lectures. You are appealing to an implicit definition of free will that libertarians don’t accept.

It’s not covered, no; it’s not a physics topic at all. I wish I could imagine the mechanism of top-down causation that creates non-illusory free will that is implicit in MIRI’s work. The most earnest non-BS attempt I have seen so far is Scott Aaronson’s The Ghost in the Quantum Turing Machine, and he basically concedes that, barring an exotic freebit mechanism, we are all quantum automatons.

I think Sean Carroll does a pretty good job, e.g. in Free Will Is As Real As Baseball.

I like that post a lot, too. Putting compatibilism into the context of modern physics. One does not need to worry about physics when dealing with human interactions, and vice versa. Those are different levels of abstraction, different models. The problem arises when one goes outside the domain of validity of a given model without realizing it. And I see the AI alignment work as crossing that boundary. Counterfactuals are a perfectly fine concept when looking to make better decisions. They are a hindrance when trying to prove theorems about decision making. Hence my original point about blind spots.

You are telescoping a bunch of issues there. It is not at all clear that top-down causation is needed for libertarian free will, for instance. And MIRI says free will is illusory.

I would like to see at least some ideas as to how agency can arise without top-down causation.

You are assuming some relationship between agency and free will that has not been spelt out. Also, an entirely woo-free notion of agency is a ubiquitous topic on this site, as has been pointed out to you before.

I must have missed it or it didn’t make sense to me...

In my view, agency is sort of like life—it’s hard to define itself, but the results are fairly obvious. Life tends to spread to fill all possible niches. Agents tend to steer the world toward certain states. But this only shows that you don’t need top-down causation to notice an agent (ignoring how, hypothetically, “notice” is a top-down process; what it means is that “is this an agent” is a fairly objective and non-agenty decision problem).

How can you affect lower levels by doing things on higher levels? By “doing things on a higher level”, what you are really doing is changing the lower level so that it appears a certain way on a higher level.

If what you say is correct, we should expect MIRI to claim that two atom-level identical computers could nonetheless differ in Agency. I strongly predict the opposite—MIRI’s viewpoint is reductionist and physicalist, to the best of my knowledge.

I am not denying that the more complex an animal is, the more agency it appears to possess. On the other hand, the more we know about an agent, the less agency it appears to possess. What ancients thought of as agent-driven behavior, we see as natural phenomena not associated with free will or decision making. We still tend to anthropomorphize natural phenomena a lot (e.g. evolution, fortune), often implicitly assigning agency to them without realizing or admitting it. Teleology can at times be a useful model, of course, even in physics, but especially in programming.

It also ought to be obvious, but isn’t, that there is a big disconnect between “I decide” and “I am an algorithm”. You can often read here and even in MIRI papers that agents can act contrary to their programming (that’s where counterfactuals show up). A quote from Abram Demski:

> Suppose you know that you take the $10. How do you reason about what would happen if you took the $5 instead?

Your prediction, as far as I can tell, has been falsified. An agent magically steps away from its own programming by thinking about counterfactuals.

Either you are right that “you know that you take the $10”, or you are mistaken about this knowledge, not both, unless you subscribe to the model of changing a programming certainty.

1. Are you saying that the idea of a counterfactual inherently requires Transcending Programming, or that thinking about personal counterfactuals requires ignoring the fact that you are programmed?

2. Counterfactuals aren’t real. They do not correspond to logical possibilities. That is what the word means—they are “counter” to “fact”. But in the absence of perfect self-knowledge, and even in the knowledge that one is fully deterministic, you can still not know what you are going to do. So you’re not required to think about something that you know would require Transcending Programming, even if it is objectively the case that you would have to Transcend Programming to do that in reality.

I posted about it before here: Logical Counterfactuals are low-res. I think you are saying the same thing here. And yes, analyzing one’s own decision-making algorithms and adjusting them can be very useful. However, Abram’s statement, as I understand it, does not have the explicit qualifier of incomplete knowledge of self. Quite the opposite, it says “Suppose you know that you take the $10”, not “You start with a first approximation that you take $10 and then explore further”.

You’re right—I didn’t see my confusion before, but Demski’s views don’t actually make much sense to me. The agent knows for certain that it will take $X? How can it know that without simulating its decision process? But if “simulate what my decision process is, then use that as the basis for counterfactuals” is part of the decision process, you’d get infinite regress. (Possible connection to fixed points?)

I don’t think Demski is saying that the agent would magically jump from taking $X to taking $Y. I think he’s saying that agents which fully understand their own behavior would be trapped by this knowledge because they can no longer form “reasonable” counterfactuals. I don’t think he’d claim that Agenthood can override fundamental physics, and I don’t see how you’re arguing that his beliefs, unbeknownst to him, are based on the assumption that Agenthood can override fundamental physics.

I cannot read his mind; odds are I misinterpreted what he meant. But if MIRI doesn’t think that counterfactuals as they appear to be (“I could have made a different decision but didn’t, by choice”) are fundamental, then I would expect a careful analysis of that issue somewhere. Maybe I missed it. I posted on a related topic some five months ago, and had some interesting feedback from jessicata (Jessica Taylor of MIRI) in the comments.

## Believing others’ priors

## Meet the Bayesians

In one way of looking at Bayesian reasoners, there are a bunch of possible worlds and a bunch of people, who start out with some guesses about what possible world we’re in. Everyone knows everyone else’s initial guesses. As evidence comes in, agents change their guesses about which world they’re in via Bayesian updating.

The Bayesians can share information just by sharing how their beliefs have changed.

If they have the same priors, they’ll converge to the same beliefs. But if they don’t, it seems they can agree to disagree. This is a bit frustrating, because we don’t want people to ignore our

very convincing evidencejust because they’ve gotten away with having a stupid weird prior.What can we say about which priors are permissible? Robin Hanson offers an argument that we must either (a) believe our prior was created by a special process that correlated it with the truth more than everyone else’s or (b) our prior must be the same as everyone else’s.

## Meet the pre-Bayesians

How does that argument go? Roughly, Hanson describes a slightly more nuanced set of reasoners: the pre-Bayesians. The pre-Bayesians are not only uncertain about what world they’re in, but also about what everyone’s priors are.

These uncertainties can be tangled together (the joint distribution doesn’t have to factorise into their beliefs about everyone’s priors and their beliefs about worlds). Facts about the world can change their opinions about what prior assignments people have.

Hanson then imposes a pre-rationality condition: if you find out what priors everyone has, you should agree with your prior about how likely different worlds are. In other words, you should trust your prior in the future. Once you have this condition, it seems that it’s impossible to both (a) believe that some other people’s priors were generated in a way that makes them as likely to be good as yours and (b) have different priors from those people.

Let’s dig into the sort of things this pre-rationality condition commits you to.

Consider the class of worlds where you are generated by a machine that randomly generates a prior and sticks it in your head. The pre-rationality rule says that worlds where this randomly-generated prior describes the world well are more likely than worlds where it is a poor description.

So if I pop out with a very certain belief that I have eleven toes, such that no amount of visual evidence that I have ten toes can shake my faith, the pre-prior should indeed place more weight on those worlds where I have eleven toes and various optical trickery conspires to make it look like I have ten.

If this seems worrying to you, consider that you may be asking too much of this pre-rationality condition. After all, if you have a weird prior, you have a weird prior. In the machine-generating-random-priors world, you already believe that your prior is a good fit for the world. That’s what it is to have a prior. Yes, according to our

actualposteriors it seems like there should be no correlation between these random priors and the world they’re in, but asking the pre-rationality condition to make our actual beliefs win out seems like a pretty illicit move.Another worry is that it seems there’s some spooky action-at-a-distance going on between the pre-rationality condition and the assignment of priors. Once everyone has their priors, the pre-rationality condition is powerless to change them. So how is the pre-rationality condition making it so that everyone has the same prior?

I claim that actually, this presentation of the pre-Bayesian proof is not quite right. According to me, if I’m a

Bayesianand believe our priors are equally good, then we must have the same priors. If I’m a pre-Bayesian and believe our priors are equally good, then I mustbelievethat your prior averages out to mine. This latter move is open to the pre-Bayesian (who has uncertainty about priors) but not to the Bayesian (who knows the priors).I’ll make an argument purely within Bayesianism for believing in equally good priors to having the same prior, and then we’ll see how belief in priors comes in for a pre-Bayesian.

## Bayesian prior equality

To get this off the ground, I want to make precise the claim of believing someone’s priors are as good as yours. I’m going to look at 3 ways of doing this. Note that Hanson doesn’t suggest a particular one, so he doesn’t have to accept any of these as what he means, and that might change how well my argument works.

Let’s suppose my prior is

pand yours isq. Note, these are fixed functions, not references pointing at my prior and your prior. In the Bayesian framework, we just have our priors, end of story. We don’t reason about cases where our priors were different.Let’s suppose

scoreis a strictly proper scoring rule (if you don’t know what that means, I’ll explain in a moment). score takes in a probability distribution over a random variable and an actual value for that random variable. It gives more points the more of the probability distribution’s mass is near the actual value. For it to be strictly proper, I uniquely maximise my expected score by reporting my true probability distribution. That is Ep[score(f,X)] is uniquely maximised when f = p.Let’s also suppose my posterior is

p|B, that is (using notation a bit loosely) my prior probability conditioned on some background information B.Here are some attempts to precisely claim someone’s prior is as good as mine:

For all X, Ep[score(p,X)]=Ep[score(q,X)].

For all X, Ep|B[score(p|B,X)]=Ep|B[score(q|B,X)].

For all X, Ep|B[score(p,X)]=Ep|B[score(q,X)].

(1) says that, according to my prior, your prior is as good as mine. By the definition of a proper scoring rule, this means that your prior is the same as mine.

(2) says that, according to my posterior, the posterior you’d have with my current information is as good as the posterior I have. By the definition of the proper scoring rule, this means that your posterior is equal to my posterior. This is a bit broader than (1), and allows your prior to have already “priced in” some information that I now have.

(3) says that given what we know now, your prior was as good as mine.

That rules out q = p|B. That would be a prior that’s better than mine: it’s just what you get from mine when you’re already certain you’ll observe some evidence (like an apple falling in 1663). Observing that evidence doesn’t change your beliefs.

In general, it can’t be the case that you predicted B as more likely than me, which can be seen by taking X = B.

On future events, your prior can match my prior, or diverge from my posterior equally as far as my prior, but in the opposite direction.

I don’t really like 3, because while it accepts that your prior was as good as mine in the past, it can think that after you update your prior you’ll still be worse than me.

That leaves us with 1 and 2 then. If 1 or 2 are our precise notion, then it follows quickly that we have common priors.

This is just a notion of logical consistency though; I don’t have room for believing that our prior-generating processes make yours as likely to be true as mine. It’s just that if the probability distribution that happens to be your prior appears to me as good as the probability distribution that happens to be my prior, they are the same probability distribution.

## Pre-Bayesian prior equality

How to make pre-Bayesian claim that your prior is as good as mine?

Here let, pᵢ be my prior as a reference, rather than as a concrete probability distribution. Claims about pᵢ are claims about my prior, no matter what function that actually ends up being. So for example, claiming that pᵢ scores well is claiming that as we look at different worlds, we see it is likely that my prior is a well-adapted prior for that specific world. In contrast, a claim that p scores well would be a claim that the actual world looks a lot like p.

Similarly, pⱼ is your prior as a reference. Let

pbe a vector assigning a prior to each agent.Let

fbe my pre-prior. That is, my initial beliefs over combinations of worlds and prior assignments. Similarly to above, letf|Bbe my pre-posterior (a bit of an awkward term, I admit).For ease of exposition (and I don’t think entirely unreasonably), I’m going to imagine that I know my prior precisely. That is f(w,

p) = 0 if pᵢ ≠ p.Here are some ways of making the belief that your prior is as good as mine precise in the pre-Bayesian framework.

For all X, Ep[score(p,X)]=Ef[score(pⱼ,X)].

For all X, Ep|B[score(p|B,X)]=Ef|B[score(pⱼ|B,X)].

For all X, Ep|B[score(p,X)]=Ef|B[score(pⱼ,X)].

On the LHS, the expectation uses p rather than f, because of the pre-rationality condition. Knowing my prior, my updated pre-prior agrees with it about the probability of the ground events. But I still don’t know your prior, so I have to use f on the RHS to “expect” over the event and your prior itself.

(1) says that, according to my pre-prior, your prior is as good as mine in expectation. The proper scoring rule says that my prior is the unique maximum

for a fixed function. But I could, in principle, believe that your prior is better adapted to each world than my prior, but I’m still not certain which world we’re in (or what your prior is), so I can’t update my beliefs.Given the equality, I can’t want to switch priors with you in general, but I could think you have a prior that’s more correlated with truth than mine in some cases and less so in others.

(2) says that, according to my pre-posterior, your prior conditioned on my info is, in expectation, as good as my prior conditioned on my info.

I like this better than (1). Evidence in the real world leads me to beliefs about the prior production mechanisms (like genes, nurture and so on). These don’t seem to give a good reason for my innate beliefs to be better than anyone else’s. Therefore, I believe your prior is probably as good as mine on average.

But note, I don’t actually know what your prior is. It’s just that

I believewe probably share similar priors. The spooky action-at-a-distance is eliminated. This is just (again) a claim about consistent beliefs: if I believe that your prior got generated in a way that made it as good as mine, then I must believe it’s not too divergent from mine.says that, given what we now know, I think your prior is no better or worse than mine in expectation. This is about as unpalatable in the pre-Bayesian as the Bayesian case.

So, on either (1) or (2), I believe that your prior will, on average, do as well as mine. I may not be sure what your prior is, but cases where it’s far better will be matched by cases where it’s far worse. Even knowing that your prior performs exactly as well as mine, I might not know exactly which prior you have. I know that all the places it does worse will be matched by an equal weight of places where it does better, so I can’t appeal to my prior as a good reason for us to diverge.

A bit of a side point:

We all have these blind spots. The one about free will is very common. Physics says there is no such thing, all your decisions are either predetermined, or random or chaotic, yet most of this site is about making rational decisions. We all believe in 11 toes.

Suppose I realized the truth of what you say, and came to believe that I have no free will. How would I go about acting on this belief?

Isn’t the answer to that question “this question is incoherent; you might do some things as a deterministic consequence of coming to hold this new belief, but you have no ability to choose any actions differently on that basis”?

And if there is no way for me to act on a belief, in what sense can that belief be said to have content?

I tried to express my thoughts on the topic in one of the posts. Instead of thinking of ourselves as agents with free will, we can think of agents as observers of the world unfolding before them, continuously adjusting their internal models of it. In this approach there is no such thing as a logical counterfactual, all seeming counterfactuals are artifacts of the observer’s map not reflecting the territory faithfully enough, so two different agents look like the same agent able to make separate decisions. I am acutely aware of the irony and the ultimate futility (or, at least, inconsistency) of one collection of quantum fields (me) trying to convince ( =change the state of) another collection of quantum fields (you) as if there was anything other than physical processes involved, but it’s not like I have a choice in the matter.

No, free will is not a topic covered in physics textbooks or lectures. You are appealing to an implicit definition of free will that libertarians don’t accept.

It’s not covered, no; it’s not a physics topic at all. I wish I could imagine the mechanism of top-down causation that creates non-illusionary free will that is implicit in MIRI’s work. The most earnest non-BS attempt I have seen so far is Scott Aaronson’s The Ghost in the Quantum Turing Machine, and he basically concedes that, barring an exotic freebit mechanism, we are all quantum automatons.

I think Sean Carroll does a pretty good job, e.g. in Free Will Is As Real As Baseball.

I like that post a lot, too. Putting compatibilism into the context of modern physics. One does not need to worry about physics when dealing with human interactions, and vice versa. Those are different levels of abstraction, different models. The problem arises when one goes outside the domain of validity of a given model without realizing it. And I see the AI alignment work as crossing that boundary. Counterfactuals are a perfectly fine concept when looking to make better decisions. They are a hindrance when trying to prove theorems about decision making. Hence my original point about blind spots.

You are telescoping a bunch of issues there. It is not at all clear that top-down causation is needed for libertarian free will, for instance. And MIRI says free will is illusory.

I would like to see at least some ideas as to how agency can arise without top-down causation.

You are assuming some relationship between agency and free will that has not been spelt out. Also, an entirely woo-free notion of agency is a ubiquitous topic on this site, as has been pointed out to you before.

I must have missed it or it didn’t make sense to me...

In my view, agency is sort of like life: it is hard to define in itself, but the results are fairly obvious. Life tends to spread to fill all possible niches. Agents tend to steer the world toward certain states. But this only shows that you don’t need top-down causation to notice an agent (ignoring how, hypothetically, “notice” is a top-down process; what it means is that “is this an agent” is a fairly objective and non-agenty decision problem).

How can you affect lower levels by doing things on higher levels? By “doing things on a higher level”, what you are really doing is changing the lower level so that it appears a certain way on a higher level.

If what you say is correct, we should expect MIRI to claim that two atom-level identical computers could nonetheless differ in Agency. I strongly predict the opposite: MIRI’s viewpoint is reductionist and physicalist, to the best of my knowledge.

I am not denying that the more complex an animal is, the more agency it appears to possess. On the other hand, the more we know about an agent, the less agency it appears to possess. What ancients thought of as agent-driven behavior, we see as natural phenomena not associated with free will or decision making. We still tend to anthropomorphize natural phenomena a lot (e.g. evolution, fortune), often implicitly assigning agency to them without realizing or admitting it. Teleology can at times be a useful model, of course, even in physics, but especially in programming.

It also ought to be obvious, but isn’t, that there is a big disconnect between “I decide” and “I am an algorithm”. You can often read here and even in MIRI papers that agents can act contrary to their programming (that’s where counterfactuals show up). A quote from Abram Demski:

Your prediction, as far as I can tell, has been falsified. An agent magically steps away from its own programming by thinking about counterfactuals.

It’s been programmed to think about counterfactuals.

Either you are right that “you know that you take the $10”, or you are mistaken about this knowledge, not both, unless you subscribe to the model of changing a programming certainty.

1. Are you saying that the idea of a counterfactual inherently requires Transcending Programming, or that thinking about personal counterfactuals requires ignoring the fact that you are programmed?

2. Counterfactuals aren’t real. They do not correspond to logical possibilities. That is what the word means: they are “counter” to “fact”. But in the absence of perfect self-knowledge, and even in the knowledge that one is fully deterministic, you can still not know what you are going to do. So you’re not required to think about something that you know would require Transcending Programming, even if it is objectively the case that you would have to Transcend Programming to do that in reality.

I posted about it before here. Logical Counterfactuals are low-res. I think you are saying the same thing here. And yes, analyzing one’s own decision-making algorithms and adjusting them can be very useful. However, Abram’s statement, as I understand it, does not have the explicit qualifier of incomplete knowledge of self. Quite the opposite, it says “Suppose you know that you take the $10”, not “You start with a first approximation that you take $10 and then explore further”.

You’re right. I didn’t see my confusion before, but Demski’s views don’t actually make much sense to me. The agent knows for certain that it will take $X? How can it know that without simulating its decision process? But if “simulate what my decision process is, then use that as the basis for counterfactuals” is part of the decision process, you’d get infinite regress. (Possible connection to fixed points?)

I don’t think Demski is saying that the agent would magically jump from taking $X to taking $Y. I think he’s saying that agents which fully understand their own behavior would be trapped by this knowledge because they can no longer form “reasonable” counterfactuals. I don’t think he’d claim that Agenthood can override fundamental physics, and I don’t see how you’re arguing that his beliefs, unbeknownst to him, are based on the assumption that Agenthood can override fundamental physics.
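To make the regress worry concrete, here is a toy sketch. It is not Demski’s formalism, and every name in it (decide, hits_regress, the list of dollar options) is made up for illustration. It only assumes an agent whose decision procedure starts by running a full copy of itself to learn what it will choose.

```python
def decide(options):
    """Hypothetical agent: to choose, it first simulates its own choice."""
    # Step 1: "simulate what my decision process is". The simulation is the
    # same procedure, so it begins with this very same step.
    predicted = decide(options)
    # Step 2 (never reached): build counterfactuals around `predicted`
    # and pick the best option.
    return max(options)


def hits_regress(options):
    """Return True if the self-simulating agent never bottoms out."""
    try:
        decide(options)
    except RecursionError:
        # Python cuts off the infinite regress at its recursion limit.
        return True
    return False


if __name__ == "__main__":
    print(hits_regress([5, 10]))
```

The point of the sketch is only that certainty obtained by exact self-simulation has no base case. Any workable version would need an approximation or a fixed-point argument instead, which is where the possible connection to fixed points would come in.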