Functional Decision Theory: Not Even Wrong, Also Wrong

Bentham's Bulldog, 29 Jun 2026 15:34 UTC

1 The analytic philosophers vs the rationalists

A lot of analytic philosophers are sympathetic to Rationalism (the social movement, not the alternative to empiricism). I don’t know if I’m senior enough yet to count as a philosopher, but I certainly count myself as among those sympathetic. Yet virtually all of them have the same complaint: Rationalists very often make philosophical errors, especially when it comes to decision theory.

The Rationalist community, for those unaware, is a group devoted to forming beliefs rationally. They disproportionately live in the Bay Area, post on LessWrong, think AI is going to be a big deal, adopt various reductionist philosophical views, etc. I’ve written about my thoughts about the Rationalists here—they’re very smart and interesting; they get a lot right but are sometimes overconfident and wrong about philosophy.

The Rationalist decision theory du jour is called functional decision theory (FDT). Academic decision theorists don’t like the theory. The number of academic decision theorists who adopt it could be counted on one hand by someone missing four of their fingers.1 If some tragedy occurred causing you to lose all your fingers and Ben Levinstein to lose his life, you’d still be able to count the number of academic decision theorists who endorse FDT on one hand. My position on the view is as simple as can be: I think the view is definitely wrong. It is so underspecified that it gives no real recommendations, and the recommendations that it supposedly gives are extremely implausible on their face.

I have had debates with about 5 million Rationalists on this subject. Half my time in the Bay Area was spent arguing with people about decision theory. When I sleep, I am haunted by the ghosts of FDTers. If you keep saying some point over and over again, it sometimes makes sense to write it up. I thought I’d do that. But if you want to read more from other people who are better at decision theory than me, and also more sensible and measured, read Will MacAskill’s great piece and also Wolfgang Schwarz’s piece.

2 What the heck is FDT?

(Skip this section if you already know what each of the main decision theories says.)

Decision theories tell you how to get what you want. Specifically, they tell you how to reason about cases where different options get you different amounts of what you want. Those amounts are measured in units of utility. This has nothing to do with utilitarianism, the moral theory; utility just denotes the amount of whatever it is that you’re optimizing for.

There are two major decision theories that academic philosophers like. One is called causal decision theory (CDT). I’m trying to be impartial, so I won’t tell you that it’s (probably) the correct view. It says that you should take the action that causes you to have the most utility. Specifically, it says that when taking an action, you should ignore non-causal influences that your actions might have on the state of the world and only do what causes the best thing.

There’s a second view called evidential decision theory (EDT). It says that you should take the action which leaves you with the expectation of having the most utility. So when deciding between acts A and B, ask: how much utility would I expect to have if I take A? What about if I take B? If you’d expect to have more if you took A than B, then you should take A. If you’d expect to have more if you took B, then you should take B.

Functional decision theory is different from either. It says you should think of your action as determining the outcome of your decision algorithm. You should take the act which is such that across time, you expect to get the most utility if your algorithm outputs that act.

So EDT asks: what action leaves me with the expectation that I’ll be richest? CDT asks: what action causes me to be the richest? And FDT asks: which output of my algorithm would leave me expecting to be richest, if that output had been settled at the start of time?

Here’s a famous case to distinguish the theories. It’s called Newcomb’s problem. It’s the most famous dilemma in decision theory.

There are two boxes, A and B. You have the option of either taking just A or both A and B. B has $1,000. One hour ago, a very accurate predictor guessed whether you would take both boxes or just box A. If he predicted you would just take box A, he put $2,000 in box A. If he predicted you’d take both boxes, he put nothing in box A.

Question: should you take both boxes or just box A?

CDTers say: both boxes. Taking the second box causes you to get an extra $1,000. The fact that it correlates with there being less money in the box is irrelevant. By taking one box, CDTers claim, you’re just passing up an extra $1,000.

EDTers say: just one box. If you take just the first box, you’ll generally end up with $2,000 instead of $1,000. EDTers say: you expect to end up with more money if you take one box, so you should take one box!
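The disagreement between the first two camps can be put in numbers. What follows is only a rough sketch: the payoffs are the ones stated above, but the 99% accuracy figure is an assumption (the text only says “very accurate”), and the function names are made up for illustration.

```python
# Payoffs as stated in the text.
BOX_B = 1_000          # always in box B
BOX_A_IF_FULL = 2_000  # in box A only if one-boxing was predicted
ACCURACY = 0.99        # assumed; the text only says "very accurate"


def edt_value(action: str, accuracy: float = ACCURACY) -> float:
    """Expected payoff conditional on the action, treating the action as evidence about the prediction."""
    p_full = accuracy if action == "one-box" else 1 - accuracy
    extra = BOX_B if action == "two-box" else 0
    return p_full * BOX_A_IF_FULL + extra


def cdt_value(action: str, p_full: float) -> float:
    """Expected payoff with the chance that box A is full held fixed, since the act cannot cause it."""
    extra = BOX_B if action == "two-box" else 0
    return p_full * BOX_A_IF_FULL + extra


if __name__ == "__main__":
    for action in ("one-box", "two-box"):
        print(action, "EDT value:", round(edt_value(action)))
    for p_full in (0.0, 0.5, 1.0):
        gain = cdt_value("two-box", p_full) - cdt_value("one-box", p_full)
        print("P(box A full) =", p_full, "CDT gain from two-boxing:", round(gain))
```

Under these assumed numbers, the conditional expectation favors one-boxing (roughly $1,980 against $1,020), while the held-fixed calculation shows two-boxing ahead by $1,000 whatever probability you assign to box A being full. Each camp is maximizing its own quantity.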

FDTers say: it depends on how the predictor predicts what you’ll do. Suppose they run your algorithm or an algorithm very much like yours to predict what you’ll do. Well then, by changing the results of your algorithm, you change their prediction. So then you should one-box. The output of your algorithm, then, determines how much money is in the box—FDT thinks of your decisions as determining the results of your algorithm.

But suppose instead that they make predictions by looking at some other characteristic that merely correlates with one-boxing. E.g. maybe they look at whether you had a professor that two-boxed. In this case, FDT says you should two-box. The predictor isn’t running your algorithm, so changing the outcome of the algorithm doesn’t change what is in the box.

So what’s wrong with FDT? I have two main gripes: what FDT says is wildly underspecified—there’s no remotely plausible way to fill in the details. Also, the few judgments that FDT supposedly gives are often wildly implausible!

3 FDT doesn’t say anything

The biggest problem with FDT is that it is devoid of genuine content.

3.1 Is there a fact about how other functions would be different in the impossible world where mine was?

FDT says that when taking an action, you should consider how the world would be if your decision procedure gave some recommendation. But what does that mean? Specifically, suppose that you are kind of like me but different in a bunch of respects. Maybe you’re my brother. Maybe you’re Claude Opus 4.7 and I’m Claude Opus 4.6. Maybe you’re an almost exact copy of me. Does changing my algorithm change your algorithm? How could we possibly answer this question?

Remember, my decision algorithm is some mathematical function. So we’re asked to imagine in the mathematically impossible world where some math function outputted something different from what it mathematically has to output, whether other mathematical functions would be different. What could this mean? How could there possibly be an answer to this question? How can you have a theory that depends on there being determinate answers to the question: in the logically impossible world where some necessary mathematical fact was different, how would other necessary mathematical facts be different? What?

FDTers often claim that CDT requires considering counterpossibles too, because it instructs you to hold fixed what the world is independent of your choice and then make the decision that maximizes utility with respect to that. Now, even if this is right, it’s a lot sketchier to consider how other algorithms would be different in counterpossible worlds than just considering irrelevant features of generic counterpossibles. But CDT holds fixed only which things causally depend on your act, not the initial conditions. So it never has to consider a situation where, say, the initial conditions determine that you’ll take some act A, yet you take act B. As Wolfgang Schwarz put it:

For another example, Yudkowsky and Soares claim that CDT (like FDT) involves evaluating logically impossible scenarios. For example, “[CDTers] are asking us to imagine the agent’s physical action changing while holding fixed the behavior of the agent’s decision function”. Who says that? I would have thought that when we consider what would happen if you took one box in Newcomb’s Problem, the scenario we’re considering is one in which your decision function outputs one-boxing. We’re not considering an impossible scenario in which your decision function outputs two-boxing, you have complete control over your behaviour, and yet you choose to one-box. There are many detailed formulations of CDT. Yudkowsky and Soares ignore almost all of them and only mention the comparatively sketchy theory of Pearl. But even Pearl’s theory plausibly doesn’t appeal to impossible propositions to evaluate ordinary options. Lewis’s or Joyce’s or Skyrms’s certainly doesn’t.

And note: this isn’t just some minor quibble with what FDT says in a few cases. This is the core mechanic of FDT. This is what FDT needs to generate a single result in a single case! Every case where FDT gives a recommendation, it does so by analyzing the counterfactual where the output of a mathematical function was different. Insofar as there’s no fact of the matter about that, FDT doesn’t give any recommendations in any cases.

Let’s apply this to Newcomb’s problem. Suppose the predictor predicts what I’ll do by running an algorithm. Presumably it won’t be exactly the same algorithm as the one I’m employing. He’s not running an exact mental simulation of me even if his simulation reliably correlates with what I’ll do. Suppose my algorithm will in fact output one-boxing. FDT requires we answer: in the logically impossible world where my algorithm outputted two-boxing, would the predictor’s algorithm output two-boxing? Clearly there’s no fact of the matter about that! So FDT doesn’t even get clear results in Newcomb’s problem! As long as the predictor isn’t running an exact simulation of you, FDT falls silent on the question of what you should do.

3.2 Statistical correlations aren’t enough

Now, there’s an obvious-sounding solution to this problem. Just consider the nearest epistemically possible world where your decision theory outputs some recommendation, and then tabulate the amount of utility you expect to get. So suppose that you learned that your algorithm was disposed to two-box. Then ask: how much money would you expect to get. Compare that to how much you’d expect to get if you learned your algorithm was disposed to one-box. If you’d expect to have more after learning your algorithm one-boxes than two-boxes, then you should one-box.

But this obvious-sounding solution doesn’t work. It makes the theory into updateless EDT.2 To see this, imagine that some people are born with a gene that correlates heavily with two-boxing. The predictor predicts what I’ll do by looking at whether I have the gene. Two-boxing doesn’t cause the gene or affect whether you have the gene in any way. This solution would recommend one-boxing in this case. If I knew that my algorithm was disposed to one-box, I’d have a high credence in my lacking the gene, and in my getting rich. But FDT isn’t supposed to say that!
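To make the gene case concrete, here is a minimal sketch with made-up numbers. The prevalence and conditional probabilities are assumptions chosen only for illustration, and the payoffs are borrowed from the earlier statement of Newcomb’s problem; none of these figures come from the FDT literature.

```python
P_GENE = 0.5                    # assumed prevalence of the gene
P_TWO_BOX_GIVEN_GENE = 0.95     # assumed: carriers usually two-box
P_TWO_BOX_GIVEN_NO_GENE = 0.05  # assumed: non-carriers usually one-box


def p_gene_given(action: str) -> float:
    """Credence in having the gene after learning what your algorithm outputs (Bayes' rule)."""
    if action == "two-box":
        like_gene, like_no_gene = P_TWO_BOX_GIVEN_GENE, P_TWO_BOX_GIVEN_NO_GENE
    else:
        like_gene, like_no_gene = 1 - P_TWO_BOX_GIVEN_GENE, 1 - P_TWO_BOX_GIVEN_NO_GENE
    joint_gene = P_GENE * like_gene
    joint_no_gene = (1 - P_GENE) * like_no_gene
    return joint_gene / (joint_gene + joint_no_gene)


def conditional_value(action: str) -> float:
    """Expected payoff conditional on the output; box A is full exactly when the gene is absent."""
    p_full = 1 - p_gene_given(action)
    return p_full * 2_000 + (1_000 if action == "two-box" else 0)


def value_with_gene_fixed(action: str, has_gene: bool) -> int:
    """Payoff once the gene (and so the contents of box A) is held fixed."""
    box_a = 0 if has_gene else 2_000
    return box_a + (1_000 if action == "two-box" else 0)
```

With these assumed figures, `conditional_value("one-box")` exceeds `conditional_value("two-box")`, so the epistemic reading recommends one-boxing. Yet `value_with_gene_fixed` shows two-boxing ahead by exactly $1,000 whether or not you carry the gene, which is the verdict FDT is supposed to deliver here.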

In fact, this leaves FDT vulnerable to the very smoker’s lesion result that FDTers take to be decisive against EDT. Imagine that smoking doesn’t cause your health to be worse. Instead, smoking correlates with having a lesion on your lung that both makes you likelier to smoke and makes your health worse. It seems rational to smoke, because smoking has no effect on whether you have the lesion on your lung. Yet if your algorithm outputs smoking, that makes you expect that you have the lesion, and so it lowers the expected utility that you get according to this solution.

Now, you could modify the view once again so that you only analyze your expectations concerning other algorithms. This way, you wouldn’t look at how much utility you’d expect to get if your algorithm outputted some action. Or, at the very least, you wouldn’t take the action which, if your algorithm outputs, leaves you with the highest amount of expected utility. Instead, when deciding between two actions A and B, you’d imagine:

1. Your algorithm outputting A vs B.
2. What you expect other algorithms to output if yours outputs A vs if yours outputs B.
3. Then you count up the utility from you and other algorithms outputting A vs B. Whichever one leaves you with more utility timelessly (we’ll come back to the timeless thing later) is the one you take.

That way, you only analyze your algorithm’s probabilistic impact on other algorithms. Whether you have lung cancer is not an algorithm. So you don’t treat your algorithm being different as affecting it in the way relevant to decision making.

But this is of no help. Imagine a modified case where the lesion doesn’t make your health worse. Instead, there’s an algorithm that checks to see if you have the lesion. If you do, then it makes your health worse and also makes you likelier to smoke. Now there’s an algorithm in the mix, so this view is back to thinking (wrongly, and contrary to the spirit of FDT) that you shouldn’t smoke. After all, your algorithm outputting “don’t smoke” makes you expect that the other algorithm output “is less likely to smoke and has better health.”

So now the FDTers are in pretty rough shape. They need to have some account of how your algorithm outputting A would affect other different algorithms. But this can’t just be about your credence in the other algorithm having some outcome, conditional on yours outputting A. FDT depends on analyzing how your action being different (counterpossibly) would make other algorithms different (counterpossibly) without looking at how likely other algorithms are to be different in the nearest epistemically possible world where yours is different. How could there possibly be a satisfying solution to this problem?

What it needs is some precise specification of how similar two algorithms are that doesn’t depend on:

1. Extraneous factors (e.g. you won’t want to say that how much my algorithm being different affects other agents running algorithms depends on whether they and I make similar jokes).
2. The degree of correlation between ultimate decisions.

But what could it possibly depend on? Isn’t it obvious that there’s no single privileged joint-carving way to decide the similarity of algorithms that doesn’t just look at the statistical correlation between their outputs? Certainly FDTers owe us some account of how this works. It doesn’t do to call it an unsolved problem, when this is the entire engine of the theory—when there’s no plausible story of what a solution would even look like, strong active reason to think there is no such solution, and a solution is needed for the theory to give any result in any case.3

Let’s be a bit more concrete. Imagine that I’m in a prisoner’s dilemma against my twin (note that my twin isn’t exactly like me but is similar). I understand having a credence in my twin cooperating conditional on my cooperating. But if we’re not talking about conditional credences, how could there be a uniquely privileged sharp fact about the non-statistical algorithmic correlation between us two?

3.3 Perfectly correlated algorithms

FDT has another pretty bad result in this vicinity. Imagine that there’s some gene that correlates 99.9% with two-boxing. The gene is not caused by two-boxing; the two just correlate almost perfectly. Now imagine two different scenarios:

1. The predictor looks to see if you have the gene. If you don’t, they put $3,000 in the first box. If you do, they put nothing in the first box. The second box has $1,000. Should you take both boxes?
2. The predictor runs a simulation of you with 99.9% accuracy. The cases where the simulation is inaccurate are the same as the ones where there isn’t an overlap between your gene and which box you take. Thus, there is 100% overlap between the predictor’s judgment in this case and the last. The only difference is that in the last case, they look to see whether you have the gene, while in this case, they run a simulation of you. If they guess that you’ll one-box, they put $3,000 in box one, while if they guess you’ll two-box, they put nothing. The second box has $1,000. Should you take both boxes?

Here FDT’s answer is that you should two-box in the first case but not in the second case. But that’s very implausible. It runs afoul of the following principle:

Equivalent predictions: if there are two methods of prediction that always output the same predictions, your answer in Newcomb’s problem shouldn’t depend on which one was employed.

This principle strikes me as very obvious. One reason for this is that if two predictive algorithms always overlap, then when you know how one turned out, you also know how the other one would have turned out. But if you know how they’d both have turned out, then surely it doesn’t matter which one they actually used.

To see this, imagine that both kinds of predictors are employed. Then, they both send a signal that’s used to influence how much money is in the box. Whichever signal arrives first determines the amount of money in the box. Surely it doesn’t matter which arrives first? Because the two predictive methods always output the same thing, this has no bearing at all on the amount of money in the box!
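The signal race can be spelled out as a toy model. This is a sketch only: the agent records and function names are hypothetical, and the simulation is replaced by a stored result that, by stipulation, always agrees with the gene test.

```python
def gene_predictor(agent: dict) -> str:
    """Predicts two-boxing exactly when the agent has the gene."""
    return "two-box" if agent["has_gene"] else "one-box"


def simulation_predictor(agent: dict) -> str:
    """Stands in for running a simulation; here it just reads off a stored result."""
    return agent["simulated_choice"]


def box_one_contents(agent: dict, first_signal) -> int:
    """Money placed in the first box, based on whichever predictor's signal arrived first."""
    return 3_000 if first_signal(agent) == "one-box" else 0


# Hypothetical agents; by stipulation the simulation always agrees with the gene test.
agents = [
    {"has_gene": True, "simulated_choice": "two-box"},
    {"has_gene": False, "simulated_choice": "one-box"},
]

for agent in agents:
    via_gene = box_one_contents(agent, gene_predictor)
    via_simulation = box_one_contents(agent, simulation_predictor)
    print(via_gene, via_simulation)  # the two numbers on each line are equal
```

Since the two predictors never disagree, the identity of the winning signal makes no difference to what ends up in the box. On the principle above, it should make no difference to what you ought to do either.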

3.4 No fact about whether two algorithms are the same

Things get even worse. How do we determine if two functions are running the same algorithm? I’m told this is an “unsolved problem” for FDT. There seem to be a lot of those. And remember, you can’t just look at whether they always output the same thing, because FDT distinguishes between mere correlations and paired algorithms. As Will MacAskill put it in his piece:

Even putting the previous issues aside, there’s a fundamental way in which FDT is indeterminate, which is that there’s no objective fact of the matter about whether two physical processes A and B are running the same algorithm or not, and therefore no objective fact of the matter of which correlations represent implementations of the same algorithm or are ‘mere correlations’ of the form that FDT wants to ignore.
…
To see this, consider two calculators. The first calculator is like calculators we are used to. The second calculator is from a foreign land: it’s identical except that the numbers it outputs always come with a negative sign (‘–’) in front of them when you’d expect there to be none, and no negative sign when you expect there to be one. Are these calculators running the same algorithm or not? Well, perhaps on this foreign calculator the ‘–’ symbol means what we usually take it to mean — namely, that the ensuing number is negative — and therefore every time we hit the ‘=’ button on the second calculator we are asking it to run the algorithm ‘compute the sum entered, then output the negative of the answer’. If so, then the calculators are systematically running different algorithms.
But perhaps, in this foreign land, the ‘–’ symbol, in this context, means that the ensuing number is positive and the lack of a ‘–’ symbol means that the number is negative. If so, then the calculators are running exactly the same algorithms; their differences are merely notational.
Ultimately, in my view, all we have, in these two calculators, are just two physical processes. The further question of whether they are running the same algorithm or not depends on how we interpret the physical outputs of the calculator.
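The calculator example lends itself to a small sketch. The names below are invented for illustration: one function produces the foreign calculator’s display, and two “readers” encode the two candidate interpretations of the ‘–’ symbol.

```python
def familiar_display(a: int, b: int) -> str:
    """What an ordinary calculator shows for a + b."""
    return str(a + b)


def foreign_display(a: int, b: int) -> str:
    """Shows a leading '-' exactly when the ordinary calculator would not."""
    total = a + b
    digits = str(abs(total))
    return digits if total < 0 else "-" + digits


def read_standard(display: str) -> int:
    """Interpretation 1: '-' means the number is negative."""
    return int(display)


def read_flipped(display: str) -> int:
    """Interpretation 2: '-' means the number is positive."""
    return -int(display)


shown = foreign_display(2, 3)
print(shown)                 # -5
print(read_standard(shown))  # -5: on this reading the device computes -(a + b)
print(read_flipped(shown))   # 5: on this reading the device computes a + b
```

The physical behavior captured by `foreign_display` is the same in both cases. Whether it counts as the same algorithm as `familiar_display` depends entirely on which reader is applied to its output.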

Now, as Will notes, standard attempts to measure whether two algorithms are the same generally imply that one system may run many different algorithms simultaneously. If the ultimate account has to do with the mapping between inputs and outputs, then changing the output of your algorithm may have bizarre effects on other features of the world. As Will writes:

For example, if the physical process underlying some aspect of the US economy just happened to be isomorphic with FDT’s algorithm, then in the logically impossible world where FDT outputs a different algorithm, not only does the predictor act differently, but so does the US economy. And that will probably change the value of the world under consideration, in a way that’s clearly irrelevant to the choice at hand.

There’s a related problem. Suppose that there is someone who is psychologically identical to me at all times before Newcomb’s problem. In Newcomb’s problem, they one-box. Should we think of changing the results of my “algorithm” as changing the results of theirs? What could possibly determine this?

There’s a somewhat strange paradox here. Imagine that there’s someone who is psychologically identical to me at all times before the prisoner’s dilemma. I’m in a prisoner’s dilemma against them. They defect. On FDT, I should defect too. But then we’re running the same algorithm. So then I should cooperate. But then we’re running different algorithms, so I should defect.

Now, you might object that the scenario, as I’ve described, is impossible. If I’m basing my decision on theirs, then we can’t be running the exact same algorithm. Here we should imagine that my decision is not based on theirs. We should then consider the question: what action do I have most objective reason to do (as opposed to which action is best for me given what I know)?

3.5 Conclusion

So let’s recap. FDT needs a solution to each of the following to give almost any judgment in almost any case:

1. Determining how one algorithm being different would affect another algorithm being different, without depending on the epistemic probability of the second being different if the first was. This also can’t depend on extraneous factors. Remember additionally that we are imagining how other things would be different in the metaphysically impossible world where some mathematical fact is different, and we can’t just rely on epistemic probabilities! This problem seems obviously fatal.
2. Determining whether two algorithms are the same. There is no standard way of doing this, and there are deep reasons to think that any solution to this would have bizarre implications, e.g. on unrelated algorithms that happen to be isomorphic.

Then, even if we had a solution to both of those, FDT would still have these problems:

1. It implies that even if two predictive processes are 100% correlated, it would matter which one was used in Newcomb-type problems.
2. It generates a paradox in cases where an algorithm being the same as yours depends on what you do in some situation.

Absent a solution to the first two, FDT isn’t a theory. It’s a collection of suggestions. In every case that has ever arisen in the history of the species and all the standard thought experiments, it is wildly unclear what FDT says. There are deep reasons to think it doesn’t say anything.

4 Should you light yourself on fire for no benefit?

My answer is “no.” FDT’s answer is “yes.” Here’s the case (from Will MacAskill, though similar examples abound):

Bomb.
You face two open boxes, Left and Right, and you must take one of them. In the Left box, there is a live bomb; taking this box will set off the bomb, setting you ablaze, and you certainly will burn slowly to death. The Right box is empty, but you have to pay $100 in order to be able to take it.
A long-dead predictor predicted whether you would choose Left or Right, by running a simulation of you and seeing what that simulation did. If the predictor predicted that you would choose Right, then she put a bomb in Left. If the predictor predicted that you would choose Left, then she did not put a bomb in Left, and the box is empty.
The predictor has a failure rate of only 1 in a trillion trillion. Helpfully, she left a note, explaining that she predicted that you would take Right, and therefore she put the bomb in Left.
You are the only person left in the universe. You have a happy life, but you know that you will never meet another agent again, nor face another situation where any of your actions will have been predicted by another agent. What box should you choose?

FDT says that you should slowly and painfully burn yourself to death. After all, having the disposition to do that makes you better off in expectation timelessly. It makes it so that probably there won’t be the bomb in the box in the first place, and you won’t have to pay $100.

But this just seems irrelevant. The bomb is in the box. I have no uncertainty about what will happen if I choose Left. In cases where you have no uncertainty about how the world is, where one action simply leaves you with less utility, you shouldn’t take that action. The fact that this case is rare doesn’t matter! It’s a crazy recommendation of FDT that it tells you to light yourself on fire when you know that if you do so, you will not benefit at all.
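The two perspectives in play can be set side by side numerically. This is a rough sketch: the failure rate and the $100 fee come from the case as stated, while the dollar-equivalent cost of burning to death is a made-up figure included only so that the comparison can be computed.

```python
FAILURE_RATE = 1e-24          # "1 in a trillion trillion"
COST_RIGHT = 100              # fee for taking Right
COST_BURNING = 1_000_000_000  # assumed dollar-equivalent of burning to death


def ex_ante_loss(disposition: str) -> float:
    """Expected loss of a disposition, evaluated before the prediction is made."""
    if disposition == "Right":
        # A Right-taker pays the fee whether or not the prediction was accurate.
        return float(COST_RIGHT)
    # A Left-taker meets a bomb only when the predictor wrongly predicted Right.
    return FAILURE_RATE * COST_BURNING


def loss_given_bomb_in_left(action: str) -> float:
    """Loss once you know, from the note, that the bomb is in Left."""
    return float(COST_BURNING) if action == "Left" else float(COST_RIGHT)


print(ex_ante_loss("Left") < ex_ante_loss("Right"))                        # True
print(loss_given_bomb_in_left("Left") > loss_given_bomb_in_left("Right"))  # True
```

The first comparison is the one the FDTer appeals to: as a disposition, taking Left almost never costs anything. The second is the one facing you once the note has been read, where Left is certain death and Right costs $100.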

FDTers I’ve talked to sometimes have said this is unfairly rhetorically loaded. “It’s not for no benefit,” they claim. “The benefit comes from you being better off if your decision algorithm is disposed to make it.” But at the time you’re taking the act, you have no uncertainty about how the world is. You know what benefits will come about if you take the act: none. So this phrasing is accurate.

And there are an infinite number of other similar examples. Imagine that everyone in the world is put into a deep slumber. Then, the predictor simulates you and guesses if you’ll, thirty years after waking up, painfully cut off your leg for no benefit. The simulation is highly correlated with you, so his guesses about whether you’ll cut off your leg for no benefit are 99.9% accurate. If he predicts that you’ll slice off your leg, he wakes you up. If he predicts that you won’t, then he doesn’t. Assume that waking up is very good for you.

FDT implies that because being disposed to slice off your leg for no benefit makes you likelier to wake up, you should slice off your leg thirty years later. But that just seems crazy. At the time you’re making decisions, you’re already awake. If you’re already awake, it makes no sense to slice off your leg on grounds that it makes you likelier to be awake. The odds that you’re awake are already 100%.

Note: decision theories are theories about rationality. They tell you what decisions are wise and sensible to make. They are not theories about the desirable dispositions to have or about how you should program an AI. There are interesting questions about those sorts of things, but they aren’t what decision theory is about. So don’t think to yourself “would I be better off timelessly having the disposition to slice off my leg?” Think “is it rational, at the time I’m making the decision, to cut off my leg for no benefit?” I think the answer is clear: no! I’ll talk more about this distinction in the next section.

There’s a response to this that I’ve heard from a lot of functional decision theorists. Here’s the idea: you don’t really know if you are the algorithm being simulated in Newcomb’s problem or the actual person. For all you know, you might be the simulation, in which case you outputting “one-box” leads to more utility. I find this response very bizarre:

1. I know I’m not a simulated algorithm. The simulated algorithm isn’t conscious (we can stipulate). I am.
2. Decision theory generally assumes that you’re self-interested. But if I’m the algorithm, then I care about algorithm me, not the version in the real world. So then I wouldn’t care about what the output of the algorithm was.

Thus, I think FDT gives incorrect recommendations.

5 Does FDT get more utility?

A claim that FDTers are fond of making is that following FDT gets you the most utility. Take the version of Newcomb’s problem where the boxes are transparent, for example. So in this case, you can peer into both boxes and see how much money is in each. In this case, both CDT and EDT recommend taking two boxes. After all, at this point you have no uncertainty about how the world is—taking two boxes leaves you with an extra $1,000. FDTers recommend you take one box, because that timelessly leaves you with more utility.

Thus, FDTers generally leave transparent Newcomb’s problem richer than either EDTers or CDTers. FDT proponents claim that FDT “gets you more utility,” and is thus the right criterion of action. I have four problems with this argument.

First, I don’t think FDT says anything in any case because it’s not a complete theory (see section 3). If FDT says nothing, it can’t get you the most utility.

Second, FDT doesn’t always get you the most utility. For example, consider the following exotic possible world: the actual world. In this one, if you hang around academic philosophers, they will think you’re silly if you adopt FDT. This will make you sad. So adopting FDT gets you less utility. Additionally, in the actual world, I would get less utility if I were an FDTer, because I find it fun to argue with FDTers about decision theory. Or imagine that the government passed a law where they tortured everyone who thought FDT was the right view. FDTers wouldn’t be better off.

Or imagine the following setup. You’re offered a box full of cash. A predictor predicts if you’d one-box or two-box in Newcomb’s problem. If you one-box, they put nothing in. If you two-box, they put a million dollars in. Now, suddenly, it’s the two-boxers who are rich.

These examples may seem unfair. You directly get rewarded based on things that are downstream of your decision theory. But Newcomb’s problem is also unfair in precisely this way. It ties how much money you get in a box to your judgments in a decision problem.

Now, you can get around this by narrowing the claim. You can say something like “FDT gets you most utility with respect to the utility that’s downstream of your decision algorithm.” But similarly, CDTers can claim “CDT gets you the most utility causally,” and EDTers can claim “EDT gets you the most utility evidentially.” The different theories disagree about what kind of utility is decision-theoretically relevant. So just pounding the table and saying “my theory is best by the lights of the criterion that my theory says is decision-theoretically relevant,” is obviously question-begging.

Third, FDT isn’t actually the theory that leaves you with most expected utility on average. In fact, in many cases, it’s EDT (perhaps updateless) that leaves you with the most expected utility. For example, in the smoker’s lesion case, EDTers tend to finish better-off than other people. In smoker’s lesion, smoking correlates with worse health, but it doesn’t cause it. But EDTers are less likely to smoke, so on average they’ll have better health.

Now, FDTers’ reply will presumably be that what matters isn’t just leaving with the most utility on average. Fair enough. But then they can’t appeal to this criterion. They don’t do best by it. Which kind of utility you get the most of can’t straightforwardly tell you which decision theory is right, because the decision theories disagree about which kind of utility matters.

Fourth, this argument begs the question in a different way. Other theories make a distinction between the dispositions that are beneficial to have and the ones that are rational. For example, imagine that a highly reliable predictor checks to see if you’ll give in to blackmail for $100. If so, then he blackmails you. If not, he doesn’t. In this case, non-FDT views grant that it’s timelessly better to not give in to blackmail. They simply think that once you’re being blackmailed, the rational thing to do is to give in. At that time, you’re simply paying $100 to avoid having your life ruined.

Now, FDTers reject such a distinction. But we’ll need some argument against this distinction. Otherwise, this objection simply assumes that there’s no distinction between dispositions that are rational and ones that are beneficial. Non-FDTers have a perfectly sensible reply to this objection: in situations where you are directly rewarded for being irrational—for making some unwise decision—then of course the irrational people will be better off!

And non-FDTers have their own claim that their theory gets you the most utility. In, say, MacAskill’s bomb case, FDTers blow themselves up while CDTers and EDTers don’t. CDTers and EDTers thus leave with more utility when they’re in this situation.

Non-FDTers can grant: something FDTish might describe the kinds of dispositions you timelessly want to have, depending on how the world is. But that’s different from it being the right account of rationality. The dispositions that are beneficial aren’t necessarily the ones that are rational. Decision theories are theories of rationality, not of how to program an AI. If you are only interested in the question of how to program an AI, don’t purport to be giving a decision theory that is superior to the ones philosophers endorse.

6 Conclusion

FDT is both implausible and underbaked. It sometimes licenses setting yourself on fire for no benefit. It depends on analyzing how other algorithms would be different in the logically impossible world where your algorithm was different, but has no account of how to analyze logically impossible worlds, how to analyze what it means for your algorithm to be different, and how to analyze the impact that your algorithm being different has on other algorithms. This isn’t a minor technicality—it means that there is literally no situation where we can derive the correct answer from the theory.

Permit me to go slightly meta for a moment. Ideas like FDT are not unknown to academic philosophers. Various ideas in the vicinity have been proposed. Indeed, a view like FDT—where you one-box in Newcomb’s problem even if the boxes are transparent—is intuitive to a lot of people. But the view is pretty widely rejected because it doesn’t really hold up when you scrutinize it and filling in the details is very difficult. There’s a line by Scott Alexander that I sometimes think of:

My heuristic is that when the mainstream consensus refuses to engage with a critique and hem and haw about it being “problematic”, they are usually wrong. But when they explicitly declare “This is incorrect” and write papers explaining their reasoning, they are usually right.

The response from academic philosophers has been more in the direction of “write papers explaining their reasoning.” FDTers who think their theory is unfairly neglected by the experts need some explanation of why the academic philosophers who hear of FDT nearly always think it’s wrong.

Among laypeople who hear about decision theory, lots of them adopt something FDTish. So you need some explanation of why it is that nearly all the decision-theory experts—who write monstrously complicated papers with math that would go over your head—think FDT is wrong, but intro philosophy students who know almost nothing about decision theory think that it’s right. In general, you should be skeptical of views that are rejected by ~100% of relevant experts, even after considering them at length.


24 comments20 min readLW link

Rationality Functional Decision Theory

David Matolcsi 29 Jun 2026 17:42 UTC
7 points
5
1. Decision theory generally assumes that you’re self-interested. But if I’m the algorithm, then I care about algorithm me—not the version in the real world. So then I wouldn’t care about what the output of the algorithm was.
Yes, I think FDT (and every other decision rule) becomes kind of incoherent if you are trying to do indexical selfishness. But I’m not indexically selfish! If I’m in a simulation, I want the version of me outside the sim to be happy too, and in general for the world to be good. We both know that (under materialistic assumptions) personal identity is not well-defined, and selfishness is irrational. You even wrote a post about this! So I don’t think it’s FDT’s fault that it doesn’t work well under indexical selfishness.

I also find it weird that you stipulate that you can tell about yourself that you are conscious but the simulated being is not. If that’s the case, then there is a noticeable difference between you and the sim that influences your decisions, so the sim is not a good predictor of your behavior.
- Bentham's Bulldog 29 Jun 2026 19:13 UTC
  1 point
  0
  Parent
  Why does CDT become incoherent if you’re indexically selfish?
  Something can be a good predictor of you even if there is a noticeable difference between you and it. The standard Newcomb’s problem doesn’t assume that there’s a conscious organism running in the simulator’s mind.
J Bostock 29 Jun 2026 19:21 UTC
4 points
2
Second, FDT doesn’t always get you the most utility. For example, consider the following exotic possible world: the actual world. In this one, if you hang around academic philosophers, they will think you’re silly if you adopt FDT. This will make you sad. So adopting FDT gets you less utility. Additionally, in the actual world, I would get less utility if I were an FDTer, because I find it fun to argue with FDTers about decision-theory. Or imagine that the government passed a law where they tortured everyone who thought FDT was the right view. FDTers wouldn’t be better off.
What? This is a bad argument, because this doesn’t depend on the decision theory in question at all. For any decision theory XDT, it is possible to construct a world where Omega gives you one bazillion utilons if you don’t follow XDT, and murders you if you do follow XDT. This is part of why these problems are called “unfair”. The point of FDT is that for fair problems like Parfit’s hitchhiker (isomorphic to transparent Newcomb and similar to William MacAskill’s version of Bomb! but not as contrived) FDT wins.
I’m going to spend some time on Parfit’s hitchhiker because it illustrates the issue with EDT + CDT: they can’t commit to anything. I claim that lots of problems like Parfit’s hitchhiker come up in real life all the time. Blackmail is just evil Parfit’s hitchhiker. Lots of employment situations are Parfit’s hitchhiker: someone might hire you if and only if they think you’ll actually do the work (and have limited recourse to stop you if not). FDT is the only decision theoretical framework which lets you commit to anything at all.^[1]
Yes there are non-decision theoretic frameworks which do something like commitment (virtue-ish ethics or deontology) but these aren’t mathematically formulated.
As to your point that FDT isn’t well-defined mathematically yet… uhh… yeah, everyone knows that. That’s one of the main points of the logical uncertainty research agenda. That’s why thousands of keystrokes have been spilled over Löb’s theorem. There are lots of Löbian obstacles to get around when thinking about thinking. It’s possible that something like Logical Inductors (which can handle logical uncertainty) can solve FDT if given the right series of inputs, but I don’t know.
1. ^
  I’m aware that you endorse giving in to blackmail in extremis, which I disagree with as a general position (and I would definitely caution against posting it publicly on the internet).
- Bentham's Bulldog 29 Jun 2026 19:25 UTC
  1 point
  0
  Parent
  I address that. The point is, CDTers have a parallel claim to Newcomb’s problem being unfair.
  The point isn’t just that FDT isn’t well-defined mathematically. I explain in-principle reasons why it can’t be.
Canaletto 29 Jun 2026 17:49 UTC
3 points
0

Third, FDT isn’t actually the theory that leaves you with most expected utility on average. In fact, in many cases, it’s EDT (perhaps updateless) that leaves you with the most expected utility. For example, in the smoker’s lesion case, EDTers tend to finish better-off than other people. In smoker’s lesion, smoking correlates with worse health, but it doesn’t cause it. But EDTers are less likely to smoke, so on average they’ll have better health.

Uhh, suppose you get +1 utility for smoking and −10 utility if you get cancer. A population of EDT agents does not smoke at all, but let’s say 50% of them get cancer. All of the population of CDT agents smoke, and the same 50% of them get cancer.

Question: what population gets higher utility?
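
The arithmetic behind the question can be sketched as follows, taking the stipulated numbers at face value: +1 for smoking, −10 for cancer, and the same 50% cancer rate in both populations. The function and variable names are illustrative only.

```python
SMOKING_UTILITY = 1     # from the comment
CANCER_UTILITY = -10    # from the comment
CANCER_RATE = 0.5       # same rate in both populations, as stipulated


def average_utility(smokes: bool, cancer_rate: float = CANCER_RATE) -> float:
    """Average utility per agent in a population where everyone makes the same choice."""
    smoking_part = SMOKING_UTILITY if smokes else 0
    return smoking_part + cancer_rate * CANCER_UTILITY


edt_population = average_utility(smokes=False)  # nobody smokes
cdt_population = average_utility(smokes=True)   # everybody smokes
difference = cdt_population - edt_population    # equals the utility of smoking
```

With identical cancer rates, the only difference between the populations is the utility of smoking itself.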

EDIT okay, I think I violated the premise of the thought experiment. HMMM. You can postulate that the correlation and mechanism were just discovered; what would those agents do on the next time step? So, EDT agents would miss one smoking opportunity and then correct, as the correlation vanishes, or maybe they get stuck not smoking at all, if all of them are EDT.
- Bentham's Bulldog 29 Jun 2026 19:15 UTC
  1 point
  0
  Parent
  Yes that’s just a different thought experiment from the one I gave.
  - Canaletto 29 Jun 2026 19:27 UTC
    1 point
    0
    Parent
    But you do agree that in my corrected version, the EDT half of the agents miss one smoking opportunity?
    
    E.g. Suppose someone decided to collect statistics for the first time, and will do it again 10 years later. They discover that smoking correlates with cancer and moreover find out that it’s entirely through the lesion. For the next 10 years EDT agents do not smoke. All CDT agents do smoke. Then they collect the statistics again, and there is no correlation. EDT agents start smoking too, or what? So, they just missed some smoking time?
Gurkenglas 29 Jun 2026 17:18 UTC
3 points
0
The first calculator is like calculators we are used to. The second calculator is from a foreign land: it’s identical except that the numbers it outputs always come with a negative sign (‘–’) in front of them when you’d expect there to be none, and no negative sign when you expect there to be one. Are these calculators running the same algorithm or not?
Yes: Knowing what the one outputs helps you figure out what the other outputs.
“But wait, don’t we already know what output the other gives? It’s a calculator.”
Yes, this test admittedly only works for programs that are hard enough to run that you notice a shortcut.
Fortunately, the test only needs to work for finding instances of yourself, and you don’t yet know what you will do while you are calculating the consequences of your possible actions. So: In order to figure out whether to one-box, start with the assumption that running your sourcecode on your inputs outputs “one-box”, notice that you can now infer more about what you were predicted to do, act accordingly. Changing in what language you are being predicted does not break this.
- Bentham's Bulldog 29 Jun 2026 17:20 UTC
  1 point
  0
  Parent
  I explain at length why you can’t just rely on correlations between two algorithms wrt their outputs.
  - Gurkenglas 29 Jun 2026 17:37 UTC
    3 points
    0
    Parent
    I’m not trying to rely on statistical correlation. Suppose I’m in a Prisoner’s Dilemma with a twin, I’m sitting in a red room, he’s sitting in a blue room. I am implemented in C and he is implemented in c, which is the same programming language except all the lower- and uppercase characters have the opposite interpretation. In order to figure out whether to cooperate, I assume that FDT cooperates in my situation and see what I can deduce. Let us reason about the computation trace t by which FDT cooperated in my situation. (We can’t write out t fully, since then it would be longer than itself.) If in t we swap all occurrences of “blue” and “red”, it is still a legal computation trace; thus we can infer that FDT cooperates also in the situation where I am sitting in a blue room and my twin is sitting in a red room. Next, we can prove that translating a program between C and c never changes behavior; thus we can infer that FDT cooperates also in the situation where my twin is sitting in a blue room and I am sitting in a red room. Now we have shown that if FDT cooperates in my situation, we achieve mutual cooperation. Similarly, if FDT defects in my situation, we achieve mutual defection. Therefore, I cooperate.
Max H 29 Jun 2026 19:37 UTC
2 points
0
So you need some explanation of why it is that nearly all the decision-theory experts—who write monstrously complicated papers with math that would go over your head—think FDT is wrong, but intro philosophy students who know almost nothing about decision theory think that it’s right. In general, you should be skeptical of views that are rejected by ~100% of relevant experts, even after considering them at length.
This is an incorrect characterization of the field. If you define a “relevant expert” as an academic who publishes exclusively in academic philosophy journals (which are mostly read, at best, by other academics), then 15 years ago that might have been a valid (if lame) argument from authority / status. But these days, the most accomplished and high impact philosophers do at least some of their work in industry (e.g. frontier AI labs and non-university-backed research orgs), and a significant chunk of them accept some flavor of FDT as true.
StanislavKrym 29 Jun 2026 16:35 UTC
2 points
−1
I think that you severely misunderstood the way predictors work.
Suppose that you have an army of agents and that a pseudo-predictor runs experiments as follows. For any n, the n-th experiment is done as follows. The pseudo-predictor chooses a random number m and acts as if the predicted choice for the n-th agent is the choice of the m-th agent. Call the m-th agent the prototype of the n-th agent and the n-th agent a descendant of the m-th one. Also suppose that all the agents optimize for the total utility of the agents while knowing nothing about what number they are at. Even if the agents were causal decision theorists, they would know better than to violate the FDT’s predictions.
Indeed, suppose that the army of agents was put into the Newcomb’s box experiment. If the box is non-transparent, then the entire army has the same probability 1-p to one-box (and the same probability to become a descendant of a one-boxer). Then the army of agents receives $2000 for everyone whose prototype one-boxed and $1000 for everyone who two-boxed, expecting to earn $2000(1-p) + $1000p, meaning that p is to be set to zero.
Similarly, if the agents face a transparent Newcomb, then suppose that they have a probability q to one-box when faced with $2000 in box A and a probability r to one-box if they have $0 in box A. Then the solution stays the same: set q to 1 and r to 0 so that the Predictor wouldn’t become disappointed in the entire class of agents…
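
As a quick check of the opaque-box arithmetic above, here is a minimal sketch of the per-agent expected earnings $2000(1-p) + $1000p, reading p as the shared probability of two-boxing (which is what the formula requires). The function name and the grid of candidate values are illustrative assumptions.

```python
def expected_earnings(p_two_box: float) -> float:
    """Per-agent expected earnings from the comment's formula: 2000(1-p) + 1000p."""
    return 2000 * (1 - p_two_box) + 1000 * p_two_box


# Scan a coarse grid of shared two-boxing probabilities from 0.0 to 1.0.
grid = [i / 10 for i in range(11)]
best_p = max(grid, key=expected_earnings)
```

The expression is linear and decreasing in p, so the maximum sits at the endpoint p = 0, matching the comment's conclusion. As the reply below notes, this settles which disposition earns more on average, not whether that disposition coincides with the rational choice.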
- Bentham's Bulldog 29 Jun 2026 16:39 UTC
  3 points
  0
  Parent
  Not following what you think I’m misunderstanding or what this has to do with the things I say. I grant, of course, that the average returns are higher if you’re disposed to one-box in transparent Newcomb. So if you’re just looking at timelessly beneficial dispositions, those are the same across transparent and opaque Newcomb. But the entire dispute is about whether those perfectly match up with rational choice.
  (Let me know if I got your point right).
  - StanislavKrym 29 Jun 2026 16:45 UTC
    2 points
    0
    Parent
    But the entire dispute is about whether those perfectly match up with rational choice.
    It is a rational choice between what and what? Between honestly two-boxing, honestly one-boxing and fooling the Predictor into believing that you’ll one-box, then two-boxing? The problem’s point is that the Predictor isn’t THAT stupid
    - Bentham's Bulldog 29 Jun 2026 16:48 UTC
      1 point
      0
      Parent
      For what you should do in the problem. No one thinks you’ll fool the predictor. What we think is that it is irrational to take some action if, at the time you take it, you know it will simply result in you having less money than if you took a different action. We’re not under the illusion that such people on average leave with more money or that you should precommit to two-boxing or any such things.
      - StanislavKrym 29 Jun 2026 16:55 UTC
        2 points
        1
        Parent
        it is irrational to take some action if, at the time you take it, you know it will simply result in you having less money than if you took a different action
        Suppose, alternatively, that you and another person (or even an optimizer for entirely different values) are both given a chance to pay $1 so that the counterpart received $2. Does it become rational to pay or not to pay?
        Bentham's Bulldog 29 Jun 2026 17:05 UTC
        1 point
        0
        Parent
        I lean towards thinking you shouldn’t pay, but I’m somewhat uncertain about that. I lean causal, but am sympathetic both to EDT and some third undiscovered theory.
programjames 29 Jun 2026 17:31 UTC
−1 points
0
To give a meta summary of the problems in this essay: the author does not define their terms, runs rampant with them, and is then shocked when they run into contradictory intuitions.
Here FDT’s answer is that you should two-box in the first case but not in the second case.
No, it says to two-box in both cases for exactly that axiom of extensionality (“equivalent prediction principle”).
No fact about whether two algorithms are the same
This entire objection is a failure to recognize that 0% and 100% are not probabilities. Rather than saying, “are these the exact same?” which is always impossible to verify to 100% confidence, not just with algorithms, you should ask, “how similar are these policies?” The goal of functional decision theory is to maximize the utility of all agents with similar policies to your own, weighted by exp(-KL(their policy||your policy)). So, for example, your twin that presents exactly the same except on Newcomb’s problem has the maximally distant policy to your own when you run into Newcomb.
The calculator objection is made up. You define your terms and then claim they are not defined! You gave us the isomorphism between the calculator’s algorithm and the neg-calculator’s algorithm. Under the axiom of extensionality, these are the same algorithms. Unless you want to define “algorithm” differently, which involves the minus sign. In which case, they’re slightly different algorithms (see the previous paragraph).
Determining how one algorithm being different would affect another algorithm being different, without depending on the epistemic probability of the second being different if the first was.
Is being exponentially accurate in polynomial time good enough? If so, just use annealing to find the trembling-hand equilibria. If not, you’re asking to solve P=NP.
The predictor has a failure rate of only 1 in a trillion trillion.
How? A rational decision-maker chooses mixed policies (for that entropy bonus), and even when there is certain, painful death on the table, they will almost certainly choose it more than 1 in a trillion trillion times. Even an irrational actor will simply mess up and walk in the wrong direction more often than 1 in a trillion trillion. If they claim they can predict my randomness ahead of time, they are lying. As a good functional decision theorist, I use a quantum random number generator (e.g. stare at a lightbulb while thinking) to prevent my randomness being hacked like that.
The simulation is highly correlated with you, so his guesses about whether you’ll cut off your leg for no benefit are 99.9% accurate.
The issue with most of these scenarios is you are unclear on what you mean by 99.9% accurate. Is this epistemic or aleatoric uncertainty? If it is epistemic, Newcomb is not powerful enough to change his decision based on your algorithm. Your algorithm should be: pretend to be a one-boxer (or leg-cutter) up until actually put in that scenario. If he gives you tests before putting you in the scenario, well an EDT or CDT would certainly do their best to pass the tests as well for the future utility.

If it is only the aleatoric uncertainty in your policy, we have set up as an axiom that Newcomb knows your policy. Then you object, “why not just change the policy?” But you literally just made a stipulation that you can’t! It’s exactly the failure most people make with the Grandfather’s Paradox. What seems to be possible is physically impossible when you impose axiomatic restrictions.
- Gurkenglas 29 Jun 2026 17:47 UTC
  2 points
  0
  Parent
  The goal of functional decision theory is to maximize the utility of all agents with similar policies to your own, weighted by exp(-KL(their policy||your policy)).
  Huh, may I have a source on this? I thought you could point FDT at maximizing any utility function you like.
  - programjames 29 Jun 2026 18:12 UTC
    1 point
    0
    Parent
    The issue with saying, “this agent,” is you do not actually know its policy. The best anyone can do is generate all programs that output the seen distribution of actions, using error-correction codes for nondeterministic policies. Now you have many theories of varying description lengths for the agent, which you weight according to the Solomonoff prior. We can always describe another agent’s policy with a fixed KL(their policy||your policy) extra error-correction bits, so the utils under a given theory are
    
    sum_{policy} exp(-|theory| - KL(policy||your policy)) utils(policy)
    
    and the total utils are
    
    sum_{theory} sum_{policy} … = constant * sum_{policy} exp(-KL(policy||your policy)) utils(policy)
    - Gurkenglas 29 Jun 2026 18:21 UTC
      2 points
      0
      Parent
      using error-correction codes for nondeterministic policies
      I assume you mean Arithmetic coding.
      Why do you need to know the policy in order to figure out the utility function? I thought you could point FDT at, like, maximizing Chaitin’s constant. I am hoping to look at whatever reference document you are getting your definitions from, is there no such thing?
      - programjames 29 Jun 2026 18:38 UTC
        1 point
        0
        Parent
        There is no such thing.
- Bentham's Bulldog 29 Jun 2026 19:23 UTC
  1 point
  0
  Parent
  There’s a lot of snark here but it’s all incorrect.
  //No, it says to two-box in both cases for exactly that axiom of extensionality (“equivalent prediction principle”).//
  I explain why this isn’t right. Only one algorithm is dependent on yours. The other is just correlated.
  //This entire objection is a failure to recognize that 0% and 100% are not probabilities. Rather than saying, “are these the exact same?” which is always impossible to verify to 100% confidence, not just with algorithms, you should ask, “how similar are these policies?”//
  This is wrong on a number of counts. First of all, measures of similarities are not the same as probabilities. So this doesn’t require any claim about similarity. Note that FDT wants to say that if you’re in prisoner’s dilemma against a perfect twin, your actions are correlated 100% with what they do (even if you don’t have a credence of 1 that that is so). Second, as I explain, measuring similarity is even more difficult.
  Re calculator, whether they output the same thing depends on how you interpret their outputs, as explained in the post.
  Re being able to determine what’s true of one algorithm from the other, that’s just looking at correlation which can’t be the relevant notion for the reasons I explain.
  We imagine an agent who never messes up and accidentally picks the wrong one. By the lights of FDT, you get more expected utility timelessly if you’re always disposed not to pay. And it’s a stipulation of the thought experiment that the predictor is reliable; it doesn’t matter whether this could exist in the real world.
  99.9% accurate in the sense that 99.9% of the time, the predictor guesses right. We can additionally imagine that his decision depends on what you do at the last moment, not just on what you’re pretending to do until then.
Bentham's Bulldog 29 Jun 2026 16:31 UTC
−2 points
2
Oh commenters who endorse FDT, how I yearn for thee.