Wei Dai
But, the gist of your post seems to be: “Since coming up with UDT, we ran into these problems, made no progress, and are apparently at a dead end. Therefore, UDT might have been the wrong turn entirely.”
This is a bit stronger than how I would phrase it, but basically yes.
On the other hand, my view is: Since coming up with those problems, we made a lot of progress on agent theory within the LTA
I tend to be pretty skeptical of new ideas. (This backfired spectacularly once, when I didn’t pay much attention to Satoshi when he contacted me about Bitcoin, but I think in general it has served me well.) My experience with philosophical questions is that even when some approach looks a stone’s throw away from a final solution to some problem, a bunch of new problems pop up and show that we’re still quite far away. With an approach that is still as early as yours, I just think there’s quite a good chance it doesn’t work out in the end, or gets stuck somewhere on a hard problem. (Also, some people who have dug into the details don’t seem as optimistic that it is the right approach.) So I’m reluctant to decrease my probability of “UDT was a wrong turn” too much based on it.
The rest of your discussion about 2TDT-1CDT seems plausible to me, although of course it depends on whether the math works out, on doing something about monotonicity, and also on a solution to the problem of how to choose one’s IBH prior. (If the solution were something like “it’s subjective/arbitrary”, that would be pretty unsatisfying from my perspective.)
See this comment and the post that it’s replying to.
Do you think part of it might be that even people with graduate philosophy educations are too prone to being wedded to their own ideas, or don’t like to poke holes in them as much as they should? Because part of what contributes to my wanting to go more meta is being dissatisfied with my own object-level solutions and finding more and more open problems that I don’t know how to solve. I haven’t read much academic philosophy literature, but I did read some of the anthropic reasoning and decision theory literature earlier, and the impression I got is that most of the authors weren’t trying that hard to poke holes in their own ideas.
I don’t understand your ideas in detail (am interested but don’t have the time/ability/inclination to dig into the mathematical details), but from the informal writeups/reviews/critiques I’ve seen of your overall approach, as well as my sense from reading this comment of how far away you are from a full solution to the problems I listed in the OP, I’m still comfortable sticking with “most are wide open”. :)
On the object level, maybe we can just focus on Problem 4 for now. What do you think actually happens in a 2IBH-1CDT game? Presumably CDT still plays D, and what do the IBH agents do? And how does that imply that the puzzle is resolved?
As a reminder, the puzzle I see is that this problem shows that a CDT agent doesn’t necessarily want to become more UDT-like, and for seemingly good reason, so on what basis can we say that UDT is a clear advancement in decision theory? If CDT agents similarly don’t want to become more IBH-like, isn’t there the same puzzle? (Or do they?) This seems different from the playing chicken with a rock example, because a rock is not a decision theory so that example doesn’t seem to offer the same puzzle.
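To make the payoff asymmetry concrete, here’s a minimal sketch of the 2TDT-1CDT setup as I understand it (my own toy numbers: standard PD payoffs T=5, R=3, P=1, S=0, uniform random pairing, and the assumption that the TDT agents cooperate because they can’t tell which opponent they’re facing):

```python
# Toy model of 2TDT-1CDT: standard one-shot PD payoffs, uniform random pairing.
# PAYOFF[(my_move, their_move)] is my payoff.
PAYOFF = {("C", "C"): 3, ("C", "D"): 0, ("D", "C"): 5, ("D", "D"): 1}

def expected_payoff(my_move, opponent_moves):
    """Average payoff for my_move against a uniformly random opponent move."""
    return sum(PAYOFF[(my_move, m)] for m in opponent_moves) / len(opponent_moves)

# Each TDT agent's opponent is the other TDT agent (who cooperates) or the
# CDT agent (who defects), with equal probability.
tdt_payoff = expected_payoff("C", ["C", "D"])   # 0.5*3 + 0.5*0 = 1.5
# The CDT agent's opponent is always one of the (cooperating) TDT agents.
cdt_payoff = expected_payoff("D", ["C", "C"])   # 5.0

print(tdt_payoff, cdt_payoff)  # 1.5 5.0
```

Cooperating is still the TDT agents’ best option given their correlation with each other (jointly defecting would drop them to 1), yet the CDT agent free-rides on it and does strictly better, which is the advantage I’m pointing at.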
ETA: Oh, I think you’re saying that the CDT agent could turn into an IBH agent but with a different prior from the other IBH agents, one that ends up allowing it to still play D while the other two still play C, so it’s not made worse off by switching to IBH. Can you walk this through in more detail? How does the CDT agent choose what prior to use when switching to IBH, and how do the different priors actually imply a CCD outcome in the end?
I think I kind of get what you’re saying, but it doesn’t seem right to model TDT as caring about all other TDT agents, as they would exploit other TDT agents if they could do so without negative consequences to themselves, e.g., if a TDT AI were in a one-shot game where it unilaterally decides whether or not to attack and take over another TDT AI.
Maybe you could argue that the TDT agent would refrain from doing this because of considerations like its decision to attack being correlated with other AIs’ decisions to potentially attack it in other situations/universes, but that’s still not the same as caring about other TDT agents. I mean, the chains of reasoning/computation you would go through in the two cases seem very different.
Also it’s not clear to me what implications your idea has even if it was correct, like what does it suggest about what the right decision theory is?
BTW do you have any thoughts on Vanessa Kosoy’s decision theory ideas?
I’m not aware of good reasons to think that it’s wrong, it’s more that I’m just not sure it’s the right approach. I mean we can say that it’s a matter of preferences, problem solved, but unless we can also show that we should be anti-realist about these preferences, or what the right preferences are, the problem isn’t really solved. Until we do have a definitive full solution, it seems hard to be confident that any particular approach is the right one.
It seems plausible that treating anthropic reasoning as a matter of preferences makes it harder to fully solve the problem. I wrote “In general, Updateless Decision Theory converts anthropic reasoning problems into ethical problems.” in the linked post, but we don’t have a great track record of solving ethical problems...
Even items 1, 3, 4, and 6 are covered by your research agenda? If so, can you quickly sketch what you expect the solutions to look like?
The general hope is that slight differences in source code (or even large differences, as long as they’re all using UDT or something close to it) wouldn’t be enough to make a UDT agent defect against another UDT agent (i.e., the logical correlation between their decisions would be high enough); otherwise “UDT agents cooperate with each other in one-shot PD” would be false or have few practical implications, since why would all UDT agents have the exact same source code?
I’m not sure why you say “if the UDT agents could change their own code (silently) cooperation would immediately break down”, because in my view a UDT agent would reason that if it changed its code (to something like CDT, for example), that would logically imply other UDT agents also changing their code in the same way, so the expected utility of changing its code would be evaluated as lower than that of not changing it. So it would remain a UDT agent and still cooperate with other UDT agents, or whenever the probability of the other agent being UDT is high enough.
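As a toy illustration of how I’d expect that expected-utility comparison to go (my own sketch, assuming standard PD payoffs, a single opponent that is another UDT agent, and perfect logical correlation between their code changes):

```python
# PAYOFF[(my_move, their_move)] is my payoff in a one-shot PD.
PAYOFF = {("C", "C"): 3, ("C", "D"): 0, ("D", "C"): 5, ("D", "D"): 1}

def eu_of_silently_switching_to_cdt(correlated):
    """Expected utility a UDT agent assigns to switching its code to CDT."""
    my_move = "D"  # a CDT agent defects in a one-shot PD
    # On UDT's view, the switch is a fact about the shared algorithm, so the
    # other agent's code (and hence its move) changes along with mine.
    their_move = "D" if correlated else "C"
    return PAYOFF[(my_move, their_move)]

eu_stay_udt = PAYOFF[("C", "C")]                           # 3: both stay UDT and cooperate
print(eu_of_silently_switching_to_cdt(True), eu_stay_udt)  # 1 vs 3: don't switch
# Only by treating its own switch as uncorrelated (which UDT rejects) would
# the agent compute 5 and conclude that switching pays.
print(eu_of_silently_switching_to_cdt(False))              # 5
```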
To me this example is about a CDT agent not wanting to become UDT-like if it found itself in a situation with many other UDT agents, which just seems puzzling if your previous perspective was that UDT is a clear advancement in decision theory and everyone should adopt UDT or become more UDT-like.
Nobody is being tricked though. Everyone knows there’s a CDT agent among the population, just not who, and we can assume they have a correct amount of uncertainty about what the other agent’s decision theory / source code is. The CDT agent still has an advantage in that case. And it is a problem because it means CDT agents don’t always want to become more UDT-like (it seems like there are natural or at least not completely contrived situations, like Omega punishing UDT agents just for using UDT, where they don’t), which takes away a major argument in its favor.
But the situation isn’t symmetrical, meaning if you reversed the setup to have 2 CDT agents and 1 TDT agent, the TDT agent doesn’t do better than the CDT agents, so it does seem like the puzzle has something to do with decision theory, and is not just about smaller vs larger groups? (Sorry, I may be missing your point.)
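To spell out the asymmetry I have in mind, here’s a quick check under the same kind of toy assumptions (standard PD payoffs, uniform random pairing, no correlation for the lone TDT agent to exploit):

```python
# PAYOFF[(my_move, their_move)] is my payoff in a one-shot PD.
PAYOFF = {("C", "C"): 3, ("C", "D"): 0, ("D", "C"): 5, ("D", "D"): 1}

# Reversed setup: 2 CDT agents and 1 TDT agent. The lone TDT agent has no
# correlated partner, so it simply compares its own options against defectors.
tdt_if_cooperate = PAYOFF[("C", "D")]   # 0 (sucker's payoff)
tdt_if_defect = PAYOFF[("D", "D")]      # 1, so it defects too
# Each CDT agent faces either the other CDT agent or the (now defecting)
# TDT agent, and gets the mutual-defection payoff either way.
cdt_payoff = PAYOFF[("D", "D")]         # 1

print(tdt_if_defect, cdt_payoff)        # 1 1: the TDT agent gains nothing
```

So reversing the populations doesn’t hand the lone TDT agent the kind of advantage the lone CDT agent enjoys, which is why this looks like a fact about the decision theories rather than just about group sizes.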
I feel like MIRI perhaps mispositioned FDT (their variant of UDT) as a clear advancement in decision theory
On second thought this is probably not fair to MIRI, since I don’t think I objected to such positioning when they sent paper drafts for me to review. I guess in the early days UDT did look more like a clear advancement, because it seemed to elegantly solve several problems at once: anthropic reasoning (my original reason to start thinking in the “updateless” direction), counterfactual mugging, cooperation with a psychological twin or an agent using the same decision theory, and Newcomb’s problem. It also wasn’t yet known that the open problems would remain open for so long.
UDT shows that decision theory is more puzzling than ever
(I’ll be using “UDT” below, but I think the same issue applies to all subsequent variants, such as FDT, that kept the “updateless” feature.)
I think this is a fair point. It’s not the only difference between CDT and UDT but does seem to account for why many people find UDT counterintuitive. I made a similar point in this comment. I do disagree with “As such the debate over which is more “rational” mostly comes down to a semantic dispute.” though. There are definitely some substantial issues here.
(A nit first: it’s not that UDT must value all copies of oneself equally, but that it’s incompatible with indexical values. You can have a UDT utility function that values different copies differently; it just has to be fixed for all time instead of changing based on what you observe.)
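A toy example of the distinction (numbers mine): a UDT agent could have, from the start,

$$U = 0.9\,u(\text{copy in room A}) + 0.1\,u(\text{copy in room B}),$$

which weights the copies unequally and is perfectly compatible with UDT; what it can’t do is re-weight toward “whichever copy I observe myself to be” once an observation comes in.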
I think humans do seem to have indexical values, but what to do about it is a big open problem in decision theory. “Just use CDT” is unsatisfactory because as soon as someone could self-modify, they would have an incentive to modify themselves to no longer use CDT (and no longer have indexical values). I’m not sure what further implications that has, though. (See the above-linked post, where I talked about this puzzle in a bit more detail.)
Thanks for this clear explanation of conceptual analysis. I’ve been wanting to ask some questions about this line of thought:
Where do semantic intuitions come from?
What should we do when different people have different such intuitions? For example you must know that Newcomb’s problem is famously divisive, with roughly half of philosophers preferring one-boxing and half preferring two-boxing. Similarly for trolley thought experiments, intuitions about the nature of morality (metaethics), etc.
How do we make sure that AI has the right intuitions? Maybe in some cases we can just have it learn from humans, but what about:
Cases where humans disagree.
Cases where all/most humans are wrong. (In other words, can we build AIs that have better intuitions than humans?) Or is that not a thing in conceptual analysis, i.e., semantic intuitions can’t be wrong?
Completely novel philosophical questions or situations where AI can’t learn from humans (because humans don’t have intuitions about it either, or AI has to make time sensitive decisions and humans are too slow).
I’m not sure I understand why it would be bad if it actually is a solution. If we do, great, p(doom) drops because now we are much closer to making aligned systems that can help us grow the economy, do science, stabilize society etc. Though of course this moves us into a “misuse risk” paradigm, which is also extremely dangerous.
I prefer to frame it as human-AI safety problems instead of “misuse risk”, but the point is this: if we’re trying to buy time partly in order to have more time to solve misuse/human-safety problems (e.g., by improving coordination/epistemology or solving metaphilosophy), and the strategy for buying time only achieves a pause until alignment is solved, then the earlier alignment is solved, the less time we have to work on misuse/human-safety.
which is bottlenecked by us running out of time, hence why I think the pragmatic strategic choice is to try to buy us more time.
What are you proposing or planning to do to achieve this? I observe that most current attempts to “buy time” seem organized around convincing people that AI deception/takeover is a big risk and that we should pause or slow down AI development or deployment until that problem is solved, for example via intent alignment. But what happens if AI deception then gets solved relatively quickly (or someone comes up with a proposed solution that looks good enough to decision makers)? And this is another way that working on alignment could be harmful from my perspective...
@jessicata @Connor Leahy @Domenic @Daniel Kokotajlo @romeostevensit @Vanessa Kosoy @cousin_it @ShardPhoenix @Mitchell_Porter @Lukas_Gloor (and others, apparently I can only notify 10 people by mentioning them in a comment)
Sorry if I’m late in responding to your comments. This post has gotten more attention and replies than I expected, in many different directions, and it will probably take a while for me to process and reply to them all. (In the meantime, I’d love to see more people discuss each other’s ideas here.)
Do you have any examples that could illustrate your theory?
It doesn’t seem to fit my own experience. I became interested in Bayesian probability, the universal prior, the Tegmark multiverse, and anthropic reasoning during college, and started thinking about decision theory and the ideas that ultimately led to UDT, but what heuristics could I have been applying, learned from what “domains with feedback”?
Maybe I used a heuristic like “computer science is cool, let’s try to apply it to philosophical problems”, but if the heuristics are this coarse-grained, it doesn’t seem like the idea can explain how detailed philosophical reasoning happens, or be used to ensure AI philosophical competence?
Thanks, I’ve set a reminder to attend your talk. In case I miss it, can you please record it and post a link here?