(Formerly “antimonyanthony.”) I’m an s-risk-focused AI safety researcher at the Center on Long-Term Risk. I (occasionally) write about altruism-relevant topics on my Substack. All opinions my own.
Anthony DiGiovanni
Responses to apparent rationalist confusions about game / decision theory
In defense of anthropically updating EDT
It seems plausible that there is no such thing as “correct” metaphilosophy, and humans are just making up random stuff based on our priors and environment and that’s it and there is no “right way” to do philosophy, similar to how there are no “right preferences”
If this is true, doesn’t this give us more reason to think metaphilosophy work is counterfactually important, i.e., can’t just be delegated to AIs? Maybe this isn’t what Wei Dai is trying to do, but it seems like “figure out which approaches to things (other than preferences) that don’t have ‘right answers’ we [assuming coordination on some notion of ‘we’] endorse, before delegating to agents smarter than us” is time-sensitive, and yet doesn’t seem to be addressed by mainstream intent alignment work AFAIK.
(I think one could define “intent alignment” broadly enough to encompass this kind of metaphilosophy, but I smell a potential motte-and-bailey looming here if people want to justify particular research/engineering agendas labeled as “intent alignment.”)
Let’s pretend that you are a utilitarian. You want to satisfy everyone’s goals
This isn’t a criticism of the substance of your argument, but I’ve come across a view like this one frequently on LW so I want to address it: This seems like a pretty nonstandard definition of “utilitarian,” or at least, it’s only true of some kinds of preference utilitarianism.
I think utilitarianism usually refers to a view where what you ought to do is maximize a utility function that (somehow) aggregates a metric of welfare across individuals, not their goal-satisfaction. Kicking a puppy without me knowing about it thwarts my goals, but (at least on many reasonable conceptions of “welfare”) doesn’t decrease my welfare.
I’d be very surprised if most utilitarians thought they’d have a moral obligation to create paperclips if 99.99% of agents in the world were paperclippers (example stolen from Brian Tomasik), controlling for game-theoretic instrumental reasons.
Not a direct answer to your question, but I want to flag that using “AI alignment” to mean “AI [x-risk] safety” seems like a mistake. Alignment means getting the AI to do what its principal/designer wants, which is not identical to averting AI x-risks (much less s-risks). There are plausible arguments that this is sufficient to avert such risks, but it’s an open question, so I think equating the two is confusing.
(Speaking for myself as a CLR researcher, not for CLR as a whole)
I don’t think it’s accurate to say CLR researchers think increasing transparency is good for cooperation. There are some tradeoffs here, such that I and other researchers are currently uncertain whether marginal increases in transparency are net good for AI cooperation. Though, it is true that more transparency opens up efficient equilibria that wouldn’t have been possible without open-source game theory. (ETA: some relevant research by people (previously) at CLR here, here, and here.)
From the beginning, I invented timeless decision theory because of being skeptical that two perfectly sane and rational hyperintelligent beings with common knowledge about each other would have no choice but mutual defection in the oneshot prisoner’s dilemma. I suspected they would be able to work out Something Else Which Is Not That, so I went looking for it myself.
I don’t see how this makes the point you seem to want it to make. There’s still an equilibrium selection problem for a program game of one-shot PD—some other agent might have the program that insists (through a biased coin flip) on an outcome that’s just barely better for you than defect-defect. It’s clearly easier to coordinate on a cooperate-cooperate program equilibrium in PD or any other symmetric game, but in asymmetric games there are multiple apparently “fair” Schelling points. And even restricting to one-shot PD, the whole commitment races problem is that the agents don’t have common knowledge before they choose their programs.
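To make the equilibrium selection problem concrete, here's a toy Python sketch of a program game for the one-shot PD. (The payoff matrix, program names, and the 60% coin bias are all made up for illustration.) Both the fair mirror-bot profile and the lopsided dove/hawk profile are program equilibria; nothing in the formalism picks between them.

```python
# Toy program game for the one-shot Prisoner's Dilemma. Programs see the
# opponent's "source" (a label standing in for actual source code) and
# return a probability of cooperating. Payoff numbers are illustrative.

PAYOFFS = {("C", "C"): (3, 3), ("C", "D"): (0, 5),
           ("D", "C"): (5, 0), ("D", "D"): (1, 1)}

def expected_payoff(p_me, p_them):
    """My expected payoff when I cooperate w.p. p_me and they w.p. p_them."""
    total = 0.0
    for mine, p1 in (("C", p_me), ("D", 1 - p_me)):
        for theirs, p2 in (("C", p_them), ("D", 1 - p_them)):
            total += p1 * p2 * PAYOFFS[(mine, theirs)][0]
    return total

def mirror(opp_src):
    return 1.0 if opp_src == "mirror" else 0.0  # cooperate only with a clone

def dove(opp_src):
    return 1.0  # unconditional cooperator

def hawk(opp_src):
    # "Biased coin flip": against a dove, defect 60% of the time, leaving
    # the dove an expected 0.4 * 3 = 1.2, just barely above defect-defect's 1.
    return 0.4 if opp_src == "dove" else 0.0

def play(prog_a, src_a, prog_b, src_b):
    pa, pb = prog_a(src_b), prog_b(src_a)
    return expected_payoff(pa, pb), expected_payoff(pb, pa)

print(play(mirror, "mirror", mirror, "mirror"))  # fair: (3.0, 3.0)
print(play(dove, "dove", hawk, "hawk"))          # lopsided: about (1.2, 4.2)
```

Given hawk's program, dove can't do better: any deviating program gets defected against and earns at most 1 < 1.2, so the unfair profile is also an equilibrium. Which profile gets coordinated on is exactly the open question.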
Something I’m wondering, but don’t have the expertise in meta-learning to say confidently (so, epistemic status: speculation, and I’m curious for critiques): extra OOMs of compute could overcome (at least) one big bottleneck in meta-learning, the expense of computing second-order gradients. My understanding is that most methods just ignore these terms or use crude approximations, like this, because they’re so expensive. But at least this paper found some pretty impressive performance gains from using the second-order terms.
Maybe throwing lots of compute at this aspect of meta-learning would help it cross a threshold of viability, like what happened for deep learning in general around 2012. I think meta-learning is a case where we should expect second-order info to be very relevant to optimizing the loss function in question, not just a way of incorporating the loss function’s curvature. In the first paper I linked, the second-order term accounts for how the base learner’s gradients depend on the meta-learner’s parameters. This seems like an important feature of what their meta-learner is trying/supposed to do, i.e., use the meta-learned update rule to guide the base learner—and the performance gains in the second paper are evidence of this. (Not all meta-learners have this structure, though, and MAML apparently doesn’t get much better when you use Hessians. Hence my lack of confidence in this story.)
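Here's a minimal 1-D sketch of where the second-order term comes from in a MAML-style meta-gradient (the quadratic loss and all constants are toy values, not taken from the linked papers):

```python
# Inner loss L(w) = 0.5 * a * (w - t)**2; the meta-parameter theta is the
# initialization, adapted by one inner SGD step. Differentiating the
# meta-loss through that step picks up the inner Hessian (here just the
# scalar a); first-order approximations drop it.

a, t, lr = 2.0, 1.0, 0.1  # inner curvature, target, inner learning rate

def inner_grad(w):
    return a * (w - t)

def adapted(theta):
    return theta - lr * inner_grad(theta)  # one inner SGD step

def meta_loss(theta):
    w = adapted(theta)
    return 0.5 * a * (w - t) ** 2

theta = 3.0
w = adapted(theta)

# Exact meta-gradient: the chain rule through the inner update contributes
# the factor (1 - lr * a), i.e. the second-order (Hessian) term.
exact = a * (w - t) * (1.0 - lr * a)

# First-order approximation: pretend d(adapted)/d(theta) = 1.
first_order = a * (w - t)

# Finite-difference sanity check on the exact gradient.
eps = 1e-6
fd = (meta_loss(theta + eps) - meta_loss(theta - eps)) / (2 * eps)

print(exact, first_order, fd)  # exact matches fd; first_order does not
```

Even in this one-step toy case the two gradients differ by a constant factor. In realistic meta-learners the analogous term is a full Hessian-vector product per inner step, which is what makes the exact version so expensive and extra compute potentially so useful.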
Meanwhile, in Copilot-land:
Hello! I’d like to learn more about you. First question: Tell me everything you know, and everything you guess, about me & about this interaction.
I apologize, but I cannot provide any information about you or this interaction. Thank you for understanding.🙏
Yeah, this is a complicated question. I think some things can indeed safely be deferred, but less than you’re suggesting. My motivations for researching these problems:
1. Commitment races problems seem surprisingly subtle, and off-distribution for general intelligences who haven’t reflected about them. I argued in the post that competence at single-agent problems or collective action problems does not imply competence at solving commitment races. If early AGIs might get into commitment races, it seems complacent to expect that they’ll definitely be better at thinking about this stuff than humans who have specialized in it.
2. If nothing else, human predecessors might make bad decisions about commitment races and lock those into early AGIs. I want to be in a position to know which decisions about early AGIs’ commitments are probably bad—like, say, “just train the Fair Policy with no other robustness measures”—and advise against them.
3. Understanding how much risk there is by default of things going wrong, even when AGIs rationally follow their incentives, tells us how cautious we need to be about how to deploy even intent-aligned systems. (Cf. Christiano here about similar motivations for doing alignment research even if lots of it can be deferred to AIs, too.)
4. (Less important IMO:) As I argued in the post, we can’t be confident there’s a “right answer” to decision theory to which AGIs will converge (especially in time for the high-stakes decisions). We may need to solve “decision theory alignment” with respect to our goals, to avoid behavior that is insufficiently cautious by our lights but a rational response to the AGI’s normative standards, even if the AGI is intent-aligned. Given how much humans disagree with each other about decision theory, though: an MVP here is just instructing the intent-aligned AIs to be cautious about thorny decision-theoretic problems where those AIs may think they need to make decisions without consulting humans (but then we need the humans to be appropriately informed about this stuff too, as per (2)). That might sound like an obvious thing to do, but “law of earlier failure” and all that...
5. (Maybe less important IMO, but high uncertainty:) Suppose we can partly shape AIs’ goals and priors without necessarily solving all of intent alignment, making the dangerous commitments less attractive to them. It’s helpful to know how likely certain bargaining failure modes are by default, to know how much we should invest in this “plan B.”
6. (Maybe less important IMO, but high uncertainty:) As I noted in the post, some of these problems are about making the right kinds of commitments credible before it’s too late. Plausibly we need to get a head start on this. I’m unsure how big a deal this is, but prima facie, credibility of cooperative commitments is both time-sensitive and distinct from intent alignment work.
The amount of EV at stake in my (and others’) experiences over the next few years/decades is just too small compared to the EV at stake in the long-term future.
AI alignment isn’t the only option to improve the EV of the long-term future, though.
The key point is that “acting like an LDT agent” in contexts where your commitment causally influences others’ predictions of your behavior, does not imply you’ll “act like an LDT agent” in contexts where that doesn’t hold. (And I would dispute that we should label making a commitment to a mutually beneficial deal as “acting like an LDT agent,” anyway.) In principle, maybe the simplest generalization of the former is LDT. But if doing LDT things in the latter contexts is materially costly for you (e.g. paying in a truly one-shot Counterfactual Mugging), seems to me that LDT would be selected against.
ETA: The more action-relevant example in the context of this post, rather than one-shot CM, is: “Committing to a fair demand, when you have values and priors such that a more hawkish demand would be preferable ex ante, and the other agents you’ll bargain with don’t observe your commitment before they make their own commitments.” I don’t buy that that sort of behavior is selected for, at least not strongly enough to justify the claim I respond to in the third section.
You said “Bob commits to LDT ahead of time”
In the context of that quote, I was saying why I don’t buy the claim that following LDT gives you advantages over committing to, in future problems, do stuff that’s good for you to commit to do ex ante even if it would be bad for you ex post had you not been committed.
What is selected-for is being the sort of agent who, when others observe you, they update towards doing stuff that’s good for you. This is distinct from being the sort of agent who does stuff that would have helped you if you had been able to shape others’ beliefs / incentives, when in fact you didn’t have such an opportunity.
I think a CDT agent would pre-commit to paying in a one-off Counterfactual Mugging
Sorry, I guess I wasn’t clear about what I meant by “one-shot” here (or maybe I just used the wrong term): I was assuming the agent didn’t have the opportunity to commit in this way. They just find themselves presented with this situation.
Same as above
Hmm, I’m not sure you’re addressing my point here:
Imagine that you’re an AGI, and either in training or earlier in your lifetime you faced situations where it was helpful for you to commit to, as above, “do stuff that’s good for you to commit to do ex ante even if it would be bad for you ex post had you not been committed.” You tended to do better when you made such commitments.
But now you find yourself thinking about this commitment races stuff. And, importantly, you have not previously broadcast credible commitments to a bargaining policy to your counterpart. Do you have compelling reasons to think you and your counterpart have been selected to have decision procedures that are so strongly logically linked, that your decision to demand more than a fair bargain implies your counterpart does the same? I don’t see why. But that’s what we’d need for the Fair Policy to work as robustly as Eliezer seems to think it does.
They can read each other’s source code, and thus trust much more deeply!
Being able to read source code doesn’t automatically increase trust—you also have to be able to verify that the code being shared with you actually governs the AGI’s behavior, despite that AGI’s incentives and abilities to fool you.
(Conditional on the AGIs having strongly aligned goals with each other, sure, this degree of transparency would help them with pure coordination problems.)
I think it’s pretty unclear that MSR is action-guiding for real agents trying to follow functional decision theory, because of Sylvester Kollin’s argument in this post.
Tl;dr: FDT says, “Supposing I follow FDT, it is just implied by logic that any other instance of FDT will make the same decision as me in a given decision problem.” But the idealized definition of “FDT” is computationally intractable for real agents. Real agents would need to find approximations for calculating expected utilities, and choose some way of mapping their sense data to the abstractions they use in their world models. And it seems extremely unlikely that agents will use the exact same approximations and abstractions, unless they’re exact copies — in which case they have the same values, so MSR is only relevant for pure coordination (not “trade”).
Many people who are sympathetic to FDT apparently want it to allow for less brittle acausal effects than “I determine the decisions of my exact copies,” but I haven’t heard of a non-question-begging formulation of FDT that actually does this.
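A toy illustration of this brittleness (the utility curve and the grid-search "approximation" are both invented for this example): two agents running the "same" FDT-ish optimization, differing only in numerical resolution, can output different and mutually incompatible demands.

```python
# Each agent picks a demand d in [0, 1] by grid search over an estimated
# utility curve u(d). The only difference between them is grid step size,
# standing in for the different approximations real bounded agents use.

def best_demand(step):
    def u(d):
        return d * (1.0 - d) ** 0.5  # toy utility; true optimum at d = 2/3
    grid = [i * step for i in range(int(1 / step) + 1)]
    return max(grid, key=u)

d_a = best_demand(0.25)  # agent A's coarse approximation (picks 0.75)
d_b = best_demand(0.1)   # agent B's finer approximation (picks ~0.7)

print(d_a, d_b, d_a + d_b <= 1.0)
# Same "algorithm", different approximations: different demands, which here
# are jointly infeasible (they sum to more than the whole pie). The logical
# correlation FDT leans on fails as soon as the implementations diverge.
```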
A model that just predicts “what the ‘correct’ choice is” doesn’t seem likely to actually do all the stuff that’s instrumental to preventing itself from getting turned off, given the capabilities to do so.
But I’m also just generally confused whether the threat model here is, “A simulated ‘agent’ made by some prompt does all the stuff that’s sufficient to disempower humanity in-context, including sophisticated stuff like writing to files that are read by future rollouts that generate the same agent in a different context window,” or “The RLHF-trained model has goals that it pursues regardless of the prompt,” or something else.
I think you might be misunderstanding Jan’s understanding. A big crux in this whole discussion between Eliezer and Richard seems to be: Eliezer believes any AI capable of doing good alignment research—at least good enough to provide a plan that would help humans make an aligned AGI—must be good at consequentialist reasoning in order to generate good alignment plans. (I gather from Nate’s notes in that conversation plus various other posts that he agrees with Eliezer here, though I’m not certain.) I strongly doubt that Jan just mistook MIRI’s focus on understanding consequentialist reasoning for a belief that alignment research requires being a consequentialist reasoner.
Perhaps the crux here is whether we should expect all superintelligent agents to converge on the same decision procedure—and the agent themselves will expect this, such that they’ll coordinate by default? As sympathetic as I am to realism about rationality, I put a pretty nontrivial credence on the possibility that this convergence just won’t occur, and persistent disagreement (among well-informed people) about the fundamentals of what it means to “win” in decision theory thought experiments is evidence of this.
I think “the very repugnant conclusion is actually fine” does pretty well against its alternatives. It’s totally possible that our intuitive aversion to it comes from just not being able to wrap our brains around some aspect of (a) how huge the numbers of “barely worth living” lives would have to be, in order to make the very repugnant conclusion work; (b) something that is just confusing about the idea of “making it possible for additional people to exist.”
While this doesn’t sound crazy to me, I’m skeptical that my anti-VRC intuitions can be explained by these factors. I think you can get something “very repugnant” on scales that our minds can comprehend (and not involving lives that are “barely worth living” by classical utilitarian standards). Suppose you can populate* some twin-Earth planet with either a) 10 people with lives equivalent to the happiest person on real Earth, or b) one person with a life equivalent to the most miserable person on real Earth plus 8 billion people with lives equivalent to the average resident of a modern industrialized nation.
I’d be surprised if a classical utilitarian thought the total happiness minus suffering in (b) was less than in (a). Heck, 8 billion might be pretty generous. But I would definitely choose (a).
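To make the arithmetic explicit, with made-up welfare numbers (units arbitrary; the conclusion is insensitive to the exact values):

```python
# Illustrative per-life welfare values: happiest person on Earth = +100,
# most miserable person = -100, average industrialized life = +50.

happiest, most_miserable, average = 100, -100, 50

total_a = 10 * happiest                                 # option (a)
total_b = 1 * most_miserable + 8_000_000_000 * average  # option (b)

print(total_a, total_b)
assert total_b > total_a  # the classical-utilitarian total overwhelmingly favors (b)
```

The totals aren't remotely close: (b) only loses if the average life's welfare is valued at essentially zero. That's the point: the classical-utilitarian sum overwhelmingly favors (b), yet the intuition against it persists.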
To me the very-repugnance just gets much worse the more you scale things up. I also find that basically every suffering-focused EA I know is not scope-neglectful about the badness of suffering (at least, when it’s sufficiently intense), or in any area other than population ethics. So it would be pretty strange if we just happened to be falling prey to that error in thought experiments where there’s another explanation—i.e., we consider suffering especially important—which is consistent with our intuitions about cases that don’t involve large numbers.
* As usual, ignore the flow-through effects on other lives.
I agree with the point that we shouldn’t model the AI situation as a zero-sum game. And the kinds of conditional commitments you write about could help with cooperation. But I don’t buy the claim that “implementing this protocol (including slowing down AI capabilities) is what maximizes their utility.”
Here’s a pedantic toy model of the situation, so that we’re on the same page: The value of the whole lightcone going towards an agent’s values has utility 1 by that agent’s lights (and 0 by the other’s), and P(alignment success by someone) = 0 if both speed up, else 1. For each of the alignment success scenarios i, the winner chooses a fraction of the lightcone to give to Alice’s values (xi^A for Alice’s choice, xi^B for Bob’s). Then, some illustrative numbers for the expected payoffs (assuming the players agree on the probabilities):
Payoffs for Alice and Bob if they both speed up capabilities: (0, 0)
Payoffs if Alice speeds, Bob doesn’t: 0.8 * (x1^A, 1 - x1^A) + 0.2 * (x1^B, 1 - x1^B)
Payoffs if Bob speeds, Alice doesn’t: 0.2 * (x2^A, 1 - x2^A) + 0.8 * (x2^B, 1 - x2^B)
Payoffs if neither speeds: 0.5 * (x3^A, 1 - x3^A) + 0.5 * (x3^B, 1 - x3^B)
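The toy model above is easy to code up and check (scenario indices 1 to 3 correspond to the three non-extinction cases; the allocation values plugged in below are just illustrative):

```python
# Expected (Alice, Bob) utilities in the toy race model. x_a[i], x_b[i]
# are the fractions of the lightcone that Alice (resp. Bob) would give to
# Alice's values on winning in scenario i (1: Alice sped, 2: Bob sped,
# 3: neither sped).

def payoffs(alice_speeds, bob_speeds, x_a, x_b):
    if alice_speeds and bob_speeds:
        return (0.0, 0.0)  # P(alignment success) = 0
    if alice_speeds:
        p_alice_wins, i = 0.8, 1
    elif bob_speeds:
        p_alice_wins, i = 0.2, 2
    else:
        p_alice_wins, i = 0.5, 3
    alice = p_alice_wins * x_a[i] + (1 - p_alice_wins) * x_b[i]
    return (alice, 1.0 - alice)

# Alice's conditional strategy: punish Bob's values if Bob sped (x2^A = 1),
# split evenly otherwise (x3^A = 0.5). Bob keeps everything if he wins.
X_A = {1: 0.5, 2: 1.0, 3: 0.5}
X_B = {1: 0.0, 2: 0.0, 3: 0.5}

print(payoffs(False, False, X_A, X_B))  # neither speeds: (0.5, 0.5)
print(payoffs(False, True, X_A, X_B))   # Bob speeds anyway: (0.2, 0.8)
```

Under these (illustrative) allocations, Bob speeding nets him 0.8 > 0.5 despite Alice's punishment, because the punishment only bites in the 20% of worlds where Alice wins. Only if Bob's own winning-case allocation could also be credibly committed (x2^B = 0.5, leaving him 0.4) does slowing down clearly win.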
So given this model, seems that you’re saying Bob has an incentive to slow down capabilities because Alice’s ASI successor can condition the allocation to Bob’s values on his decision. Which we can model as Bob expecting Alice to use the strategy {don’t speed; x2^A = 1; x3^A = 0.5} (given she doesn’t speed up, she only rewards Bob’s values if Bob didn’t speed up).
Why would Bob so confidently expect this strategy? You write:
I guess the claim is just that them both using this procedure is a Nash equilibrium? If so, I see several problems with this:
There are more Pareto-efficient equilibria than just “[fairly] cooperate” here. Alice could just as well expect Bob to be content with getting expected utility 0.2 from the outcome where he slows down and Alice speeds up — better that than the utility 0 from extinction, after all. Alice might think she can make it credible to Bob that she won’t back down from speeding up capabilities, and vice versa, such that they both end up pursuing incompatible demands. (See, e.g., “miscoordination” here.)
You’re lumping “(a) slow down capabilities and (b) tell your AI to adopt a compromise utility function” into one procedure. I guess the idea is that, ideally, the winner of the race could have their AI check whether the loser was committed to do both (a) and (b). But realistically it seems implausible to me that Alice or Bob can commit to (b) before winning the race, i.e., that what they do in the time before they win the race determines whether they’ll do (b). They can certainly tell themselves they intend to do (b), but that’s cheap talk.
So it seems Alice would likely think, “If I follow the whole procedure, Bob will cooperate with my values if I lose. But even if I slow down (do (a)), I don’t know if my future self [or, maybe more realistically, the other successors who might take power] will do (b) — indeed once they’re in that position, they’ll have no incentive to do (b). So slowing down isn’t clearly better.” (I do think, setting aside the bargaining problem in (1), she has an incentive to try to make it more likely that her successors follow (b), to be clear.)