Researcher at the Center on Long-Term Risk. All opinions my own.
Anthony DiGiovanni
Note that the real world contains many Newcomb-like problems. People do in fact go around making decisions that depend on their beliefs about other people’s decision algorithms.
I disagree; see here for why. (I think “decisions that depend on their beliefs about other people’s decision algorithms” is too weak to get you a Newcomblike structure.)
Cluelessness: Summary of the argument, why it matters, and counterarguments
Thanks Vojta!
I agree that thinking about concrete scenarios is important. But I’m not exactly sure what you have in mind here: “The hope behind this is that it would give us intuition pumps with which progress on SPIs would get easier and faster.” What’s a quick example?
Is not a particularly good description of what academia and the academic peer review process is actually doing in many fields, including academic philosophy
I don’t understand why you think this. I know academia has pathologies, sure, but it seems pretty clear — as a mechanistic claim, not an “outside view heuristic” — that academic philosophers are trained to scrutinize philosophical arguments. I don’t think you’ve really explained why I should believe that “accomplished and high impact” people have more experience with scrutinizing philosophical arguments than academic philosophers.
You’re stipulating that CDT-me in your thought experiment doesn’t have access to any (psychological) actions that causally bind me to not steal from you. Right? Then sure, CDT-me would steal if he ended up in your house, and you’d want to prevent this.
But you’re also stipulating that I do have access to the action “decide to follow FDT”. That’s something that would causally bind me to not steal from you, if I took it before you made your decision whether to hire me. Why is this action a legitimate option in the hypothetical, while various other non-FDT ways of binding oneself aren’t?
Insofar as we think we should defer to some extent to [group of people] on [topic], shouldn’t we defer to [group of people whose job it is to scrutinize arguments on that topic] more so than [group of people who “are most accomplished and high impact”]? What does the latter have to do with their expertise about decision theory?
it illustrates the issue with EDT + CDT: they can’t commit to anything
What does this mean? Of course an agent who endorses EDT or CDT can commit to things — commitments are actions they decide between, like anything else.
Have you asked Katja what she’d prefer here? (I worry people might get a negative impression of her from this even though she didn’t intend it.)
A high-level model of AI bargaining
[Linkpost] Evals for “SPI-incompatible” behavior & reasoning: Guide to initial research
It seems unnecessarily confusing to use the word “caring” for “putting weight on something in a way that isn’t ‘objective’, in the sense of empirical evidence plus logic”. (I know this has precedent in this community, tbc, I’m pushing back on that too.) I don’t assign very high probability to the sun rising tomorrow because I “care” a lot more about sun-rises-tomorrow hypotheses; I do that because I find the epistemological norms that ground induction intuitive. These are just different things, even if they share the property of being non-objective.
I believe basically everything can be formulated as a “bet”, and I don’t quite see what could be there about probabilities that can’t be phrased this way.
“What do you anticipate happening?” From my perspective, anticipation is nothing else than thinking about the consequences of an event. That’s useful if the event happens, and a waste of time if it doesn’t. Therefore, whether I anticipate an event translates to whether I want to bet my time on thinking about it.
“Aren’t you surprised by this event?” To me, surprisal is just getting into a situation that I didn’t make plans for. It’s equivalent to losing a bet: I wagered my time on thinking about the consequences of the other possibility, but the outcome that I didn’t bet on had come to pass.
You can reformulate the anticipation/surprisal interpretations of probability in terms of bets, but I don’t think this is much of a positive argument for that approach. I would say, you should bet in some way for good reasons. The anticipation/surprisal interpretations at least gesture at what those reasons are: you expect better consequences from betting that way.
I think this is important, because the “probabilities are just betting odds” meme unnecessarily rules out (e.g.) imprecise probabilities by fiat.
(I’m not sure I endorse the anticipation/surprisal framings either, exactly. I think of probability more in terms of degrees of plausibility. See here for a bit more.)
good odds of changing stuff
Part of the problem OP points out is that “changing stuff” =/= “changing stuff positively”. (“Here’s one. Crucial considerations and sign flips are common.”)
These don’t seem like esoteric failures where it’s plausible that the hypothetical isn’t fair, or where the dynamic inconsistency is plausibly explained as being caused by changing values, or where there’s a fundamental tradeoff between the harms and value of new information. The errors happen in very mundane situations
Setting aside whether these count as “errors” (I basically agree with Jesse’s comment), I’m not sure why you think the above. Paul’s case involves this assumption: “We live in a very big universe where many copies of me all face the exact same decision. This seems plausible for a variety of reasons; the best one is accepting an interpretation of quantum mechanics without collapse (a popular view).”
In one sense this is mundane, since if you buy any of the large world theories, you’d think we’ve always been in such a large world. Maybe that’s what you mean. But it’s very non-mundane in the sense that these large worlds are still clearly out-of-distribution for our intuitions. We haven’t directly experienced the consequences of lots of copies of ourselves doing things. This is one big reason I’m pretty suspicious of my pre-theoretic intuitions about Paul’s case.
Lesser-known LLMisms, in my experience:
“gloss”
“My lean:”
“Does that land?”
“Net:”
Using a verb without a direct object, when that verb usually is paired with a direct object. (Not incorrect, but it sounds weird.)
“posture”
“throat-clearing”
Sorry, I don’t yet see how this engages with the argument in my post.
Your original post’s claims, as I understand them: “There’s a policy such that (1) if your counterpart best-responds to that policy, you all split things fairly, and (2) if your counterpart doesn’t best-respond, you all don’t totally destroy the pie. And (sec 2.4) it’s fine that other agents might make commitments before you make any decisions, because it’s not rational for other agents to commit to threaten you if you have this policy.”
My post’s counterargument: “That second sentence ignores the problem. You’re saying, it’s not rational ex post for some agent Bob to commit to threaten you if they know you have this policy. But the whole problem is that Bob might have thought ex ante that you would use some other policy.”
Why I don’t think you’ve addressed my counterargument:
(Your first bullet seems orthogonal to my counterargument.)
Second bullet: Bob might think ex ante that (a) you won’t follow FDT, [1] and (b) you’ll be more likely to meet his demands if he sticks with his unfair demand than if he doesn’t. Hence we haven’t dodged “the whole problem” above.
Third bullet: I think this is a really strong (empirical) claim, which you’ve only asserted rather than argued for.
[1] (Or rather, whichever version of “FDT” is stipulated by definition to never give in; I think this is debatable under the standard definition of FDT.)
There’s more explanation of why work on AI conflict is time-sensitive / not purely deferrable to AIs in this sequence, which I link to in the intro. And as I note in the intro:
We agree SPIs will likely be used by default. However, this is arguably not overwhelmingly likely, because AIs or humans in the loop might mistakenly lock out the opportunity to use SPIs later. It’s unclear if default capabilities progress will generalize to careful reasoning about novel bargaining approaches. So, given the large stakes of conflicts that SPIs could prevent, making SPI implementation even more likely seems promising overall.
To be clear, I don’t expect this to be super compelling on its own. I’m assuming readers have background on CLR’s agenda more generally. And this agenda is meant to motivate and explain our strategy on an approach that’s particularly promising if you already share CLR’s general prioritization.
Re: this:
And how would testing current AIs be relevant to the bargaining behavior of the powerful AIs you’re worried about?
My sense is that it’s fairly common in AI safety generally to study current AIs in preparation for the actually dangerous AIs. I don’t understand why you think this methodology is especially ill-suited for our threat model.
Can you say more about how you’d respond to the problems with that kind of proposal discussed here?
Yep. Presumably BB’s implicit claim here is something like: “People make claims that we should do such and such thing because of FDT, but those claims don’t follow from ‘we should program an AI to follow FDT’.”
For example, here’s Richard Ngo saying we should cooperate with the values of civilizations outside our lightcone because of FDT.