(Formerly “antimonyanthony.”) I’m an s-risk-focused AI safety researcher at the Center on Long-Term Risk. I (occasionally) write about altruism-relevant topics on my blog, Ataraxia. All opinions my own.
Anthony DiGiovanni
Thanks!
It’s true that you usually have some additional causal levers, but none of them are the exact same as “be the kind of person who does X.”
Not sure I understand. It seems like “being the kind of person who does X” is a habit you cultivate over time, which causally influences how people react to you. Seems pretty analogous to the job candidate case.
if CDT agents often modify themselves to become an LDT/FDT agent then it would broadly seem accurate to say that CDT is getting outcompeted
See my replies to interstice’s comment—I don’t think “modifying themselves to become an LDT/FDT agent” is what’s going on, at least, there doesn’t seem to be pressure to modify themselves to do all the sorts of things LDT/FDT agents do. They come apart in cases where the modification doesn’t causally influence another agent’s behavior.
(This seems analogous to claims that consequentialism is self-defeating because the “consequentialist” decision procedure leads to worse consequences on average. I don’t buy those claims, because consequentialism is a criterion of rightness, and there are clearly some cases where doing the non-consequentialist thing is a terrible idea by consequentialist lights even accounting for signaling value, etc. It seems misleading to call an agent a non-consequentialist if everything they do is ultimately optimizing for achieving good consequences ex ante, even if they adhere to some rules that have a deontological vibe and in a given situation may be ex post suboptimal.)
It seems plausible that there is no such thing as “correct” metaphilosophy, and humans are just making up random stuff based on our priors and environment and that’s it and there is no “right way” to do philosophy, similar to how there are no “right preferences”
If this is true, doesn’t this give us more reason to think metaphilosophy work is counterfactually important, i.e., can’t just be delegated to AIs? Maybe this isn’t what Wei Dai is trying to do, but it seems like “figure out which approaches to things (other than preferences) that don’t have ‘right answers’ we [assuming coordination on some notion of ‘we’] endorse, before delegating to agents smarter than us” is time-sensitive, and yet doesn’t seem to be addressed by mainstream intent alignment work AFAIK.
(I think one could define “intent alignment” broadly enough to encompass this kind of metaphilosophy, but I smell a potential motte-and-bailey looming here if people want to justify particular research/engineering agendas labeled as “intent alignment.”)
You said “Bob commits to LDT ahead of time”
In the context of that quote, I was explaining why I don’t buy the claim that following LDT gives you advantages over committing, for future problems, to do stuff that’s good for you to commit to do ex ante even if it would be bad for you ex post had you not been committed.
What is selected for is being the sort of agent such that, when others observe you, they update towards doing stuff that’s good for you. This is distinct from being the sort of agent who does stuff that would have helped you if you had been able to shape others’ beliefs / incentives, when in fact you didn’t have such an opportunity.
I think a CDT agent would pre-commit to paying in a one-off Counterfactual Mugging
Sorry I guess I wasn’t clear what I meant by “one-shot” here / maybe I just used the wrong term—I was assuming the agent didn’t have the opportunity to commit in this way. They just find themselves presented with this situation.
Same as above
Hmm, I’m not sure you’re addressing my point here:
Imagine that you’re an AGI, and either in training or earlier in your lifetime you faced situations where it was helpful for you to commit to, as above, “do stuff that’s good for you to commit to do ex ante even if it would be bad for you ex post had you not been committed.” You tended to do better when you made such commitments.
But now you find yourself thinking about this commitment races stuff. And, importantly, you have not previously broadcast credible commitments to a bargaining policy to your counterpart. Do you have compelling reasons to think you and your counterpart have been selected to have decision procedures that are so strongly logically linked, that your decision to demand more than a fair bargain implies your counterpart does the same? I don’t see why. But that’s what we’d need for the Fair Policy to work as robustly as Eliezer seems to think it does.
Yeah, this is a complicated question. I think some things can indeed safely be deferred, but less than you’re suggesting. My motivations for researching these problems:
1. Commitment races problems seem surprisingly subtle, and off-distribution for general intelligences who haven’t reflected about them. I argued in the post that competence at single-agent problems or collective action problems does not imply competence at solving commitment races. If early AGIs might get into commitment races, it seems complacent to expect that they’ll definitely be better at thinking about this stuff than humans who have specialized in it.
2. If nothing else, human predecessors might make bad decisions about commitment races and lock those into early AGIs. I want to be in a position to know which decisions about early AGIs’ commitments are probably bad (like, say, “just train the Fair Policy with no other robustness measures”) and advise against them.
3. Understanding how much risk there is by default of things going wrong, even when AGIs rationally follow their incentives, tells us how cautious we need to be about how to deploy even intent-aligned systems. (C.f. Christiano here about similar motivations for doing alignment research even if lots of it can be deferred to AIs, too.)
4. (Less important IMO:) As I argued in the post, we can’t be confident there’s a “right answer” to decision theory to which AGIs will converge (especially in time for the high-stakes decisions). We may need to solve “decision theory alignment” with respect to our goals, to avoid behavior that is insufficiently cautious by our lights but a rational response to the AGI’s normative standards even if it’s intent-aligned. Given how much humans disagree with each other about decision theory, though: An MVP here is just instructing the intent-aligned AIs to be cautious about thorny decision-theoretic problems where those AIs may think they need to make decisions without consulting humans (but then we need the humans to be appropriately informed about this stuff too, as per (2)). That might sound like an obvious thing to do, but “law of earlier failure” and all that...
5. (Maybe less important IMO, but high uncertainty:) Suppose we can partly shape AIs’ goals and priors without necessarily solving all of intent alignment, making the dangerous commitments less attractive to them. It’s helpful to know how likely certain bargaining failure modes are by default, to know how much we should invest in this “plan B.”
6. (Maybe less important IMO, but high uncertainty:) As I noted in the post, some of these problems are about making the right kinds of commitments credible before it’s too late. Plausibly we need to get a head start on this. I’m unsure how big a deal this is, but prima facie, credibility of cooperative commitments is both time-sensitive and distinct from intent alignment work.
The key point is that “acting like an LDT agent” in contexts where your commitment causally influences others’ predictions of your behavior does not imply you’ll “act like an LDT agent” in contexts where that doesn’t hold. (And I would dispute that we should label making a commitment to a mutually beneficial deal as “acting like an LDT agent,” anyway.) In principle, maybe the simplest generalization of the former is LDT. But if doing LDT things in the latter contexts is materially costly for you (e.g. paying in a truly one-shot Counterfactual Mugging), it seems to me that LDT would be selected against.
ETA: The more action-relevant example in the context of this post, rather than one-shot CM, is: “Committing to a fair demand, when you have values and priors such that a more hawkish demand would be preferable ex ante, and the other agents you’ll bargain with don’t observe your commitment before they make their own commitments.” I don’t buy that that sort of behavior is selected for, at least not strongly enough to justify the claim I respond to in the third section.
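To make the ex ante / ex post distinction in the one-shot Counterfactual Mugging example concrete, here is a minimal sketch. The payoff numbers, the fair coin, and the perfectly accurate predictor are illustrative assumptions, not something specified in the discussion above.

```python
# Illustrative numbers only (assumptions, not taken from the discussion).
P_HEADS = 0.5     # probability the coin lands heads
REWARD = 10_000   # paid on heads, only to agents predicted to pay on tails
COST = 100        # what the agent is asked to hand over on tails


def ex_ante_value(pays_on_tails: bool) -> float:
    """Expected payoff of a policy before the coin flip,
    assuming the predictor foresees the policy correctly."""
    heads_payoff = REWARD if pays_on_tails else 0
    tails_payoff = -COST if pays_on_tails else 0
    return P_HEADS * heads_payoff + (1 - P_HEADS) * tails_payoff


def ex_post_value(pays_on_tails: bool) -> float:
    """Payoff once the agent already knows the coin landed tails."""
    return -COST if pays_on_tails else 0


# Gain from being a payer, judged from each vantage point.
ex_ante_gain = ex_ante_value(True) - ex_ante_value(False)
ex_post_gain = ex_post_value(True) - ex_post_value(False)
print(ex_ante_gain, ex_post_gain)
```

Under these assumptions, paying looks good ex ante and is a pure loss ex post. The claim in the comment is about the second vantage point: an agent who never had the chance to commit, and simply finds itself facing the tails branch, bears the material cost with no causal benefit.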
Responses to apparent rationalist confusions about game / decision theory
Exploitation means the exploiter benefits. If you are a rock, you can’t be exploited. If you are an agent who never gives in to threats, you can’t be exploited (at least by threats, maybe there are other kinds of exploitation). That said, yes, if the opponent agents are the sort to do nasty things to you anyway even though it won’t benefit them, then you might get nasty things done to you. You wouldn’t be exploited, but you’d still be very unhappy.
Cool, I think we basically agree on this point then, sorry for misunderstanding. I just wanted to emphasize the point I made because “you won’t get exploited if you decide not to concede to bullies” is kind of trivially true. :) The operative word in my reply was “robustly,” which is the hard part of dealing with this whole problem. And I think it’s worth keeping in mind how “doing nasty things to you anyway even though it won’t benefit them” is a consequence of a commitment that was made for ex ante benefits, it’s not the agent being obviously dumb as Eliezer suggests. (Fortunately, as you note in your other comment, some asymmetries should make us think these commitments are rare overall; I do think an agent probably needs to have a pretty extreme-by-human-standards, little-to-lose value system to want to do this… but who knows what misaligned AIs might prefer.)
It also has a deontological or almost-deontological constraint that prevents it from getting exploited.
I’m not convinced this is robustly possible. The constraint would prevent this agent from getting exploited conditional on the potential exploiters best-responding (being “consequentialists”). But it seems to me the whole heart of the commitment races problem is that the potential exploiters won’t necessarily do this, indeed depending on their priors they might have strong incentives not to. (And they might not update those priors for fear of losing bargaining power.)
That is, these exploiters will follow the same qualitative argument as us — “if I don’t commit to demand x%, and instead compromise with others’ demands to avoid conflict, I’ll lose bargaining power” — and adopt their own pseudo-deontological constraints against being fair. Seems that adopting your deontological strategy requires assuming one’s bargaining counterparts will be “consequentialists” in a similar way as (you claim) the exploitative strategy requires. And this is why Eliezer’s response to the problem is inadequate.
There might be various symmetry-breakers here, but I’m skeptical they favor the fair/nice agents so strongly that the problem is dissolved.
I think this is a serious challenge and a way that, as you say, an exploitation-resistant strategy might be “wasteful/clumsy/etc., hurting it’s own performance in other ways in order to achieve the no-exploitation property.” At least, unless certain failsafes against miscoordination are used—my best guess is these look like some variant of safe Pareto improvements that addresses the key problem discussed in this post, which I’ve worked on recently (as you know).
Given this, I currently think the most promising approach to commitment races is to mostly punt the question of the particular bargaining strategy to smarter AIs, and our job is to make sure robust SPI-like things are in place before it’s too late.
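The worry in this comment can be illustrated with a toy demand game. This is only a sketch: the pie size, the two demand levels, and the conflict payoff are made-up assumptions, and real bargaining problems are much richer than this.

```python
from itertools import product

# All numbers are illustrative assumptions.
PIE = 100               # total surplus to split, in percent
FAIR, HAWKISH = 50, 70  # two possible committed demands
CONFLICT_PAYOFF = -100  # what each side gets if the demands are incompatible


def outcome(demand_a: int, demand_b: int) -> tuple[int, int]:
    """Payoffs when both demands are locked in before either is observed."""
    if demand_a + demand_b <= PIE:
        return demand_a, demand_b
    return CONFLICT_PAYOFF, CONFLICT_PAYOFF


def best_response(observed_demand: int) -> int:
    """A 'consequentialist' counterpart who concedes whatever is left."""
    return PIE - observed_demand


# Every pairing of committed (non-best-responding) agents.
payoffs = {(a, b): outcome(a, b) for a, b in product((FAIR, HAWKISH), repeat=2)}
for pair, result in payoffs.items():
    print(pair, result)
```

In this toy model the hawkish commitment only pays off against a best-responder (`outcome(HAWKISH, best_response(HAWKISH))`). An agent committed to `FAIR` is never exploited, since the hawkish agent gains nothing from the pairing, but both sides still end up with the conflict payoff. That is the sense in which exploitation resistance alone does not dissolve the problem.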
The second mover ALREADY had the option not to commit—they could just swerve or crash, according to their decision theory.
The premise here is that the second-mover decided to commit soon after the first-mover did, because the proof of the first-mover’s initial commitment didn’t reach the second-mover quickly enough. They could have not committed initially, but they decided to do so because they had a chance of being first.
I’m not sure exactly what you mean by “according to their decision theory” (as in, what this adds here).
if it doesn’t change the sequence of commitment, I don’t see how it makes any difference at all
The difference is that the second-mover can say “oh shit I committed before getting the broadcast of the first-mover’s commitment—I’d prefer to revoke this commitment because it’s pointless, my commitment doesn’t shape the first-mover’s incentives in any way since I know the first-mover will just prefer to keep their commitment fixed.”
As I said, the first-mover doesn’t lose their advantage from this at all, because their commitment is locked (at their freeze time) before the second-mover’s. So they can just leave their commitment in place, and their decision won’t be swayed by the second-mover’s at all because of the rule: “You shouldn’t be able to reveal the final decision to anyone before freeze_time because we don’t want the commitment to get credible before freeze_time.”
better off having a “real” commitment than a revocable commitment that Bob can talk her out of
I’m confused what you mean here. In principle Alice can revoke her commitment before the freeze time in this protocol, but Bob can’t force her to do so. Now suppose it’s common knowledge that Alice’s freeze time comes before Bob’s. Alice knows that there will be a window after her freeze time where Bob knows Alice’s commitment is frozen and Bob still has a chance to revert. So there would be no reason (barring some other commitment mechanism, including Bob being verifiably updateless while Alice isn’t) for Bob not to revoke (to Swerve) if Alice refused to revert from Dare. So Alice would practically always keep her commitment.
The power to revoke commitments here is helpful in the hands of the second-mover, who made the initial incompatible commitment because of, e.g., some lag time between the first-mover’s making and broadcasting the commitment.
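Here is a rough sketch of the scenario discussed in this thread, as I understand it. The `Commitment` class, the Chicken payoffs, and the specific times are hypothetical illustrations; only the idea of a `freeze_time` after which a commitment can no longer be revoked comes from the protocol being discussed.

```python
from dataclasses import dataclass

# Assumed Chicken payoffs: (own action, opponent's action) -> own payoff.
PAYOFF = {
    ("Dare", "Dare"): -10,
    ("Dare", "Swerve"): 2,
    ("Swerve", "Dare"): -1,
    ("Swerve", "Swerve"): 0,
}


@dataclass
class Commitment:
    action: str
    freeze_time: int  # from this time on, the action can no longer be revoked

    def revoke_to(self, new_action: str, now: int) -> bool:
        """Switch to new_action; only allowed strictly before freeze_time."""
        if now >= self.freeze_time:
            return False
        self.action = new_action
        return True


def best_reply(opponent_action: str) -> str:
    """Action that maximizes own payoff against a fixed opponent action."""
    return max(("Dare", "Swerve"), key=lambda a: PAYOFF[(a, opponent_action)])


# Both commit to Dare; Bob did so before Alice's broadcast reached him.
alice = Commitment("Dare", freeze_time=1)
bob = Commitment("Dare", freeze_time=3)

now = 2  # Alice's commitment is frozen, Bob's is not yet
alice_changed = alice.revoke_to("Swerve", now)              # rejected: frozen
bob_changed = bob.revoke_to(best_reply(alice.action), now)  # Bob backs down
print(alice.action, bob.action)
```

With these assumed payoffs, Alice keeps her first-mover advantage and the Dare/Dare crash is avoided, because the only thing the revocation window changes is that Bob can undo a commitment that no longer shapes Alice’s incentives.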
I’d recommend checking out this post critiquing this view, if you haven’t read it already. Summary of the counterpoints:
(Intent) alignment doesn’t seem sufficient to ensure an AI makes safe decisions about subtle bargaining problems in a situation of high competitive pressure with other AIs. I don’t expect the kinds of capabilities progress that is incentivized by default to suffice for us to be able to defer these decisions to the AI, especially given path-dependence on feedback from humans who’d be pretty naïve about this stuff. (C.f. this post—you need the human feedback at bottom to be sufficiently high quality to not get garbage-in, garbage-out problems even if you’ve solved the hard parts of alignment.)
To the extent that solving all of intent alignment is too intractable, focusing on subsets of alignment that are especially likely to avoid s-risks—e.g. preventing AIs from intrinsically valuing frustrating others’ preferences—might be promising. I don’t think mainstream alignment research prioritizes these.
Claims about counterfactual value of interventions given AI assistance should be consistent
A common claim I hear about research on s-risks is that it’s much less counterfactual than alignment research, because if alignment goes well we can just delegate it to aligned AIs (and if it doesn’t, there’s little hope of shaping the future anyway).
I think there are several flaws with this argument that require more object-level context (see this post).[1] But at a high level, this consideration—that research/engineering can be delegated to AIs that pose little-to-no risk of takeover—should also make us discount the counterfactual value of alignment research/engineering. The main plan of OpenAI’s alignment team, and part of Anthropic’s plan and those of several thought leaders in alignment, is to delegate alignment work (arguably the hardest parts thereof)[2] to AIs.
It’s plausible (and apparently a reasonably common view among alignment researchers) that:
Aligning models on tasks that humans can evaluate just isn’t that hard, and would be done by labs for the purpose of eliciting useful capabilities anyway; and
If we restrict to using predictive (non-agentic) models for assistance in aligning AIs on tasks humans can’t evaluate, they will pose very little takeover risk even if we don’t have a solution to alignment for AIs at their limited capability level.
It seems that if these claims hold, lots of alignment work would be made obsolete by AIs, not just s-risk-specific work. And I think several of the arguments for humans doing some alignment work anyway apply to s-risk-specific work:
In order to recognize what good alignment work (or good deliberation about reducing conflict risks) looks like, and provide data on which to finetune AIs who will do that work, we need to practice doing that work ourselves. (Christiano here, Wentworth here)
To the extent that working on alignment (or s-risks) ourselves gives us / relevant decision-makers evidence about how fundamentally difficult these problems are, we’ll have better guesses as to whether we need to push for things like avoiding deploying the relevant kinds of AI at all. (Christiano again)
For seeding the process that bootstraps a sequence of increasingly smart aligned AIs, you need human input at the bottom to make sure that process doesn’t veer off somewhere catastrophic—garbage in, garbage out. (O’Gara here.) AIs’ tendencies towards s-risky conflicts seem to be, similarly, sensitive to path-dependent factors (in their decision theory and priors, not just values, so alignment plausibly isn’t sufficient).
I would probably agree that alignment work is more likely to make a counterfactual difference to P(misalignment) than s-risk-targeted work is to make a counterfactual difference to P(s-risk), overall. But the gap seems to be overstated (and other prioritization considerations can outweigh this one, of course).
[1] That post focuses on technical interventions, but a non-technical intervention that seems pretty hard to delegate to AIs is to reduce race dynamics between AI labs, which lead to an uncooperative multipolar takeoff.
[2] I.e., the hardest part is ensuring the alignment of AIs on tasks that humans can’t evaluate, where the ELK problem arises.
antimonyanthony’s Shortform
primarily because models will understand the base goal first before having world modeling
Could you say a bit more about why you think this? My definitely-not-expert expectation would be that the world-modeling would come first, then the “what does the overseer want” after that, because that’s how the current training paradigm works: pretrain for general world understanding, then finetune on what you actually want the model to do.
“I am devoting my life to solving the most important problems in the world and alleviating as much suffering as possible” fits right into the script. That’s exactly the kind of thing you are supposed to be thinking. If you frame your life like that, you will fit in and everyone will understand and respect what is your basic deal.
Hm, this is a pretty surprising claim to me. It’s possible I haven’t actually grown up in a “western elite culture” (in the U.S., it might be a distinctly coastal thing, so the cliché goes? IDK). Though, I presume having gone to some fancypants universities in the U.S. makes me close enough to that. The Script very much did not encourage me to devote my life to solving the most important problems and alleviating as much suffering as possible, and it seems not to have encouraged basically any of my non-EA friends from university to do this. I/they were encouraged to have careers that were socially valuable, to be sure, but not the main source of purpose in their lives or a big moral responsibility.
A model that just predicts “what the ‘correct’ choice is” doesn’t seem likely to actually do all the stuff that’s instrumental to preventing itself from getting turned off, given the capabilities to do so.
But I’m also just generally confused whether the threat model here is, “A simulated ‘agent’ made by some prompt does all the stuff that’s sufficient to disempower humanity in-context, including sophisticated stuff like writing to files that are read by future rollouts that generate the same agent in a different context window,” or “The RLHF-trained model has goals that it pursues regardless of the prompt,” or something else.
confused claims that treat (base) GPT3 and other generative models as traditional rational agents
I’m pretty surprised to hear that anyone made such claims in the first place. Do you have examples of this?
I think you might be misunderstanding Jan’s understanding. A big crux in this whole discussion between Eliezer and Richard seems to be: Eliezer believes any AI capable of doing good alignment research (at least good enough to provide a plan that would help humans make an aligned AGI) must be good at consequentialist reasoning in order to generate good alignment plans. (I gather from Nate’s notes in that conversation plus various other posts that he agrees with Eliezer here, but I’m not certain.) I strongly doubt that Jan just mistook MIRI’s focus on understanding consequentialist reasoning for a belief that alignment research requires being a consequentialist reasoner.
I agree with your guesses.
I am not sure that “controlling for game-theoretic instrumental reasons” is actually a move that is well defined/makes sense.
I don’t have a crisp definition of this, but I just mean that, e.g., we compare the following two worlds: (1) 99.99% of agents are non-sentient paperclippers, and each agent has equal (bargaining) power. (2) 99.99% of agents are non-sentient paperclippers, and the paperclippers are all confined to some box. According to plenty of intuitive-to-me value systems, you only (maybe) have reason to increase paperclips in (1), not (2). But if the paperclippers felt really sad about the world not having more paperclips, I’d care—to an extent that depends on the details of the situation—about increasing paperclips even in (2).
That wasn’t my claim. I was claiming that even if you’re an “LDT” agent, there’s no particular reason to think all your bargaining counterparts will pick the Fair Policy given you do. This is because:
Your bargaining counterparts won’t necessarily consult LDT.
Even if they do, it’s super unrealistic to think of the decision-making of agents in high-stakes bargaining problems as entirely reducible to “do what [decision theory X] recommends.”
Even if decision-making in these problems were as simple as that, why should we think all agents will converge to using the same simple method of decision-making? It seems like if an agent is capable of decorrelating their decision-making in bargaining from their counterpart’s, and their counterpart knows this or anticipates it on priors, that agent has an incentive to do so if they can be sufficiently confident that their counterpart will concede to their hawkish demand.
So no, “committing to act like LDT agents all the time,” in the sense that is helpful for avoiding selection pressures against you, does not ensure you’ll have a decision procedure such that you have no bargaining problems.
I’m confused, the commitment is to act in a certain way that, had you not committed, wouldn’t be beneficial unless you appealed to acausal (and updateless) considerations. But the act of committing has causal benefits.
I agree these are both important possibilities, but:
The reasoning “I see that they’ve committed to refuse high demands, so I should only make a compatible demand” can just be turned on its head and used by the agent who commits to the high demand.
One might also think on priors that some agents might be committed to high demands, therefore strictly insisting on fair demands against all agents is risky.
I was specifically replying to the claim that the sorts of AGIs who would get into high-stakes bargaining would always avoid catastrophic conflict arising from bargaining problems; such a claim requires something stronger than the considerations you’ve raised, i.e., an argument that all such AGIs would adopt the same decision procedure (and account for logical causation) and therefore coordinate their demands.
(By default if I don’t reply further, it’s because I think your further objections were already addressed—which I think is true of some of the things I’ve replied to in this comment.)