Erich_Grunewald
In the New York example, it could be that when someone says “Guys, we should really buy those Broadway tickets. The trip to New York is next month already.” they prompt the response “What? I thought we were going the month after!”, hence the disagreement. If this detail had been discussed earlier, there might have been the “February trip” and the “March trip” in order to disambiguate the trip(s) to New York.
I guess I don’t understand what focusing on disagreements adds. Sure, in this situation, the disagreement stems from some people thinking the trip is near (and others thinking it’s farther away). But we already knew that some people think AGI is near and others think it’s farther away! What does observing that people disagree about that stuff add?
What seems to have happened is that people at one point latched on to the concept of AGI, thinking that their interpretation was virtually the same as those of others because of its lack of definition. Again, if they had disagreed with the definition to begin with, they would have used a different word altogether. Now that some people are claiming that AGI is here or here soon, it turns out that the interpretations do in fact differ.
Yeah, I would say that as those early benchmarks (“can beat anyone at chess”, etc.) are achieved without producing what “feels like” AGI, people are forced to make their intuitions concrete, or anyway reckon with their old bad operationalizations of AGI. And that naturally leads to lots of discussion around what actually constitutes AGI. But again, all this is evidence of is that those early benchmarks have been achieved without producing what “feels like” AGI. But we already knew that.
I think that, in your New York example, the increasing disagreement is driven by people spending more time thinking about the concrete details of the trip. They do so because it is obviously more urgent, because they know the trip is happening soon. The disagreements were presumably already there in the form of differing expectations/preferences, and were only surfaced later on as they started discussing things more concretely. So the increasing disagreements are driven by increasing attention to concrete details.
It seems likely to me that the increasing disagreement around AGI is also driven by people spending more time thinking about the concrete details of what constitutes AGI. But where in the New York example we can assume people pay more attention to the details because they know the trip is upcoming, with AGI we know that people don’t know when AGI will happen, so there must be some other reason.
One reason could be “a bunch of people think/feel AGI is near”, but we already knew that before noticing disagreement around AGI. Another reason could be that there’s currently a lot of hype and activity around AI and AGI. But the fact that there’s lots of hype around AI/AGI doesn’t seem like much evidence that AGI is near, or if it is, can also be stated more directly than through a detour via disagreements.
Thanks for writing this—it’s very concrete and interesting.
Have you thought about using company market caps as an indicator of AGI nearness? I would guess that an AI index (maybe NVIDIA, Alphabet, Meta, and Microsoft) would look significantly different in the two scenarios you paint. To control for general economic conditions, you could look at those companies relative to the NASDAQ-100 (minus AI companies). An advantage of this is that it tracks a lot of different indicators, including ones that are really fuzzy or hard to discover, through the medium of market speculators. Another advantage is that it is a clear and easily measurable quantity. That makes it easy to make bets and create prediction markets around it.
Of course, there is weirdness around how we should expect the market to behave in the run-up to AGI, where wealth may become less relevant, etc. But I’d still expect the market to be significantly more bullish on AI stocks in the run-up to AGI, than in an AI fizzle/winter scenario.
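To make the suggestion a bit more concrete, here is a minimal sketch of how such a relative indicator could be computed. Everything in it is an assumption for illustration: the ticker names are placeholders, the prices are made up, and equal weighting is just one possible choice (a cap-weighted basket would be closer to how most real indices work).

```python
from typing import Dict, List


def basket_index(prices: Dict[str, List[float]]) -> List[float]:
    """Equal-weighted index: the average of each ticker's price
    divided by its own first price, so the index starts at 1.0."""
    tickers = list(prices)
    n_periods = len(prices[tickers[0]])
    return [
        sum(prices[t][i] / prices[t][0] for t in tickers) / len(tickers)
        for i in range(n_periods)
    ]


def relative_strength(ai: List[float], benchmark: List[float]) -> List[float]:
    """Ratio of the AI basket to the benchmark in each period.
    A rising ratio means the AI basket is outperforming the benchmark."""
    return [a / b for a, b in zip(ai, benchmark)]


# Made-up prices for hypothetical tickers, purely for illustration.
ai_prices = {
    "AI_A": [100.0, 110.0, 132.0],
    "AI_B": [50.0, 55.0, 60.0],
}
# Stand-in for a broad benchmark with the AI companies removed.
benchmark_prices = {"BENCH_EX_AI": [200.0, 210.0, 220.0]}

ai_index = basket_index(ai_prices)
bench_index = basket_index(benchmark_prices)
ratio = relative_strength(ai_index, bench_index)
```

Dividing by the benchmark is what strips out general market conditions: if both series rise together, the ratio stays flat, and it only moves when the AI basket diverges from the rest of the market.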
The jury is still out, but it’s currently available even in Direct Chat on Chatbot Arena, there will be more data on this soon.
Fyi, it’s also available on https://chat.deepseek.com/, as is their reasoning model DeepSeek-R1-Lite-Preview (“DeepThink”). (I suggest signing up with a throwaway email and not inputting any sensitive queries.) From quickly throwing it a few requests I recently asked 3.5 Sonnet, DeepSeek-V3 seems slightly worse, but nonetheless solid.
I’m not totally sure of this, but it looks to me like there’s already more scientific consensus around mirror life being a threat worth taking seriously, than is the case for AI. E.g., my impression is that this paper was largely positively received by various experts in the field, including experts that weren’t involved in the paper. AI risk looks much more contentious to me even if there are some very credible people talking about it. That could be driving some of the difference in responses, but yeah, the economic potential of AI probably drives a bunch of the difference too.
To add to that, Oeberst (2023) argues that all cognitive biases at heart are just confirmation bias based around a few “fundamental prior” beliefs. (A “belief” would be a hypothesis about the world bundled with an accuracy.) The fundamental beliefs are:
My experience is a reasonable reference
I make correct assessments of the world
I am good
My group is a reasonable reference
My group (members) is (are) good
People’s attributes (not context) shape outcomes
That is obviously rather speculative, but I think it’s some further weak reason to think motivated reasoning is in some sense a fundamental problem of rationality.
It seems like an obviously sensible thing to do from a game-theoretic point of view.
Hmm, seems highly contingent on how well-known the gift would be? And even if potential future Petrovs are vaguely aware that this happened to Petrov’s heirs, it’s not clear that it would be an important factor when they make key decisions. If anything, it would probably feel pretty speculative/distant as a possible positive consequence of doing the right thing, especially if those future decisions are not directly analogous to Petrov’s, such that it’s not clear whether it’s the same category. But yeah, mainly I just suspect this type of thing wouldn’t get enough attention to end up shifting important decisions in the future. Interesting idea, though. Upvoted.
Specific examples might include criticisms of RSPs, Kelsey’s coverage of the OpenAI NDA stuff, alleged instances of labs or lab CEOs misleading the public/policymakers, and perspectives from folks like Tegmark and Leahy (who generally see a lot of lab governance as safety-washing and probably have less trust in lab CEOs than the median AIS person).
Isn’t much of that criticism also a form of lab governance? I’ve always understood the field of “lab governance” as something like “analysing and suggesting improvements for practices, policies, and organisational structures in AI labs”. By that definition, many critiques of RSPs would count as lab governance, as could the coverage of OpenAI’s NDAs. But arguments of the sort “labs aren’t responsive to outside analyses/suggestions, dooming such analyses/suggestions” would indeed be criticisms of lab governance as a field or activity.
(ETA: Actually, I suppose there’s no reason why a piece of X research cannot critique X (the field it’s a part of). So my whole comment may be superfluous. But eh, maybe it’s worth pointing out that the stuff you propose adding can also be seen as a natural part of the field.)
Yes, this seems right to me. The OP says
The key point I will make is that, from a game-theoretic point of view, this race is not an arms race but a suicide race. In an arms race, the winner ends up better off than the loser, whereas in a suicide race, both parties lose massively if either one crosses the finish line.
But from a game-theoretic perspective, it can still make sense for the US to aggressively pursue AGI, even if one believes there’s a substantial risk of an AGI takeover in the case of a race, especially if the US acts in its own self interest. Even with this simple model, the optimal strategy would depend on how likely AGI takeover is, how bad China getting controllable AGI first would be from the point of view of the US, and how likely China is to also not race if the US does not race. In particular, if the US is highly confident that China will aggressively pursue AGI even if the US chooses to not race, then the optimal strategy for the US could be to race even if AGI takeover is highly likely.
So really I think some key cruxes here are:
How likely is AGI (or its descendants) to take over?
How likely is China to aggressively pursue AGI if the US chooses not to race?
And vice versa for China. But the OP doesn’t really make any headway on those.
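To illustrate how the optimal US strategy depends on those two cruxes, here is a toy expected-payoff sketch from the US point of view only. All of it is assumed for illustration: the payoff numbers are arbitrary, the takeover probability is taken to be the same whether one or both countries race, and China is assumed to race whenever the US does. It is not a model from the OP.

```python
def us_expected_payoffs(
    p_takeover: float,                 # chance of AGI takeover if anyone races
    p_china_races_if_us_holds: float,  # chance China races even if the US does not
    p_us_wins_race: float = 0.5,       # chance the US gets controllable AGI first
    u_takeover: float = -100.0,        # arbitrary payoff: AGI takeover
    u_us_first: float = 10.0,          # arbitrary payoff: US controllable AGI first
    u_china_first: float = -20.0,      # arbitrary payoff: China controllable AGI first
    u_no_race: float = 0.0,            # arbitrary payoff: nobody races
) -> dict:
    """Expected US payoff for racing versus holding back."""
    # If the US races, assume China races too and either side may win.
    no_takeover_race = (
        p_us_wins_race * u_us_first + (1 - p_us_wins_race) * u_china_first
    )
    race = p_takeover * u_takeover + (1 - p_takeover) * no_takeover_race

    # If the US holds back, China may race alone or also hold back.
    china_alone = p_takeover * u_takeover + (1 - p_takeover) * u_china_first
    hold = (
        p_china_races_if_us_holds * china_alone
        + (1 - p_china_races_if_us_holds) * u_no_race
    )
    return {"race": race, "hold": hold}


# Same high takeover risk, different beliefs about China's response.
china_certain_to_race = us_expected_payoffs(0.9, 1.0)
china_certain_to_hold = us_expected_payoffs(0.9, 0.0)
```

Under these assumptions, with a 90% takeover probability, racing comes out ahead when China is certain to race anyway, and holding back comes out ahead when China is certain to reciprocate. The conclusion is only as good as the inputs, which is the point of calling them cruxes.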
Additionally, I think there are a bunch of complicating details that also end up mattering, for example:
To what extent can two rival countries cooperate while simultaneously competing? The US and the Soviets did cooperate on multiple occasions, while engaged in intense geopolitical competition. That could matter if one thinks racing is bad because it makes cooperation harder (as opposed to being bad because it brings AGI faster).
How (if at all) does the magnitude of the leader’s lead over the follower change the probability of AGI takeover (i.e., does the leader need “room to manoeuvre” to develop AGI safely)?
Is the likelihood of AGI takeover lower when AGI is developed in some given country than in some other given country (all else equal)?
Is some sort of coordination more likely in worlds where there’s a larger gap between racing nations (e.g., because the leader has more leverage over the follower, or because a close follower is less willing to accept a deal)?
And adding to that, obviously constructs like “the US” and “China” are simplifications too, and the details around who actually makes and influences decisions could end up mattering a lot.
It seems to me all these things could matter when determining the optimal US strategy, but I don’t see them addressed in the OP.
for people who are not very good at navigating social conventions, it is often easier to learn to be visibly weird than to learn to adapt to the social conventions.
Are you basing this on intuition or personal experience or something else? I guess we should avoid basing it on observations of people who did succeed in that way. People who try and succeed in adapting to social conventions are likely much less noticeable/salient than people who succeed at being visibly weird.
Yeah that makes sense. I think I underestimated the extent to which “warning shots” are largely defined post-hoc, and events in my category (“non-catastrophic, recoverable accident”) don’t really have shared features (or at least features in common that aren’t also there in many events that don’t lead to change).
One man’s ‘warning shot’ is just another man’s “easily patched minor bug of no importance if you aren’t anthropomorphizing irrationally”, because by definition, in a warning shot, nothing bad happened that time. (If something had, it wouldn’t be a ‘warning shot’, it’d just be a ‘shot’ or ‘disaster’.)
I agree that “warning shot” isn’t a good term for this, but then why not just talk about “non-catastrophic, recoverable accident” or something? Clearly those things do sometimes happen, and there is sometimes a significant response going beyond “we can just patch that quickly”. For example:
The Three Mile Island accident led to major changes in US nuclear safety regulations and public perceptions of nuclear energy
9/11 led to the creation of the DHS, the Patriot Act, and 1-2 overseas invasions
The Chicago Tylenol murders killed only 7 but led to the development and use of tamper-resistant/evident packaging for pharmaceuticals
The Italian COVID outbreak of Feb/Mar 2020 arguably triggered widespread lockdowns and other (admittedly mostly incompetent) major efforts across the public and private sectors in NA/Europe
I think one point you’re making is that some incidents that arguably should cause people to take action (e.g., Sydney), don’t, because they don’t look serious or don’t cause serious damage. I think that’s true, but I also thought that’s not the type of thing most people have in mind when talking about “warning shots”. (I guess that’s one reason why it’s a bad term.)
I guess a crux here is whether we will get incidents involving AI that (1) cause major damage (hundreds of lives or billions of dollars), (2) are known to the general public or key decision makers, (3) can be clearly causally traced to an AI, and (4) happen early enough that there is space to respond appropriately. I think it’s pretty plausible that there’ll be such incidents, but maybe you disagree. I also think that if such incidents happen it’s highly likely that there’ll be a forceful response (though it could still be an incompetent forceful response).
I don’t really have a settled view on this; I’m mostly just interested in hearing a more detailed version of MIRI’s model. I also don’t have a specific expert in mind, but I guess the type of person that Akash occasionally refers to—someone who’s been in DC for a while, focuses on AI, and has encouraged a careful/diplomatic communication strategy.
“Be careful what you say, try to look normal, and slowly accumulate political capital and connections in the hope of swaying policymakers long-term” isn’t an unconditionally good strategy, it’s a strategy adapted to a particular range of situations and goals.
I agree with this. I also think that being more outspoken is generally more virtuous in politics, though I also see drawbacks with it. Maybe I wish the OP had mentioned some of the possible drawbacks of the outspoken strategy and whether there are sensible ways to mitigate those, or had just made clear that MIRI thinks they’re outweighed by the advantages. (There’s some discussion, e.g., the risk of being “discounted or uninvited in the short term”, but this seems to be mostly drawn from the “ineffective” bucket, not from the “actively harmful” bucket.)
AI risk is a pretty weird case, in a number of ways: it’s highly counter-intuitive, not particularly politically polarized / entrenched, seems to require unprecedentedly fast and aggressive action by multiple countries, is almost maximally high-stakes, etc.
Yeah, I guess this is a difference in worldview between me and MIRI, where I have longer timelines, am less doomy, and am more bullish on forceful government intervention, causing me to think increased variance is probably generally bad.
That said, I’m curious why you think AI risk is highly counterintuitive (compared to, say, climate change) -- it seems the argument can be boiled down to a pretty simple, understandable (if reductive) core (“AI systems will likely be very powerful, perhaps more than humans, controlling them seems hard, and all that seems scary”), and it has indeed been transmitted like that successfully in the past, in films and other media.
I’m also not sure why it’s relevant here that AI risk is relatively unpolarized—if anything, that seems like it should make it more important not to cause further polarization (at least if highly visible moral issues being relatively unpolarized represent unstable equilibriums)?
That’s one reason why an outspoken method could be better. But it seems like you’d want some weighing of the pros and cons here? (Possible drawbacks of such messaging could include it being more likely to be ignored, or cause a backlash, or cause the issue to become polarized, etc.)
Like, presumably the experts who recommend being careful what you say also know that some people discount obviously political speech, but still recommend/practice being careful what you say. If so, that would suggest this one reason is not on its own enough to override the experts’ opinion and practice.
Everything that happened since then has made it clear that this is not the case; that all these big flashy commitments like Superalignment were just safety-washing and virtue signaling. They were only going to do alignment work inasmuch as that didn’t interfere with racing full-speed towards greater capabilities.
It’s not clear to me that it was just safety-washing and virtue signaling. I think a better model is something like: there are competing factions within OAI that have different views, that have different interests, and that, as a result, prioritize scaling/productization/safety/etc. to varying degrees. Superalignment likely happened because (a) the safety faction (Ilya/Jan/etc.) wanted it, and (b) the Sam faction also wanted it, or tolerated it, or agreed to it due to perceived PR benefits (safety-washing), or let it happen as a result of internal negotiation/compromise, or something else, or some combination of these things.
If OAI as a whole was really only doing anything safety-adjacent for pure PR or virtue signaling reasons, I think its activities would have looked pretty different. For one, it probably would have focused much more on appeasing policymakers than on appeasing the median LessWrong user. (The typical policymaker doesn’t care about the superalignment effort, and likely hasn’t even heard of it.) It would also not be publishing niche (and good!) policy/governance research. Instead, it would probably spend that money on actual PR (e.g., marketing campaigns) and lobbying.
I do think OAI has been tending more in that direction (that is, in the direction of safety-washing, and/or in the direction of just doing less safety stuff period). But it doesn’t seem to me like it was predestined. I.e., I don’t think it was “only going to do alignment work inasmuch as that didn’t interfere with racing full-speed towards greater capabilities”. Rather, it looks to me like things have tended that way as a result of external incentives (e.g., looming profit, Microsoft) and internal politics (in particular, the safety faction losing power). Things could have gone quite differently, especially if the board battle had turned out differently. Things could still change, the trend could still reverse, even though that seems improbable right now.
Attention on AI X-Risk Likely Hasn’t Distracted from Current Harms from AI
Fwiw, there is also AI governance work that is neither policy nor lab governance, in particular trying to answer broader strategic questions that are relevant to governance, e.g., timelines, whether a pause is desirable, which intermediate goals are valuable to aim for, and how much computing power Chinese actors will have access to. I guess this is sometimes called “AI strategy”, but often the people/orgs working on AI governance also work on AI strategy, and vice versa, and they kind of bleed into each other.
How do you feel about that sort of work relative to the policy work you highlight above?
Open Philanthropy did donate $30M to OpenAI in 2017, and got in return the board seat that Helen Toner occupied until very recently. However, that was when OpenAI was a non-profit, and was done in order to gain some amount of oversight and control over OpenAI. I very much doubt any EA has donated to OpenAI unconditionally, or at all since then.
They often do things of the form “leaving out info, knowing this has misleading effects”
On that, here are a few examples of Conjecture leaving out info in what I think is a misleading way.
(Context: Control AI is an advocacy group, launched and run by Conjecture folks, that is opposing RSPs. I do not want to discuss the substance of Control AI’s arguments—nor whether RSPs are in fact good or bad, on which question I don’t have a settled view—but rather what I see as somewhat deceptive rhetoric.)
One, Control AI’s X account features a banner image with a picture of Dario Amodei (“CEO of Anthropic, $2.8 billion raised”) saying, “There’s a one in four chance AI causes human extinction.” That is misleading. What Dario Amodei has said is, “My chance that something goes really quite catastrophically wrong on the scale of human civilisation might be somewhere between 10-25%.” I understand that it is hard to communicate uncertainty in advocacy, but I think it would at least have been more virtuous to use the middle of that range (“one in six chance”), and to refer to “global catastrophe” or something rather than “human extinction”.
Two, Control AI writes that RSPs like Anthropic’s “contain wording allowing companies to opt-out of any safety agreements if they deem that another AI company may beat them in their race to create godlike AI”. I think that, too, is misleading. The closest thing Anthropic’s RSP says is:
However, in a situation of extreme emergency, such as when a clearly bad actor (such as a rogue state) is scaling in so reckless a manner that it is likely to lead to imminent global catastrophe if not stopped (and where AI itself is helpful in such defense), we could envisage a substantial loosening of these restrictions as an emergency response. Such action would only be taken in consultation with governmental authorities, and the compelling case for it would be presented publicly to the extent possible.
Anthropic’s RSP is clearly only meant to permit labs to opt out when any other outcome very likely leads to doom, and for this to be coordinated with the government, with at least some degree of transparency. The scenario is not “DeepMind is beating us to AGI, so we can unilaterally set aside our RSP”, but more like “North Korea is beating us to AGI, so we must cooperatively set aside our RSP”.
Relatedly, Control AI writes that, with RSPs, companies “can decide freely at what point they might be falling behind – and then they alone can choose to ignore the already weak” RSPs. But part of the idea with RSPs is that they are a stepping stone to national or international policy enforced by governments. For example, prior to the Control AI campaign, ARC and Anthropic both explicitly said that they hope RSPs will be turned into standards/regulation. (That seems quite plausible to me as a theory of change.) Also, Anthropic commits to only updating its RSP in consultation with its Long-Term Benefit Trust (consisting of five people without any financial interest in Anthropic) -- which may or may not work well, but seems sufficiently different from Anthropic being able to “decide freely” when to ignore its RSP that I think Control AI’s characterisation is misleading. Again, I don’t want to discuss the merits of RSPs, I just think Control AI is misrepresenting Anthropic’s and others’ positions.
Three, Control AI seems to say that Anthropic’s advocacy for RSPs is an instance of safetywashing and regulatory capture. (Connor Leahy: “The primary aim of responsible scaling is to provide a framework which looks like something was done so that politicians can go home and say: ‘We have done something.’ But the actual policy is nothing.” And also: “The AI companies in particular and other organisations around them are trying to capture the summit, lock in a status quo of an unregulated race to disaster.”) I don’t know exactly what Anthropic’s goals are—I would guess that its leadership is driven by a complex mixture of motivations—but I doubt it is so clear-cut as Leahy makes it out to be.
To be clear, I think Conjecture has good intentions, and wants the whole AI thing to go well. I am rooting for its safety work and looking forward to seeing updates on CoEm. And again, I personally do not have a settled view on whether RSPs like Anthropic’s are in fact good or bad, or on whether it is good or bad to advocate for them – it could well be that RSPs turn out to be toothless, and would displace better policy – I only take issue with the rhetoric.
(Disclosure: Open Philanthropy funds the organisation I work for, though the above represents only my views, not my employer’s.)
Some people have looked at this, sorta:
“We [have] a large language model (LLM), GPT-3.5, play two classic games: the dictator game and the prisoner’s dilemma. We compare the decisions of the LLM to those of humans in laboratory experiments. [… GPT-3.5] shows a tendency towards fairness in the dictator game, even more so than human participants. In the prisoner’s dilemma, the LLM displays rates of cooperation much higher than human participants (about 65% versus 37% for humans).”
“In this paper, we examine whether a ‘society’ of LLM agents can learn mutually beneficial social norms in the face of incentives to defect, a distinctive feature of human sociality that is arguably crucial to the success of civilization. In particular, we study the evolution of indirect reciprocity across generations of LLM agents playing a classic iterated Donor Game in which agents can observe the recent behavior of their peers. [...] Claude 3.5 Sonnet reliably generates cooperative communities, especially when provided with an additional costly punishment mechanism. Meanwhile, generations of GPT-4o agents converge to mutual defection, while Gemini 1.5 Flash achieves only weak increases in cooperation.”
“In this work, we investigate the cooperative behavior of three LLMs (Llama2, Llama3, and GPT3.5) when playing the Iterated Prisoner’s Dilemma against random adversaries displaying various levels of hostility. [...] Overall, LLMs behave at least as cooperatively as the typical human player, although our results indicate some substantial differences among models. In particular, Llama2 and GPT3.5 are more cooperative than humans, and especially forgiving and non-retaliatory for opponent defection rates below 30%. More similar to humans, Llama3 exhibits consistently uncooperative and exploitative behavior unless the opponent always cooperates.”
“[W]e let different LLMs (GPT-3, GPT-3.5, and GPT-4) play finitely repeated games with each other and with other, human-like strategies. [...] In the canonical iterated Prisoner’s Dilemma, we find that GPT-4 acts particularly unforgivingly, always defecting after another agent has defected only once. In the Battle of the Sexes, we find that GPT-4 cannot match the behavior of the simple convention to alternate between options.”
I think I’d guess roughly that “Claude is probably more altruistic and cooperative than the median Western human, most other models are probably about the same, or a bit worse, in these simulated scenarios”. But of course a major difference here is that the LLMs don’t actually have anything on the line: they don’t stand to earn or lose any money, for example, and if they did, they would have nothing to do with the money. So you might expect them to be more altruistic and cooperative than they would be under the conditions in which humans are tested.
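For readers unfamiliar with the setup, the “unforgiving” behaviour in the last quote above (always defecting once the other agent has defected a single time) matches what is usually called a grim trigger strategy. Below is a minimal sketch of an iterated Prisoner’s Dilemma contrasting it with tit-for-tat. The payoff values are the conventional textbook ones and the `defect_once` opponent is hypothetical; none of this is taken from the quoted papers.

```python
from typing import Callable, List, Tuple

C, D = "C", "D"

# Conventional textbook payoffs (assumed): (row player, column player).
PAYOFF = {(C, C): (3, 3), (C, D): (0, 5), (D, C): (5, 0), (D, D): (1, 1)}

# A strategy maps (own history, opponent history) to the next move.
Strategy = Callable[[List[str], List[str]], str]


def grim_trigger(own: List[str], opp: List[str]) -> str:
    """Cooperate until the opponent defects once, then defect forever."""
    return D if D in opp else C


def tit_for_tat(own: List[str], opp: List[str]) -> str:
    """Cooperate first, then copy the opponent's previous move."""
    return opp[-1] if opp else C


def defect_once(own: List[str], opp: List[str]) -> str:
    """Hypothetical opponent: defects only in the third round."""
    return D if len(own) == 2 else C


def play(a: Strategy, b: Strategy, rounds: int = 10) -> Tuple[List[str], List[str], int, int]:
    """Play the repeated game and return both histories and total scores."""
    hist_a: List[str] = []
    hist_b: List[str] = []
    score_a = score_b = 0
    for _ in range(rounds):
        move_a, move_b = a(hist_a, hist_b), b(hist_b, hist_a)
        pay_a, pay_b = PAYOFF[(move_a, move_b)]
        score_a += pay_a
        score_b += pay_b
        hist_a.append(move_a)
        hist_b.append(move_b)
    return hist_a, hist_b, score_a, score_b


grim_hist, _, grim_score, grim_opp_score = play(grim_trigger, defect_once)
tft_hist, _, tft_score, tft_opp_score = play(tit_for_tat, defect_once)
```

Against the same single defection, grim trigger defects in every remaining round, while tit-for-tat retaliates once and then returns to cooperation, which is the kind of difference in forgiveness the quoted studies are measuring.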