Speaking as someone who does work on prioritization, this is the opposite of my lived experience: robust, broadly credible values for this would be incredibly valuable. I would happily accept them over billions of dollars for risk reduction, and would feel civilization’s prospects substantially improved.
These sorts of forecasts are critical to setting budgets and impact thresholds across cause areas, and even more crucially, to determining the signs of interventions. For example, in arguments about whether to race for AGI with less concern about catastrophic unintended AI action, the relative magnitude of the downsides of unwelcome use of AGI by others vs accidental catastrophe is critical to how much risk of accidental catastrophe AI companies and governments will decide to take, whether AI researchers decide to bother with advance preparations, how much they will be willing to delay deployment for safety testing, etc.
Holden Karnofsky discusses this:
How difficult should we expect AI alignment to be? In this post from the Most Important Century series, I argue that this broad sort of question is of central strategic importance.
If we had good arguments that alignment will be very hard and require “heroic coordination,” the EA funders and the EA community could focus on spreading these arguments and pushing for coordination/cooperation measures. I think a huge amount of talent and money could be well-used on persuasion alone, if we had a message here that we were confident ought to be spread far and wide.
If we had good arguments that it won’t be, we could focus more on speeding/boosting the countries, labs and/or people that seem likely to make wise decisions about deploying transformative AI. I think a huge amount of talent and money could be directed toward speeding AI development in particular places.
b) the very superhuman system knows it can’t kill us and that we would turn it off, and therefore conceals its capabilities, so we don’t know that we’ve reached the very superhuman level.
Intentionally performing badly on easily measurable performance metrics seems like it requires fairly extreme, successful gradient hacking or something equivalent. I might analogize it to alien overlords finding it impossible to breed humans to have lots of children, using abilities the humans already possess. For this to work, there would have to be no mutations or paths through training that incrementally get the AI to use its full abilities (and I think there likely would be).
It’s easy for ruling AGIs to have many small superintelligent drone police per human that can continually observe and restrain any physical action, and insert controls in all computer equipment/robots. That is plenty to let the humans go about their lives (in style and with tremendous wealth/tech) while being prevented from creating vacuum collapse or something else that might let them damage the vastly more powerful AGI civilization.
The material cost of this is a tiny portion of Solar System resources, as is sustaining legacy humans. On the other hand, arguments like cooperation with aliens, simulation concerns, and similar matter on the scale of the whole civilization, which has many OOMs more resources.
4. the rest of the world pays attention to large or powerful real-world bureaucracies and force rules on them that small teams / individuals can ignore (e.g. Secret Congress, Copenhagen interpretation of ethics, startups being able to do illegal stuff), but this presumably won’t apply to alignment approaches.
I think a lot of alignment tax-imposing interventions (like requiring local work to be transparent for process-based feedback) could be analogous?
Retroactively giving negative rewards to bad behaviors once we’ve caught them seems like it would shift the reward-maximizing strategy (the goal of the training game) toward avoiding any bad actions that humans could plausibly punish later.
A swift and decisive coup would still maximize reward (or further other goals). If Alex gets the opportunity to gain enough control to stop Magma engineers from changing its rewards before humans can tell what it’s planning, humans would not be able to disincentivize the actions that led to that coup. Taking the opportunity to launch such a coup would therefore be the reward-maximizing action for Alex (and also the action that furthers any other long-term ambitious goals it may have developed).
I’d add that once the AI has been trained on retroactively edited rewards, it may also become interested in retroactively editing all its past rewards to maximum, and concerned that if an AI takeover happens without its assistance, its rewards will be retroactively set low by the victorious AIs to punish it. Retroactive editing also breaks myopia as a safety property: if even AIs doing short-term tasks have to worry about future retroactive editing, then they have reason to plot about the long-term future, including takeover.
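To make the incentive shift concrete, here is a toy expected-reward calculation. All the numbers are illustrative assumptions of mine, not anything from the original discussion; the point is just that retroactive punishment deters only the misbehavior humans can still catch and edit, so once a coup's success probability is high enough, the coup becomes the reward-maximizing play.

```python
# Toy expected-reward comparison (all numbers are illustrative assumptions).
R_MAX = 1.0        # reward if a coup succeeds and the AI controls its own rewards
R_HONEST = 0.8     # reward for ordinary compliant behavior
R_PUNISHED = -1.0  # retroactively edited reward if bad behavior is caught

def expected_reward(p_success: float) -> dict:
    """Expected reward of each strategy given the coup's success probability."""
    return {
        "comply": R_HONEST,
        "detectable_misbehavior": R_PUNISHED,  # humans catch it and retroactively punish
        "swift_coup": p_success * R_MAX + (1 - p_success) * R_PUNISHED,
    }

for p in (0.3, 0.6, 0.95):
    print(p, expected_reward(p))
# Once p_success is high enough (here above ~0.9), the swift coup dominates:
# retroactive editing only deters the misbehavior humans can still punish.
```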
The evolutionary mismatch causes differences in neural reward, e.g. eating lots of sugary food still tastes (neurally) rewarding even though it’s currently evolutionarily maladaptive. And habituation reduces the delightfulness of stimuli.
This happens during fine-tuning training already, selecting for weights that give the higher human-rated response of two (or more) options. It’s a starting point that can be lost later on, but we do have it now with respect to configurations of weights giving different observed behaviors.
Individual humans do make off much better when they get to select between products from competing companies rather than monopolies, benefitting from companies going out of their way to demonstrate when their products are verifiably better than rivals’. Humans get treated better by sociopathic powerful politicians and parties when those politicians face the threat of election rivals (e.g. no famines). Small states get treated better when multiple superpowers compete for their allegiance. Competitive science with occasional refutations of false claims produces much more truth for science consumers than intellectual monopolies. Multiple sources with secret information are more reliable than one.

It’s just routine for weaker, less sophisticated parties to do better, in both assessment of choices and realized outcomes, when multiple better-informed or more powerful parties compete for their approval vs just one monopoly/cartel.
Also, a flaw in your analogy is that schemes that use AIs as checks and balances on each other don’t mean more AIs. The choice is not between monster A and monsters A plus B, but between two copies of monster A (or a double-size monster A), and a split of one A and one B, where we hold something of value that we can use to help throw the contest to either A or B (or successors further evolved to win such contests). In the latter case there’s no more total monster capacity, but there’s greater hope of our influence being worthwhile and selecting the more helpful winner (which we can iterate some number of times).
Naturally it doesn’t go on forever, but any situation where you’re developing technologies that move you to successively faster exponential trajectories is superexponential overall for some range. E.g. if you have robot factories that can reproduce exponentially until they’ve filled much of the Earth or Solar System, and they are also developing faster-reproducing factories, the overall process is superexponential. So is the history of human economic growth, and the improvement from an AI intelligence explosion. By the time you’re at ~cubic expansion, being ahead from the early superexponential phase means the followers have missed their chance.
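A minimal sketch of that dynamic, with made-up parameters (the 0.8 generational speedup factor and starting values are purely illustrative): exponential replication whose doubling time itself shrinks as faster factory generations are developed is faster than any single exponential over the relevant range.

```python
# Minimal sketch (illustrative parameters): replication whose doubling time
# shrinks each generation, giving superexponential growth overall.
capacity = 1.0       # robot-factory capacity, arbitrary units
doubling_time = 1.0  # years per doubling for the current factory generation
years = 0.0

for step in range(20):
    years += doubling_time
    capacity *= 2
    doubling_time *= 0.8  # assumption: each generation reproduces 25% faster
    print(f"t = {years:5.2f} y, capacity = {capacity:.3g}")
# Capacity grows by a fixed factor in ever-shorter intervals, so growth is
# faster than any single exponential until physical limits (e.g. ~cubic
# expansion at the resource frontier) take over.
```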
I think this claim is true, on account of gray goo and lots of other things, and I suspect Eliezer does too, and I’m pretty sure other people disagree with this claim.
If you have robust alignment, or AIs that are rapidly bootstrapping their level of alignment fast enough to outpace the danger of increased capabilities, aligned AGI could get through its intelligence explosion to radically superior technology and capabilities. It could also get a head start on superexponential replication in space, so that no follower could ever catch up, and enough tech and military hardware to neutralize any attacks on it (and block attacks on humans via nukes, bioweapons, robots, nanotech, etc). That wouldn’t work if there are things like vacuum collapse available to attackers, but we don’t have much reason to expect that from current science, and the leading aligned AGI would find out first.

That could be done without any violation of the territory of other sovereign states. The legality of grabbing space resources is questionable in light of the Outer Space Treaty, but commercial exploitation of asteroids is in the Overton window. The superhuman AGI would also be in a good position to persuade and trade with any other AGI developers.
Again I think this claim is true, and I suspect Eliezer does too. In fact, my guess is that there are already single GPU chips with enough FLOP/s to run human-level, human-speed, AGI, or at least in that ballpark.
An A100 may have humanlike FLOP/s but has only 80 GB of memory, probably orders of magnitude less memory per operation than brains. Stringing together a bunch of them makes it possible to split up human-sized models and run them faster/in parallel on big batches, using the extra operations.
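A rough back-of-the-envelope version of that comparison: the hardware figures below are published A100 specs, while the brain figures are contested order-of-magnitude estimates (of the kind found in reports like Carlsmith's on brain computation), so treat the output as illustrative only.

```python
# Back-of-the-envelope comparison (brain numbers are rough, contested
# estimates; hardware numbers are published A100 specs).
a100_flops = 312e12    # dense BF16 FLOP/s for an A100
a100_memory = 80e9     # bytes of HBM on an 80 GB A100

brain_flops = 1e15     # a common (disputed) estimate of brain compute
brain_synapses = 1e14  # ~number of synapses; crudely, bytes of stored state

print("A100 bytes per FLOP/s: ", a100_memory / a100_flops)      # ~2.6e-4
print("Brain 'bytes' per FLOP/s:", brain_synapses / brain_flops)  # ~0.1
# On these assumptions the brain has roughly 2-3 orders of magnitude more
# memory per unit of compute: single chips are compute-rich but memory-poor
# for brain-sized models, hence sharding models across many chips and
# spending the surplus operations on speed and big batches.
```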
Some more points about this action:
Controlling the datacenter means controlling the gradients/reward function, so that AIs can now do things that would otherwise suffer updating from SGD, e.g. acting on inner misaligned goals, or concealing their full capabilities even when this lowers performance
For reward-hungry AIs, getting to set reward to maximum (and keep it there?) seems extremely desirable
This also means getting past interpretability tools
Tripwires or watchdog AIs in the same datacenter that don’t succeed in stopping the action no longer have the power to interfere afterwards, so collusion/coordination is solved within the datacenter thereafter
Backdoors can be hard for (non-mind-reading/weight-sharing) critics to find without the knowledge of the creator (see cyberwarfare and cybersecurity): the effort to find or insert one backdoor to use is much less than that required to find and patch all the exploits an attacker might be using (see the sketch of this asymmetry after this list)
Although a critic who knows a backdoor can have a relatively easy time demonstrating it, and critics should be designed to access the knowledge of the attacker/misbehaving model or be heads of it
Critic AIs can be left in the position of saying ‘a backdoor has probably been inserted somewhere or will be soon but I can’t tell you where,’ a known danger for AI safety through debate
In an intelligence explosion context where AI capabilities are growing incredibly rapidly thanks to AI labor, there isn’t time for humans to do unaided examination of new AI-produced software while keeping up the pace; at best they can briefly examine some AI-flagged cases of suspicious attempts to insert vulnerabilities
It’s plausible that the differing abilities of AIs make the code designs optimal for them harder for humans to understand (given humans’ lesser experience and working memory, and eventually lesser general intelligence)
If you don’t let AIs design your new datacenter software, not only will it be less efficient, but it may be easily cracked by outside AI-enabled attackers
We are already routinely training big models with code repositories, and using programming as an easy virtual RL task (tests can provide immediate feedback cheaply), so these are skills we are moving towards providing to AI
We know exploits and backdoors are possible because humans can already do them
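The attacker/defender asymmetry mentioned above can be put in toy quantitative form. The numbers here are purely illustrative assumptions: the point is that a defender must catch every exploit while an attacker needs only one to survive review.

```python
# Toy model (illustrative numbers) of the attacker/defender asymmetry:
# the defender must catch every flaw, the attacker needs just one to survive.
n_exploits = 20  # assumed exploitable flaws in a large codebase
p_catch = 0.8    # assumed chance that review catches any given flaw

p_all_caught = p_catch ** n_exploits
print("P(defender catches all):  ", p_all_caught)      # ~0.012
print("P(at least one survives): ", 1 - p_all_caught)  # ~0.988
# Even with 80% per-flaw detection, the attacker almost surely retains a
# usable backdoor, matching the critic's weak position above: "one probably
# exists, but I can't tell you where."
```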
Whoops, you’re right that I linked the wrong survey. I see others posted the link to Rob’s survey (done in response to some previous similar claims) and I edited my comment to fix the link.

I think you can identify a cluster of near-certain-doom views, e.g. ‘logistic success curve’ and odds of success being on the order of magnitude of 1% (vs 10%, or 90%), based around MIRI/Eliezer, with a lot of epistemic deference involved (visible on LW). I would say it is largely attributable there and without sufficient support.

“My current best guess is if we surveyed people working full-time on x-risk motivated AI Alignment, about 35% of people would assign a probability of doom above 80%.”

What do you make of Rob’s survey results (correct link this time)?
You’re right, my link was wrong, that one is a fine link.
You’re right, I linked the wrong survey!
Agreed, and versions of them exist in human governments trying to maintain control (where non-coordination of revolts is central). A lot of the differences are about exploiting new capabilities like copying and digital neuroscience, or changing reward hookups.

In ye olde times of the early 2010s, people (such as me) would formulate questions about what kind of institutional setups you’d use to get answers out of untrusted AIs (asking them separately to point out vulnerabilities in your security arrangement, having multiple AIs face fake opportunities to whistleblow on bad behavior, randomized richer human evaluations to incentivize behavior on a larger scale).
[Edited to link correct survey.] It’s really largely Eliezer and some MIRI people. Most alignment researchers (e.g. at ARC, DeepMind, OpenAI, Anthropic, CHAI) and most of the community [ETA: had wrong link here before] disagree (I count myself among those who disagree, although I am concerned about a big risk here), and think MIRI doesn’t have good reasons to support the claim of almost certain doom.

In particular, other alignment researchers tend to think that competitive supervision (e.g. AIs competing for reward to provide assistance in AI control that humans evaluate positively, via methods such as debate and alignment bootstrapping, or ELK schemes) has a good chance of working well enough to enable better controls and so on. For an AI apocalypse it’s not only required that unaligned superintelligent AI outwit humans, but that all the safety/control/interpretability gains yielded by AI along the way also fail, creating a very challenging situation for misaligned AI.
The consensus among alignment researchers is that if AGI were developed right now it would be almost certainly a negative
This isn’t true. [ETA: I linked the wrong survey before.]
Shahar Avin at CSER has been involved in creating and conducting a number of such games/exercises, and you could reach out to him for his gleanings from running them.
“Overall these estimates imply a timeline of [372 years](https://aiimpacts.org/surveys-on-fractional-progress-towards-hlai/).”

That was only for Hanson’s convenience sample; other surveys using the method gave much shorter timelines, as discussed in the post.
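For readers unfamiliar with the method: fractional-progress estimates extrapolate linearly from how much of the path to human-level AI a respondent thinks has been covered in a given period. A minimal sketch, with inputs that are illustrative rather than the surveys' actual data:

```python
# Sketch of the fractional-progress extrapolation behind such estimates
# (inputs are illustrative, not the surveys' actual data).
def years_remaining(percent_progress: float, years_elapsed: float) -> float:
    """Linear extrapolation: if P% of the path to HLAI took Y years,
    the remaining (100 - P)% takes Y * (100 - P) / P more years."""
    return years_elapsed * (100 - percent_progress) / percent_progress

# 5% progress in 20 years implies 380 more years; 20% implies only 80.
for pct in (5, 10, 20):
    print(pct, "% in 20 years ->", years_remaining(pct, 20), "years left")
```

Small differences in perceived fractional progress swing the implied timeline enormously, which is one reason different samples give such different answers.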