I like the “guardian” framing a lot! Besides the direct impact on human flourishing, I think a substantial fraction of x-risk comes from the deployment of superhumanly persuasive AI systems. It seems increasingly urgent that we deploy some kind of guardian technology that at least monitors, and ideally protects, against such superhuman persuaders.
Symbiosis is ubiquitous in the natural world, and is a good example of cooperation across what we normally would consider entity boundaries.
When I say the world selects for “cooperation” I mean it selects for entities that try to engage in positive-sum interactions with other entities, in contrast to entities that try to win zero-sum conflicts (power-seeking).
Agreed with the complicity point—as evo-sim experiments like Axelrod’s showed us, selecting for cooperation requires entities that can punish defectors, a condition the world of “hammers” fails to satisfy.
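To make that concrete, here is a minimal iterated prisoner's dilemma sketch (toy code of my own, not Axelrod's actual tournament setup): a defector exploits a population of unconditional cooperators, but a tit-for-tat player that punishes defection caps the defector's payoff while still sustaining cooperation with its own kind.

```python
# Toy iterated prisoner's dilemma: a defector thrives among unconditional
# cooperators ("hammers") but not among tit-for-tat players who punish defection.

PAYOFF = {('C', 'C'): (3, 3), ('C', 'D'): (0, 5),
          ('D', 'C'): (5, 0), ('D', 'D'): (1, 1)}

def play(strat_a, strat_b, rounds=100):
    """Return total payoffs for two strategies over repeated play."""
    score_a = score_b = 0
    hist_a, hist_b = [], []
    for _ in range(rounds):
        move_a, move_b = strat_a(hist_b), strat_b(hist_a)
        pay_a, pay_b = PAYOFF[(move_a, move_b)]
        score_a, score_b = score_a + pay_a, score_b + pay_b
        hist_a.append(move_a)
        hist_b.append(move_b)
    return score_a, score_b

always_cooperate = lambda opp_history: 'C'
always_defect = lambda opp_history: 'D'
tit_for_tat = lambda opp_history: opp_history[-1] if opp_history else 'C'

print(play(always_defect, always_cooperate))  # defector exploits the cooperator: (500, 0)
print(play(always_defect, tit_for_tat))       # punishment caps the defector's gain: (104, 99)
print(play(tit_for_tat, tit_for_tat))         # mutual cooperation: (300, 300)
```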
Depends on the offense-defense balance, I guess. E.g. if well-intentioned and well-coordinated actors control 90% of AI-relevant compute, then it seems plausible that they could defend against the other 10% being controlled by misaligned AGI or other bad actors: by denying them resources, by hardening core infrastructure, via MAD, etc.
I would be interested in a detailed analysis of pivotal act vs gradual steering; my intuition is that many of the differences dissolve once you try to calculate the value of specific actions. Some unstructured thoughts below:
Both aim to eventually end up in a state of existential security, where nobody can ever build an unaligned AI that destroys the world. Both have to deal with the fact that power is currently broadly distributed in the world, so most plausible stories in which we end up with existential security will involve the actions of thousands if not millions of people, distributed over decades or even centuries.
Pivotal acts have stronger claims of impact, but generally have weaker claims of the sign of that impact—actually realistic pivotal-seeming acts like “unilaterally deploy a friendly-seeming AI singleton” or “institute a stable global totalitarianism” are extremely, existentially dangerous. If someone identifies a pivotal-seeming act that is actually robustly positive, I’ll be the first to sign on.
In contrast, gradual steering proposals like “improve AI lab communication” or “improve interpretability” have weaker claims to impact, but stronger claims to being net positive across many possible worlds, and are much less subject to multi-agent problems like races and the unilateralist’s curse.
True, complete existential safety probably requires some measure of "solving politics" and locking in current human values, and hence may not be desirable. Like, what if the Long Reflection decides that the negative utilitarians are right and the world should in fact be destroyed? I don't put high credence on that, but there is some level of accidental existential risk we should be willing to accept in order not to lock in our values.
You might find AI Safety Endgame Stories helpful—I wrote it last week to try to answer this exact question, covering a broad array of (mostly non-pivotal-act) success stories from technical and non-technical interventions.
Nate’s “how various plans miss the hard bits of the alignment challenge” might also be helpful as it communicates the “dynamics of doom” that success stories have to fight against.
One thing I would love to have is a categorization of safety stories by claims about the world. E.g. what does successful intervention look like in worlds where one or more of the following claims hold:
No serious global treaties on AI ever get signed.
Deceptive alignment turns out not to be a problem.
Mechanistic interpretability becomes impractical for large enough models.
CAIS turns out to be right, and AI agents simply aren’t economically competitive.
Multi-agent training becomes the dominant paradigm for AI.
Due to a hardware / software / talent bottleneck there turns out to be one clear AI capabilities leader with nobody else even close.
These all seem like plausible worlds to me, and it would be great if we had more clarity about what worlds different interventions are optimizing for. Ideally we should have bets across all the plausible worlds in which intervention is tractable, and I think that’s currently far from being true.
I don’t mean to suggest “just supporting the companies” is a good strategy, but there are promising non-power-seeking strategies like “improve collaboration between the leading AI labs” that I think are worth biasing towards.
Maybe the crux is how strongly capitalist incentives bind AI lab behavior. I think none of the currently leading AI labs (OpenAI, DeepMind, Google Brain) are actually so tightly bound by capitalist incentives that their leaders couldn’t delay AI system deployment by at least a few months, and probably more like several years, before capitalist incentives in the form of shareholder lawsuits or new entrants that poach their key technical staff have a chance to materialize.
Interesting, I haven’t seen anyone write about hardware-enabled attractor states but they do seem very promising because of just how decisive hardware is in determining which algorithms are competitive. An extreme version of this would be specialized hardware letting CAIS outcompete monolithic AGI. But even weaker versions would lead to major interpretability and safety benefits.
Fabricated options are products of incoherent thinking; what is the incoherence you’re pointing out with policies that aim to delay existential catastrophe or reduce transaction costs between existing power centers?
I've considered starting an org that would either be aimed at generating better alignment data or would do so as a side effect, so this is really helpful; this kind of negative information is nearly impossible to find.
Is there a market niche for providing more interactive forms of human feedback, where it’s important to have humans tightly in the loop with an ML process, rather than “send a batch to raters and get labels back in a few hours”? One reason RLHF is so little used is the difficulty of setting up this kind of human-in-the-loop infrastructure. Safety approaches like debate, amplification and factored cognition could also become competitive much faster if it was easier and faster to get complex human-in-the-loop pipelines running.
Maybe Surge already does this? But if not, you wouldn’t necessarily want to compete with them on their core competency of recruiting and training human raters. Just use their raters (or Scale’s), and build good reusable human-in-the-loop infrastructure, or maybe novel user interfaces that improve supervision quality.
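To gesture at what I mean by reusable human-in-the-loop infrastructure, here is a hypothetical sketch (all class and function names are made up for illustration): the point is a single synchronous interface that a training loop can call, whether the comparisons come from Surge/Scale raters behind an API or from a local console.

```python
from dataclasses import dataclass
from typing import List, Protocol

@dataclass
class Comparison:
    prompt: str
    completion_a: str
    completion_b: str

class FeedbackBackend(Protocol):
    def compare(self, item: Comparison) -> int:
        """Return 0 if completion_a is preferred, 1 if completion_b is."""
        ...

class ConsoleBackend:
    """Toy backend: ask a human at the terminal, inside the training loop."""
    def compare(self, item: Comparison) -> int:
        print(item.prompt)
        print("A:", item.completion_a)
        print("B:", item.completion_b)
        return 0 if input("Prefer A or B? ").strip().upper() == "A" else 1

def collect_preferences(comparisons: List[Comparison], backend: FeedbackBackend) -> List[int]:
    # In a real RLHF pipeline these labels would feed a reward-model update;
    # the point here is just the synchronous, swappable call pattern.
    return [backend.compare(c) for c in comparisons]
```

Swapping ConsoleBackend for an API-backed rater pool would be the "tightly in the loop" version of what currently gets done by batching to raters overnight.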
I think a substantial fraction of ML researchers probably agree with Yann LeCun that AI safety will be solved “by default” in the course of making the AI systems useful. The crux is probably related to questions like how competent society’s response will be, and maybe the likelihood of deceptive alignment.
Two points of disagreement though:
I don't think setting P(doom) = 10% indicates a lack of engagement or imagination; Toby Ord in The Precipice also gives a 10% estimate for AI-derived x-risk this century, and I assume he's engaged pretty deeply with the alignment literature.
I don’t think P(doom) = 10% or even 5% should be your threshold for “taking responsibility”. I’m not sure I like the responsibility frame in general, but even a 1% chance of existential risk is big enough to outweigh almost any other moral duty in my mind.
Thank you for putting numbers on it!
~60%: there will be an existential catastrophe due to deceptive alignment specifically.
Is this an unconditional prediction of a 60% chance of existential catastrophe due to deceptive alignment alone, in contrast to the commonly used 10% chance of existential catastrophe from all AI sources this century? Or do you mean that, conditional on there being an existential catastrophe due to AI, there is a 60% chance it will be caused by deceptive alignment and a 40% chance it will be caused by other problems like misuse or outer alignment?
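To spell out the two readings in notation (mine, not necessarily yours):

$$P(\text{existential catastrophe caused by deceptive alignment}) = 0.6$$

versus

$$P(\text{caused by deceptive alignment} \mid \text{existential catastrophe from AI}) = 0.6,$$

where the second reading, combined with e.g. $P(\text{existential catastrophe from AI}) = 0.1$, would give an unconditional $0.6 \times 0.1 = 0.06$.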
Agreed with the sentiment, though I would make a weaker claim, that AGI timelines are not uniquely strategically relevant, and the marginal hour of forecasting work at this point is better used on other questions.
My guess is that the timelines question has been investigated and discussed so heavily because for many people it is a crux for whether or not to work on AI safety at all—and there are many more such people than there are alignment researchers deciding what approach to prioritize. Most people in the world are not convinced that AGI safety is a pressing problem, and building very robust and legible models showing that AGI could happen soon is, empirically, a good way to convince them.
Evan’s post argues that if search is computationally optimal (in the sense of being the minimal circuit) for a task, then we can construct a task where the minimal circuit that solves it is deceptive.
This post argues against (a version of) Evan’s premise: search is not in fact computationally optimal in the context of modern tasks and architectures, so we shouldn’t expect gradient descent to select for it.
Other relevant differences:
Gradient descent doesn't actually select for low time complexity / minimal circuits; it holds time and space complexity fixed while selecting for low L2 norm (see the sketch below). But I think you could probably do a similar reduction for L2 norm to the one Evan does for minimal circuits; the crux is in the premise.
I think Evan is using a broader definition of search than I am in this post, closer to John Wentworth’s definition of search as “general problem solving algorithm”.
Evan is doing worst-case analysis (can we completely rule out the possibility of deception by penalizing time complexity?) whereas I’m focusing on the average or default case.
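To illustrate the L2 point above, here is a minimal PyTorch-style sketch (my own toy example, not from either post): the architecture, and hence the circuit's time and space complexity, is fixed before training, and what weight decay adds is pressure toward small L2 norm, either via the optimizer's weight_decay argument or, equivalently up to a constant factor for plain SGD, via an explicit penalty term in the loss.

```python
# Toy sketch: the selection pressure weight decay adds is lambda * ||theta||^2,
# not "use a smaller or faster circuit". The architecture below fixes
# time/space complexity before any training happens.

import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(32, 64), nn.ReLU(), nn.Linear(64, 1))

# Option 1: bake the L2 pressure into the optimizer via weight decay:
#   optimizer = torch.optim.SGD(model.parameters(), lr=1e-2, weight_decay=1e-4)
# Option 2 (shown here): add an explicit lambda * ||theta||^2 term to the loss.
optimizer = torch.optim.SGD(model.parameters(), lr=1e-2)

def loss_with_l2(pred, target, lam=1e-4):
    mse = nn.functional.mse_loss(pred, target)
    l2 = sum(p.pow(2).sum() for p in model.parameters())
    return mse + lam * l2

x, y = torch.randn(16, 32), torch.randn(16, 1)
loss_with_l2(model(x), y).backward()
optimizer.step()  # nothing in this update rewards shorter programs or faster circuits, only smaller weights
```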
Agreed with Rohin that a key consideration is whether you are trying to form truer beliefs, or to contribute novel ideas, and this in turn depends on what role you are playing in the collective enterprise that is AI safety.
If you’re the person in charge of humanity’s AI safety strategy, or a journalist tasked with informing the public, or a policy person talking to governments, it makes a ton of sense to build a “good gears-level model of what their top 5 alignment researchers believe and why”. If you’re a researcher, tasked with generating novel ideas that the rest of the community will filter and evaluate, this is probably not where you want to start!
In particular I basically buy the "unique combination of facts" model of invention: you generate novel ideas when you have a unique subset of the collective's knowledge, so the ideas seem obvious to you and weird or wrong to everyone else.
Two examples from leading scientists (admittedly in more paradigmatic fields):
I remember Geoff Hinton saying at his Turing award lecture that he strongly advised new grad students not to read the literature before trying, for months, to solve the problem themselves.
Richard Hamming in You and Your Research:
If you read all the time what other people have done you will think the way they thought. If you want to think new thoughts that are different, then do what a lot of creative people do—get the problem reasonably clear and then refuse to look at any answers until you’ve thought the problem through carefully how you would do it, how you could slightly change the problem to be the correct one. So yes, you need to keep up. You need to keep up more to find out what the problems are than to read to find the solutions.
Would love to see your math! If L2 norm and Kolmogorov complexity provide roughly equivalent selection pressure, that's definitely a crux for me.
Agreed that the existence of general-purpose heuristic-generators like relaxation is a strong argument for why we should expect to select for inner optimizers that look something like A*, contrary to my "gradient descent doesn't select for inner search" post.
Recursive structure creates an even stronger bias toward things like A* but only in recurrent neural architectures (so notably not currently-popular transformer architectures, though it’s plausible that recurrent architectures will come back).
I maintain that the compression / compactness argument from “Risks from Learned Optimization” is wrong, at least in the current ML regime:
In general, evolved/trained/selected systems favor more compact policies/models/heuristics/algorithms/etc. In ML, for instance, the fewer parameters needed to implement the policy, the more parameters are free to vary, and therefore the more parameter-space-volume the policy takes up and the more likely it is to be found. (This is also the main argument for why overparameterized ML systems are able to generalize at all.)
I believe the standard explanation is that overparametrized ML finds generalizing models because gradient descent with weight decay finds policies that have low L2 norm, not low description length / Kolmogorov complexity. See Neel’s recent interpretability post for an example of weight decay slowly selecting a generalizable algorithm over (non-generalizable) memorization over the course of training.
I don’t understand the parameter-space-volume argument, even after a long back-and-forth with Vladimir Nesov here. If it were true, wouldn’t we expect to be able to distill models like GPT-3 down to 10-100x fewer parameters? In practice we see maybe 2x distillation before dramatic performance losses, meaning most of those parameters really are essential to the learned policy.
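For concreteness, the kind of distillation I have in mind is standard soft-label knowledge distillation (a sketch of the Hinton et al. formulation, not a claim about how any particular GPT-3 compression result was obtained):

```python
# Soft-label knowledge distillation: train a smaller student to match a frozen
# teacher's output distribution. If most teacher parameters were redundant
# "free directions", we'd expect this to work at far higher compression ratios
# than the ~2x we actually see.

import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    """KL divergence between softened teacher and student distributions."""
    student_log_probs = F.log_softmax(student_logits / temperature, dim=-1)
    teacher_probs = F.softmax(teacher_logits / temperature, dim=-1)
    return F.kl_div(student_log_probs, teacher_probs,
                    reduction="batchmean") * temperature ** 2
```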
Overall though this post updated me substantially towards expecting the emergence of inner A*-like algorithms, despite their computational overhead. Added it to the list of caveats in my post.
Yeah, it's probably definitions. With the caveat that I don't mean the narrow "literally iterates over solutions", but roughly "behaves (especially off the training distribution) as if it's iterating over solutions", like Abram Demski's term "selection".
I disagree that performing search is central to human capabilities relative to other species. The cultural intelligence hypothesis seems much more plausible: humans are successful because our language and ability to mimic allow us to accumulate knowledge and coordinate at massive scale across both space and time. Not because individual humans are particularly good at thinking or optimizing or performing search. (Not sure what the implications of this are for AI).
You're right though, I didn't say much about alternative algorithms other than pointing vaguely in the direction of hierarchical control. I mostly want to warn people not to reason about inner optimizers the way they would about search algorithms. But if it helps, I think AlphaStar is a good example of an algorithm that is superhuman in a very complex strategic domain but is very likely not doing anything like "evaluating many possibilities before settling on an action". In contrast to AlphaZero (with rollouts), which considers tens of thousands of positions before selecting an action. AlphaZero (just the policy network) I'm more confused about... I expect it still isn't doing search, but it is literally trained to imitate the outcome of a search, so it might have similar misgeneralization properties?
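A toy illustration of the structural difference I mean (nothing here is meant to resemble how AlphaStar or AlphaZero actually work internally; the environment and both policies are made up): a reactive policy maps state to action in one pass, while a search-based agent simulates many futures per candidate action before committing.

```python
# Toy "walk to the goal on a line" task, used only to contrast the two shapes
# of action selection.

import random

ACTIONS = [-1, +1]
GOAL = 5

def step(state, action):
    return state + action

def value(state):
    """Crude stand-in for a learned value function."""
    return -abs(GOAL - state)

def reactive_policy(state):
    """Reactive: map state directly to action, no enumeration of futures."""
    return +1 if state < GOAL else -1

def search_policy(state, depth=3, rollouts=200):
    """Search-like: score each action by simulating many random futures."""
    def rollout(s):
        for _ in range(depth):
            s = step(s, random.choice(ACTIONS))
        return value(s)
    return max(ACTIONS, key=lambda a: sum(rollout(step(state, a)) for _ in range(rollouts)))

print(reactive_policy(0), search_policy(0))  # both pick +1, by very different routes
```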