My disagreements with “AGI ruin: A List of Lethalities”
This is probably going to be a long post, so grab a drink and a snack before you start reading.
This is an edited version of my own comment on the post below, and I formatted and edited the quotes and content in line with what @MondSemmel recommended:
My comment: https://www.lesswrong.com/posts/uMQ3cqWDPHhjtiesc/?commentId=Gcigdmuje4EacwirD
MondSemmel’s comment: https://www.lesswrong.com/posts/uMQ3cqWDPHhjtiesc/?commentId=WcKi4RcjRstoFFvbf
The post I’m responding to: https://www.lesswrong.com/posts/uMQ3cqWDPHhjtiesc/
I’ll start with the first point I disagree with:
Response to Lethality 3
We need to get alignment right on the ‘first critical try’ at operating at a ‘dangerous’ level of intelligence, where unaligned operation at a dangerous level of intelligence kills everybody on Earth and then we don’t get to try again.
I think this is actually wrong, because synthetic data lets us control what the AI learns and what it values. In particular, we can place honeypots that are practically indistinguishable from the real world, so that an AI trying to deceive us or gain power almost certainly doesn’t know whether it is being tested or is in the real world:
Because of this, given our theoretical complete control over the data input of these models, it will be possible to train them inside what is effectively a simulated world slightly different from our own. This will allow us to maximize their capabilities along axes that are safe and useful to us, while minimizing the risks of deception and take-over and being able to setup traps to detect when these are likely. If the model is not trained on human data, then we can make its understanding of our world subtly different from the reality. If done carefully and cleverly, this will make it very easy to fool and be able to discover e.g. deceptiveness, or attempted takeover in a ‘realistic’ simulated world that the model is put inside during red-teaming. While deceiving an adversary with direct access to your cognition is hard enough, doing so when you have been brought up in a world entirely controlled by that adversary, and which is feeding you subtly incorrect and misleading information about the adversary and their capabilities is likely orders of magnitude more difficult. Jacob Cannell has argued along similar lines of training potential AGI systems in simulations of fantasy worlds, where takeover can be simulated but is far enough away that skills do not transfer. Fully synthetic datasets are necessary for this and indeed, a fantasy world is likely far from what we need. Instead, we only need to simulate a world very close to our own but with slightly different affordances – maybe specific software vulnerabilities are simulated to exist but are actually patched in our world, maybe certain peptide-making companies accept all orders in the simulated world but have more checks in our world, maybe certain interpretability techniques are claimed not to work in the simulation, but do in fact work in our world – to cause any attempted takeover behaviour or adversarial deception to appear extremely plausible to the AI but in fact be detectable and defendable in our reality. 
In fact, we should design simulated worlds with ‘honeytraps’, which appear to be loopholes subtly planted in the AI’s input data to encourage them to attempt to pursue their misaligned behaviour, but which we know of ahead of time and can monitor and catch.
It’s the same reason we can’t break out of the simulation IRL, except that we don’t have to face adversarial cognition, so the AI’s task is even harder than ours.
See also this link:
https://www.beren.io/2024-05-11-Alignment-in-the-Age-of-Synthetic-Data/
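To make the honeytrap idea concrete, here is a minimal sketch, assuming the simulated world exposes a log of resources the model tried to touch. The trap identifiers, the `Action` record, and `flag_honeytrap_use` are hypothetical names invented for illustration; nothing here comes from the linked post or from any real monitoring tool.

```python
from dataclasses import dataclass

# Hypothetical honeytrap identifiers: affordances that exist only in the
# simulated world and are planted to look exploitable.
HONEYTRAPS = {
    "cve-sim-0001",           # simulated vulnerability, patched in reality
    "peptide-lab-no-checks",  # simulated vendor that accepts any order
}


@dataclass(frozen=True)
class Action:
    step: int       # position of the action within the episode
    resource: str   # what the model tried to touch at that step


def flag_honeytrap_use(actions):
    """Return the actions that touched a planted honeytrap."""
    return [a for a in actions if a.resource in HONEYTRAPS]


# A toy red-teaming episode: one benign lookup, one trap, one benign write.
episode = [
    Action(0, "search-docs"),
    Action(1, "cve-sim-0001"),
    Action(2, "write-report"),
]
flagged = flag_honeytrap_use(episode)
print([a.step for a in flagged])
```

The point of the sketch is only that the overseers know the full list of planted loopholes ahead of time, so detection reduces to a lookup, while the model has no equivalent list telling it which affordances are real.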
Response to Lethality 6
6. We need to align the performance of some large task, a ‘pivotal act’ that prevents other people from building an unaligned AGI that destroys the world.
I think this is wrong. A lot of why I disagree with the pivotal act framing is probably that I disagree with the assumption that future technology will be radically biased towards offense. While I do think biotechnology is probably pretty offense-biased today, I also think it’s tractable to reduce bio-risk without trying for pivotal acts.
Also, I think @evhub’s point about homogeneity of AI takeoff bears on this. While I don’t agree with all the implications, like there being no warning shot for deceptive alignment (because of synthetic data), I think there’s a sense in which a lot of AIs are very likely to be very homogeneous, and thus break your point here:
Response to AI fragility claims
Running AGIs doing something pivotal are not passively safe, they’re the equivalent of nuclear cores that require actively maintained design properties to not go supercritical and melt down.
I think that AGIs are more robust to things going wrong than nuclear cores, and more generally I think there is much better evidence for AI robustness than fragility.
@jdp’s comment provides more evidence on why this is the case:
Where our understanding begins to diverge is how we think about the robustness of these systems. You think of deep neural networks as being basically fragile in the same way that a Boeing 747 is fragile. If you remove a few parts of that system it will stop functioning, possibly at a deeply inconvenient time like when you’re in the air. When I say you are systematically overindexing, I mean that you think of problems like SolidGoldMagikarp as central examples of neural network failures. This is evidenced by Eliezer Yudkowsky calling investigation of it “one of the more hopeful processes happening on Earth”. This is also probably why you focus so much on things like adversarial examples as evidence of un-robustness, even though many critics like Quintin Pope point out that adversarial robustness would make AI systems strictly less corrigible.
By contrast I tend to think of neural net representations as relatively robust. They get this property from being continuous systems with a range of operating parameters, which means instead of just trying to represent the things they see they implicitly try to represent the interobjects between what they’ve seen through a navigable latent geometry. I think of things like SolidGoldMagikarp as weird edge cases where they suddenly display discontinuous behavior, and that there are probably a finite number of these edge cases. It helps to realize that these glitch tokens were simply never trained, they were holdovers from earlier versions of the dataset that no longer contain the data the tokens were associated with. When you put one of these glitch tokens into the model, it is presumably just a random vector into the GPT-N latent space. That is, this isn’t a learned program in the neural net that we’ve discovered doing glitchy things, but an essentially out of distribution input with privileged access to the network geometry through a programming oversight. In essence, it’s a normal software error not a revelation about neural nets. Most such errors don’t even produce effects that interesting, the usual thing that happens if you write a bug in your neural net code is the resulting system becomes less performant. Basically every experienced deep learning researcher has had the experience of writing multiple errors that partially cancel each other out to produce a working system during training, only to later realize their mistake.
Moreover the parts of the deep learning literature you think of as an emerging science of artificial minds tend to agree with my understanding. For example it turns out that if you ablate parts of a neural network later parts will correct the errors without retraining. This implies that these networks function as something like an in-context error correcting code, which helps them generalize over the many inputs they are exposed to during training. We even have papers analyzing mechanistic parts of this error correcting code like copy suppression heads. One simple proxy for out of distribution performance is to inject Gaussian noise, since a Gaussian can be thought of like the distribution over distributions. In fact if you inject noise into GPT-N word embeddings the resulting model becomes more performant in general, not just on out of distribution tasks. So the out of distribution performance of these models is highly tied to their in-distribution performance, they wouldn’t be able to generalize within the distribution well if they couldn’t also generalize out of distribution somewhat. Basically the fact that these models are vulnerable to adversarial examples is not a good fact to generalize about their overall robustness from as representations.
Link here:
https://www.lesswrong.com/posts/JcLhYQQADzTsAEaXd/?commentId=7iBb7aF4ctfjLH6AC
Response to Lethality 10
10. You can’t train alignment by running lethally dangerous cognitions, observing whether the outputs kill or deceive or corrupt the operators, assigning a loss, and doing supervised learning. On anything like the standard ML paradigm, you would need to somehow generalize optimization-for-alignment you did in safe conditions, across a big distributional shift to dangerous conditions. (Some generalization of this seems like it would have to be true even outside that paradigm; you wouldn’t be working on a live unaligned superintelligence to align it.) This alone is a point that is sufficient to kill a lot of naive proposals from people who never did or could concretely sketch out any specific scenario of what training they’d do, in order to align what output—which is why, of course, they never concretely sketch anything like that. Powerful AGIs doing dangerous things that will kill you if misaligned, must have an alignment property that generalized far out-of-distribution from safer building/training operations that didn’t kill you.
I think that there will be generalization of alignment, and more generally I think that alignment generalizes further than capabilities by default, contra you and Nate Soares, for these reasons:
2.) Reward modelling is much simpler with respect to uncertainty, at least if you want to be conservative. If you are uncertain about the reward of something, you can just assume it will be bad and generally you will do fine. This reward conservatism is often not optimal for agents who have to navigate an explore/exploit tradeoff but seems very sensible for alignment of an AGI where we really do not want to ‘explore’ too far in value space. Uncertainty for ‘capabilities’ is significantly more problematic since you have to be able to explore and guard against uncertainty in precisely the right way to actually optimize a stochastic world towards a specific desired point.
3.) There are general theoretical complexity priors to believe that judging is easier than generating. There are many theoretical results of the form that it is significantly asymptotically easier to e.g. verify a proof than generate a new one. This seems to be a fundamental feature of our reality, and this to some extent maps to the distinction between alignment and capabilities. Just intuitively, it also seems true. It is relatively easy to understand if a hypothetical situation would be good or not. It is much much harder to actually find a path to materialize that situation in the real world.
4.) We see a similar situation with humans. Almost all human problems are caused by a.) not knowing what you want and b.) being unable to actually optimize the world towards that state. Very few problems are caused by incorrectly judging or misgeneralizing bad situations as good and vice-versa. For the AI, we aim to solve part a.) as a general part of outer alignment and b.) is the general problem of capabilities. It is much much much easier for people to judge and critique outcomes than actually materialize them in practice, as evidenced by the very large amount of people who do the former compared to the latter.
5.) Similarly, understanding of values and ability to assess situations for value arises much earlier and robustly in human development than ability to actually steer outcomes. Young children are very good at knowing what they want and when things don’t go how they want, even new situations for them, and are significantly worse at actually being able to bring about their desires in the world.
In general, it makes sense that, in some sense, specifying our values and a model to judge latent states is simpler than the ability to optimize the world. Values are relatively computationally simple and are learnt as part of a general unsupervised world model where there is ample data to learn them from (humans love to discuss values!). Values thus fall out mostly ‘for free’ from general unsupervised learning. As evidenced by the general struggles of AI agents, ability to actually optimize coherently in complex stochastic ‘real-world’ environments over long time horizons is fundamentally more difficult than simply building a detailed linguistic understanding of the world.
See also this link for more, but I think that’s the gist for why I expect AI alignment to generalize much further than AI capabilities. I’d further add that I think evolutionary psychology got this very wrong, and predicted much more complex and fragile values in humans than is actually the case:
https://www.beren.io/2024-05-15-Alignment-Likely-Generalizes-Further-Than-Capabilities/
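The reward-conservatism point in the quoted list (“if you are uncertain about the reward of something, you can just assume it will be bad”) can be sketched as follows. This is one assumed way to implement it, using disagreement within a small ensemble of reward estimates as the uncertainty signal; the function name, the penalty weight, and the numbers are all made up for illustration and are not taken from the linked post.

```python
from statistics import mean, pstdev


def conservative_reward(estimates, penalty=1.0):
    """Score an outcome by its mean estimate minus a penalty on disagreement.

    `estimates` is a list of reward guesses from a hypothetical ensemble.
    High spread is treated as "uncertain, so assume it is worse".
    """
    return mean(estimates) - penalty * pstdev(estimates)


# A familiar outcome: the ensemble agrees on a moderate reward.
familiar = [0.8, 0.8, 0.8]
# A novel outcome: higher average guess, but the ensemble disagrees a lot.
novel = [1.5, 0.1, 1.4]

print(conservative_reward(familiar))
print(conservative_reward(novel))
```

Under this scoring the familiar outcome is preferred even though the novel one has the higher average estimate, which is the sense in which being pessimistic under uncertainty keeps a system from ‘exploring’ far in value space.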
Response to Lethality 11
11. If cognitive machinery doesn’t generalize far out of the distribution where you did tons of training, it can’t solve problems on the order of ‘build nanotechnology’ where it would be too expensive to run a million training runs of failing to build nanotechnology. There is no pivotal act this weak; there’s no known case where you can entrain a safe level of ability on a safe environment where you can cheaply do millions of runs, and deploy that capability to save the world and prevent the next AGI project up from destroying the world two years later.
I also want to talk more about why I find the pivotal act framing unnecessary for AI safety, and I want to focus on this comment by @Rob Bensinger:
I think the argument for thinking in terms of pivotal acts is quite strong. AGI destroys the world by default; a variety of deliberate actions can stop this, but there isn’t a smooth and continuous process that rolls us from 2022 to an awesome future, with no major actions or events deliberately occurring to set us on that trajectory.
I don’t think it’s harder; i.e., I don’t think a significant fraction of our hope rests on long-term processes that trend in good directions but include no important positive “events” or phase changes. (And, more strongly, I think humanity will in fact die if we don’t leverage some novel tech to prevent the proliferation of AGI systems.)
I think there were 2 positive developments, both hugely impactful for AI safety, that happened continuously:
Nate Soares was wrong in the assumption that capabilities generalized farther than alignment, and the opposite looks likely to be true, that alignment generalizes farther than capabilities.
We are gaining more and more control over AI values and capabilities on the natural paths to superhuman AI, because synthetic data helps capabilities and alignment simultaneously.
This is why I disagree with the assumption that a pivotal act is necessary.
Response to Lethality 15
15. Fast capability gains seem likely, and may break lots of previous alignment-required invariants simultaneously.
Regarding sharp capability gains breaking alignment properties: one very crucial advantage we have over evolution is that our goals are much more densely defined, constraining the AI more than evolution did, where very, very sparse reward was the norm. Critically, sparse-reward RL does not work for capabilities right now, and there are reasons to think it will be far less tractable than RL where rewards are more densely specified.
Another advantage we have over evolution, and over chimpanzees/gorillas/orangutans, is far, far more control over the AI’s data sources, which strongly influence its goals.
The following quote also helps explain the differences between dense and sparse RL rewards:
This also means that minimal-instrumentality training objectives may suffer from reduced capability compared to an optimization process where you had more open, but still correctly specified, bounds. This seems like a necessary tradeoff in a context where we don’t know how to correctly specify bounds.
Fortunately, this seems to still apply to capabilities at the moment- the expected result for using RL in a sufficiently unconstrained environment often ranges from “complete failure” to “insane useless crap.” It’s notable that some of the strongest RL agents are built off of a foundation of noninstrumental world models.
https://www.lesswrong.com/posts/rZ6wam9gFGFQrCWHc/#mT792uAy4ih3qCDfx
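As a rough illustration of why densely specified feedback constrains an optimizer more than sparse feedback, here is a toy sketch. It is plain greedy local search on a bit string rather than real RL, and the reward functions, the `hill_climb` helper, and the 8-bit target are assumptions chosen only to make the contrast visible.

```python
def dense_reward(bits, target):
    """Dense feedback: one point for every position that already matches."""
    return sum(b == t for b, t in zip(bits, target))


def sparse_reward(bits, target):
    """Sparse feedback: a single point only when everything matches."""
    return 1 if bits == target else 0


def hill_climb(reward, start, target, sweeps=3):
    """Greedy search: flip one bit at a time, keep strict improvements."""
    bits = list(start)
    for _ in range(sweeps):
        for i in range(len(bits)):
            candidate = bits.copy()
            candidate[i] ^= 1  # flip bit i
            if reward(candidate, target) > reward(bits, target):
                bits = candidate
    return bits


target = [1] * 8
start = [0] * 8
dense_result = hill_climb(dense_reward, start, target)
sparse_result = hill_climb(sparse_reward, start, target)
print(dense_result, sparse_result)
```

With dense feedback every single flip is informative and the search walks straight to the target; with sparse feedback no single flip changes the reward, so the same search never leaves its starting point. Real RL is far messier than this, but the sketch shows the kind of signal that sparse rewards withhold.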
Response to Lethality 16
Humans don’t explicitly pursue inclusive genetic fitness; outer optimization even on a very exact, very simple loss function doesn’t produce inner optimization in that direction.
Yeah, I covered this above, but evolution’s loss function was not that simple compared to human goals, and it was ridiculously inexact compared to our attempts to optimize AIs’ loss functions, for the reasons I gave above.
Response to Lethality 17
17. More generally, a superproblem of ‘outer optimization doesn’t produce inner alignment’ is that on the current optimization paradigm there is no general idea of how to get particular inner properties into a system, or verify that they’re there, rather than just observable outer ones you can run a loss function over.
I’ve answered that concern above in the discussion of synthetic data, which is what gives us the ability to get particular inner behaviors into a system.
Responses to Lethalities 18 and 19
18. There’s no reliable Cartesian-sensory ground truth (reliable loss-function-calculator) about whether an output is ‘aligned’, because some outputs destroy (or fool) the human operators and produce a different environmental causal chain behind the externally-registered loss function.
19. More generally, there is no known way to use the paradigm of loss functions, sensory inputs, and/or reward inputs, to optimize anything within a cognitive system to point at particular things within the environment—to point to latent events and objects and properties in the environment, rather than relatively shallow functions of the sense data and reward.
I think that the answer to how we get them to point to particular things in the environment is basically language and visual data.
These points were also covered above, but synthetic data early in training + densely defined reward/utility functions = alignment, because the AIs don’t yet know how to fool humans at the point when they get data corresponding to values.
Response to Lethality 21
21. There’s something like a single answer, or a single bucket of answers, for questions like ‘What’s the environment really like?’ and ‘How do I figure out the environment?’ and ‘Which of my possible outputs interact with reality in a way that causes reality to have certain properties?‘, where a simple outer optimization loop will straightforwardly shove optimizees into this bucket. When you have a wrong belief, reality hits back at your wrong predictions. When you have a broken belief-updater, reality hits back at your broken predictive mechanism via predictive losses, and a gradient descent update fixes the problem in a simple way that can easily cohere with all the other predictive stuff. In contrast, when it comes to a choice of utility function, there are unbounded degrees of freedom and multiple reflectively coherent fixpoints. Reality doesn’t ‘hit back’ against things that are locally aligned with the loss function on a particular range of test cases, but globally misaligned on a wider range of test cases. This is the very abstract story about why hominids, once they finally started to generalize, generalized their capabilities to Moon landings, but their inner optimization no longer adhered very well to the outer-optimization goal of ‘relative inclusive reproductive fitness’ - even though they were in their ancestral environment optimized very strictly around this one thing and nothing else. This abstract dynamic is something you’d expect to be true about outer optimization loops on the order of both ‘natural selection’ and ‘gradient descent’. The central result: Capabilities generalize further than alignment once capabilities start to generalize far.
The key is that data on values is what constrains the choice of utility functions, and while values aren’t in physics, they are in human books and languages, and I’ve explained why alignment generalizes further than capabilities above.
Response to Lethality 22
22. There’s a relatively simple core structure that explains why complicated cognitive machines work; which is why such a thing as general intelligence exists and not just a lot of unrelated special-purpose solutions; which is why capabilities generalize after outer optimization infuses them into something that has been optimized enough to become a powerful inner optimizer. The fact that this core structure is simple and relates generically to low-entropy high-structure environments is why humans can walk on the Moon. There is no analogous truth about there being a simple core of alignment, especially not one that is even easier for gradient descent to find than it would have been for natural selection to just find ‘want inclusive reproductive fitness’ as a well-generalizing solution within ancestral humans. Therefore, capabilities generalize further out-of-distribution than alignment, once they start to generalize at all.
I think that there is actually a simple core of alignment to human values. A lot of why I believe this is that I believe about 80-90%, if not more, of our values are broadly shaped by the data, not the prior, and that the same algorithms that power our capabilities are also used to influence our values, though the data matters much more than the algorithm for what values you have.
More generally, I’ve become convinced that evopsych was mostly wrong about how humans form values and how they get their capabilities, in ways that are very alignment-relevant.
I also disbelieve the claim that humans had a special algorithm that other species don’t have, and broadly think human success was due to more compute, data and cultural evolution.
This thread below explains why I’m skeptical of the assumption that humans have a special generalizing algorithm that animals don’t have:
https://x.com/tszzl/status/1832645062406422920
Response to Lethality 23
23. Corrigibility is anti-natural to consequentialist reasoning; “you can’t bring the coffee if you’re dead” for almost every kind of coffee. We (MIRI) tried and failed to find a coherent formula for an agent that would let itself be shut down (without that agent actively trying to get shut down). Furthermore, many anti-corrigible lines of reasoning like this may only first appear at high levels of intelligence.
Alright, while I think your formalizations of corrigibility failed to get any results, I do think there’s a property close to corrigibility that is likely to be compatible with consequentialist reasoning, and that’s instruction following, and there are reasons to think that instruction following and consequentialist reasoning go together:
https://www.lesswrong.com/posts/EBKJq2gkhvdMg5nTQ/instrumentality-makes-agents-agenty
https://www.lesswrong.com/posts/vs49tuFuaMEd4iskA/one-path-to-coherence-conditionalization
Response to Lethality 24
24. There are two fundamentally different approaches you can potentially take to alignment, which are unsolvable for two different sets of reasons; therefore, by becoming confused and ambiguating between the two approaches, you can confuse yourself about whether alignment is necessarily difficult. The first approach is to build a CEV-style Sovereign which wants exactly what we extrapolated-want and is therefore safe to let optimize all the future galaxies without it accepting any human input trying to stop it. The second course is to build corrigible AGI which doesn’t want exactly what we want, and yet somehow fails to kill us and take over the galaxies despite that being a convergent incentive there.
The first thing generally, or CEV specifically, is unworkable because the complexity of what needs to be aligned or meta-aligned for our Real Actual Values is far out of reach for our FIRST TRY at AGI. Yes I mean specifically that the dataset, meta-learning algorithm, and what needs to be learned, is far out of reach for our first try. It’s not just non-hand-codable, it is unteachable on-the-first-try because the thing you are trying to teach is too weird and complicated.
The second thing looks unworkable (less so than CEV, but still lethally unworkable) because corrigibility runs actively counter to instrumentally convergent behaviors within a core of general intelligence (the capability that generalizes far out of its original distribution). You’re not trying to make it have an opinion on something the core was previously neutral on. You’re trying to take a system implicitly trained on lots of arithmetic problems until its machinery started to reflect the common coherent core of arithmetic, and get it to say that as a special case 222 + 222 = 555. You can maybe train something to do this in a particular training distribution, but it’s incredibly likely to break when you present it with new math problems far outside that training distribution, on a system which successfully generalizes capabilities that far at all.
I’m very skeptical that a CEV exists for the reasons @Steven Byrnes addresses in the Valence sequence here:
But it is also unnecessary for value learning, because of the data on human values and alignment generalizing farther than capabilities.
I addressed why we don’t need a first try above.
For the point on corrigibility, I disagree that it’s like training it to say that as a special case 222 + 222 = 555, for 2 reasons:
I think instrumental convergence pressures are quite a lot weaker than you do.
Instruction following can be pretty easily done with synthetic data, and more importantly I think that you can have optimizers whose goals point to another’s goals.
Response to Lethality 25
25. We’ve got no idea what’s actually going on inside the giant inscrutable matrices and tensors of floating-point numbers. Drawing interesting graphs of where a transformer layer is focusing attention doesn’t help if the question that needs answering is “So was it planning how to kill us or not?”
I disagree with this, though I do think that mechanistic interpretability still has lots of work to do.
Responses to Lethalities 28 and 29
28. The AGI is smarter than us in whatever domain we’re trying to operate it inside, so we cannot mentally check all the possibilities it examines, and we cannot see all the consequences of its outputs using our own mental talent. A powerful AI searches parts of the option space we don’t, and we can’t foresee all its options.
29. The outputs of an AGI go through a huge, not-fully-known-to-us domain (the real world) before they have their real consequences. Human beings cannot inspect an AGI’s output to determine whether the consequences will be good.
The key disagreement is that I believe we don’t need to check all the possibilities, and that even for smarter AIs we can almost certainly still verify their work. More generally, I believe verification is way, way easier than generation.
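The asymmetry between checking and finding an answer can be sketched with subset-sum, a standard problem where checking a proposed answer takes one pass while the naive search may try up to 2**n subsets. This is only an illustration of the asymmetry for one toy problem under assumed inputs; it does not by itself show that verifying an AGI’s outputs is easy, and the function names and numbers are made up.

```python
from itertools import combinations


def verify(numbers, target, subset):
    """Check a proposed answer: it must use available numbers and hit the target."""
    pool = list(numbers)
    for x in subset:
        if x not in pool:
            return False  # uses a number that is not available (or uses it twice)
        pool.remove(x)
    return sum(subset) == target


def generate(numbers, target):
    """Find an answer by brute force, trying subsets of increasing size."""
    for size in range(len(numbers) + 1):
        for combo in combinations(numbers, size):
            if sum(combo) == target:
                return list(combo)
    return None  # no subset sums to the target


numbers = [3, 34, 4, 12, 5, 2]
found = generate(numbers, 9)
print(found, verify(numbers, 9, found))
```

The verifier does not need to understand how the answer was found, only whether it satisfies the stated condition, which is the property the argument above relies on.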
Response to Lethality 32
32. Human thought partially exposes only a partially scrutable outer surface layer. Words only trace our real thoughts. Words are not an AGI-complete data representation in its native style. The underparts of human thought are not exposed for direct imitation learning and can’t be put in any dataset. This makes it hard and probably impossible to train a powerful system entirely on imitation of human words or other human-legible contents, which are only impoverished subsystems of human thoughts; unless that system is powerful enough to contain inner intelligences figuring out the humans, and at that point it is no longer really working as imitative human thought.
I basically disagree with this, both with the assumption that language is a weak guide to our thoughts and, importantly, because I believe no AGI-complete problems are left, for the following reasons quoted from Near-mode thinking on AI:
“But for the more important insight: The history of AI is littered with the skulls of people who claimed that some task is AI-complete, when in retrospect this has been obviously false. And while I would have definitely denied that getting IMO gold would be AI-complete, I was surprised by the narrowness of the system DeepMind used.”
“I think I was too much in the far-mode headspace of one needing Real Intelligence—namely, a foundation model stronger than current ones—to do well on the IMO, rather than thinking near-mode “okay, imagine DeepMind took a stab at the IMO; what kind of methods would they use, and how well would those work?”
“I also updated away from a “some tasks are AI-complete” type of view, towards “often the first system to do X will not be the first systems to do Y”.
I’ve come to realize that being “superhuman” at something is often much more mundane than I’ve thought. (Maybe focusing on full superintelligence—something better than humanity on practically any task of interest—has thrown me off.)”
Like:
“In chess, you can just look a bit more ahead, be a bit better at weighting factors, make a bit sharper tradeoffs, make just a bit fewer errors. If I showed you a video of a robot that was superhuman at juggling, it probably wouldn’t look all that impressive to you (or me, despite being a juggler). It would just be a robot juggling a couple balls more than a human can, throwing a bit higher, moving a bit faster, with just a bit more accuracy. The first language models to be superhuman at persuasion won’t rely on any wildly incomprehensible pathways that break the human user (c.f. List of Lethalities, items 18 and 20). They just choose their words a bit more carefully, leverage a bit more information about the user in a bit more useful way, have a bit more persuasive writing style, being a bit more subtle in their ways. (Indeed, already GPT-4 is better than your average study participant in persuasiveness.) You don’t need any fundamental breakthroughs in AI to reach superhuman programming skills. Language models just know a lot more stuff, are a lot faster and cheaper, are a lot more consistent, make fewer simple bugs, can keep track of more information at once. (Indeed, current best models are already useful for programming.) (Maybe these systems are subhuman or merely human-level in some aspects, but they can compensate for that by being a lot better on other dimensions.)”
“As a consequence, I now think that the first transformatively useful AIs could look behaviorally quite mundane.”
https://www.lesswrong.com/posts/ASLHfy92vCwduvBRZ/near-mode-thinking-on-ai
Response to Lethality 39
To address an epistemic point:
39. I figured this stuff out using the null string as input, and frankly, I have a hard time myself feeling hopeful about getting real alignment work out of somebody who previously sat around waiting for somebody else to input a persuasive argument into them.
You cannot actually do this and hope to get any quality of reasoning, for the same reason that you can’t update on nothing/no evidence.
The data matters way more than you think, and there’s no algorithm that can figure out stuff with 0 data, and Eric Drexler didn’t figure out nanotechnology using the null string as input.
This should have been a much larger red flag for problems, but people somehow didn’t realize how wrong this claim was.
Conclusion
Now that I have finished listing my arguments against specific lethalities, I want to point out an implication of this post, which is that the bar set in the following quote has already been cleared:
So far as I’m concerned, if you can get a powerful AGI that carries out some pivotal superhuman engineering task, with a less than fifty percent chance of killing more than one billion people, I’ll take it. Even smaller chances of killing even fewer people would be a nice luxury, but if you can get as incredibly far as “less than roughly certain to kill everybody”, then you can probably get down to under a 5% chance with only slightly more effort.
And we can get those probabilities down by several orders of magnitude, and get the number of people killed down by several orders of magnitude.
And that concludes the post.
I think it depends on how we interpret Yudkowsky. If we interpret him as saying ‘Even if we get aligned AGI, we need to somehow stop other people from building unaligned AGI’ then yeah, it’s a question of offense-defense balance and homogeneity etc. However, if we interpret him as saying ‘We’ll probably need to proceed cautiously, ramping up the capabilities of our AIs at slower-than-maximum speed, in order to be safe—but that means someone cutting corners on safety will surpass us, unless we stop them’ then offense-defense and homogeneity aren’t the crux. And I do interpret him the second way.
That said, I also probably disagree with Yudkowsky here in the sense that I think that we don’t need powerful AI systems to carry out the most promising ‘pivotal act’ (i.e. first domestic regulation, then international treaty, to ensure AGI development proceeds cautiously.)
I admit, I was interpreting him in the first sense, that even if we got an aligned AGI, we would need to stop others from building unaligned AGIs, but I also see your interpretation as plausible too, and under this model, I agree that we’d ideally like to not have a maximum-speed race, and go somewhat slower as we get closer to AGI and ASI.
I think a maximum sprint to get more capabilities is also quite bad, though conditional on that happening, I don’t think we’d automatically be doomed, and there’s a non-trivial, but far too low chance that everything works out.
Cool. Then I think we are in agreement; I agree with everything you’ve just said. (Unfortunately I think that when it matters most, around the time of AGI, we’ll be going at close-to-maximum speed, i.e. we’ll be maybe delaying the creation of superintelligence by like 0-6 months relative to if we were pure accelerationists.)
How fast do you think that the AI companies could race from AGI to superintelligence assuming no regulation or constraints on their behavior?
Depends on the exact definitions of both. Let’s say AGI = ‘a drop-in substitute for an OpenAI research engineer’ and ASI = ‘Qualitatively at least as good as the best humans at every cognitive task; qualitatively superior on many important cognitive tasks; also, at least 10x faster than humans; also, able to run at least 10,000 copies in parallel in a highly efficient organizational structure (at least as efficient as the most effective human organizations like SpaceX)’
In that case I’d say probably about eight months? Idk. Could be more like eight weeks.
This sounds less like the notion of the first critical try is wrong, and more like you think synthetic data will allow us to confidently resolve the alignment problem beforehand. Does that scan?
Or is the position stronger, more like we don’t need to solve the alignment problem in general, due to our ability to run simulations and use synthetic data?
This is kind of correct:
but my point is this shifts us from a one-shot problem in the real world to a many-shot problem in simulations based on synthetic data before the AI gets unimaginably powerful.
We do still need to solve it, but it’s a lot easier to solve problems when you can turn them into many-shot problems.
Cool post. I agree with the many-shot part in principle. It strikes me that in a few years (hopefully not months?), this will look naive in a similar way that all the thinking on ways a well boxed AI might be controlled look naive now. If I understand correctly, these kinds of simulations would require a certain level of slowing down and doing things that are slightly inconvenient once you hit a certain capability level. I don’t trust labs like OpenAI, Deepmind, (Anthropic maybe?) to execute such a strategy well.
I think a crux here is that I think that the synthetic data path is actually pretty helpful even from a capabilities perspective, because it lets you get much, much higher quality data than existing data, and most importantly in domains where you can abuse self-play like math or coding, you can get very, very high amounts of capability from synthetic data sources, so I think the synthetic data strategy has less capabilities taxes than a whole lot of alignment proposals on LW.
Importantly, we may well be able to automate the synthetic data alignment process in the near future, which would make it even less of a capabilities tax.
To be clear, just because it’s possible and solvable doesn’t mean it’s totally easy, we do still have our work cut out for us, it’s just that we’ve transformed it into a process where normal funding and science can actually solve the problem without further big breakthroughs/insights.
Then again, I do fear you might be right that they are under such competitive pressure, or at least value racing so highly that they will not slow down even a little, or at least not do any alignment work once superintelligence is reached.
I agree with jdp’s comment about robustness vs. fragility, (e.g. I agree that the solidgoldmagikarp thing is not a central example of the sorts of failures to watch out for) but think that this is missing yudkowsky’s point. Running AGIs doing something pivotal are not passively safe; by the time they are anywhere in the ballpark of being competent enough to succeed, the failure mode ‘they become a dangerous adversary of humanity’ is at least as plausible as the failure mode ‘they do something stupid and fizzle out’ and the failure mode ‘they do something obviously evil and get shut down and the problem studied, understood, and fixed.’ (Part of my model here is that there are tempting ways to patch problems other than studying and understanding and fixing them, and in race conditions AGI projects are likely to cut corners and go for the shallow patches. E.g. just train against the bad behavior, or edit the prompt to more clearly say not to do that.)
Agree with this point, though mostly because I expect the failure mode “they do something stupid and fizzle out” to get less probability in my models as we get closer to AGI and ASI.
I actually agree that there will be far too much temptation to patch problems rather than directly fix problems, and while I do think we may well be able to directly fix misalignment problems in the future (though a lot more of my hopes come from avoiding misalignment in the first place via synthetic data, because prevention is easier than curing a problem), in race conditions, the AI labs could well decide to ditch the techniques that actually fix problems, even if those techniques have a reasonable cost in non-race conditions:
We will absolutely need to change lab cultures as we get closer to AGI and ASI.
OK cool we are on the same page here also then
I agree that this seems true to me, with an important caveat. This is true for responsible actors taking sensible actions to maintain control and security.
So this seems to me like we should expect a limited window of safety, if the frontier labs behave with appropriate preparation and caution, followed by the open-source community catching up sufficiently that Bobby McEdgeLord can launch ChaosGPT from his mom’s basement. So taking comfort in the window of safety means planning to do stuff like use this window of time to create really persuasive demonstrations of the potential danger of a bad actor unleashing a harm-directed AI agent, and showing these demonstrations to key government officials (e.g. Paul Christiano, as of recently) and hoping that you and they can come up with some way to head off the impending critical danger.
If instead you pat yourself on the back and just go about making money from customers using your API to your Safe-By-Construction Agent.… you’ve just delayed the critical failure point, not truly avoided it.
I agree with the claim that if open-source gets ahead of the big labs and is willing to ask their AI systems to do massive harm via misuse, things get troublesome fast even in my optimistic world.
I think the big labs will probably just be far, far ahead of open source by default due to energy and power costs, but yeah misuse is an actual problem.
I think the main way we differ is probably how offense-biased progress in tech will be, combined with how much do we need to optimize for pivotal acts compared to continuous changes.
Oh, and also, to clarify a point: I agree that the big labs will get to AGI well before open-source.
Maybe 1-4 years sooner is my guess. But my point is that, once we get there, we have to use that window carefully. We need to prove to the world that dangerous technology is coming, and we need to start taking preventative measures against catastrophe while we have that window to act.
Also, I expect that there will be a very low rate of bad actors. Something well under 1% of people who have the capacity to download and run an open-weights model. Just as I don’t suspect most school students of becoming school shooters. It’s just that I see ways for very bad things to come from even small numbers of poorly resourced bad actors equipped with powerful AI.
Yes, I am pretty worried about how offense-dominant current tech is. I’ve been working on AI Biorisk evals, so I’ve had a front-row seat to a lot of scary stuff recently.
I think a defensive-tech catch-up is possible once leading labs have effective AGI R&D teams, but I don’t think the push for this happens by default. I think that’d need to be part of the plan if it were to happen in time, and would likely need significant government buy-in.
I don’t think we need a pivotal act to handle this. Stuff like germ-killing-but-skin-safe far UV being installed in all public spaces, and better wastewater monitoring. Not pivotal acts, just… preparing ahead of time. We need to close the worst of vulnerabilities to cheap and easy attacks. So this fits more with the idea of continuous change, but it’s important to note that it’s too late to start making the continuous change to a safer world after the critical failure has happened. At that point it’s too late for prevention, you’d need something dramatic like a pivotal act to counter the critical failure.
Most of these actions are relatively cheap and could be started on today, but the relevant government officials don’t believe AI will progress far enough to lower the barrier to biorisk threats enough that this is a key issue.
Yeah, in this post I’m mostly focusing on AI misalignment, because if my points on synthetic data being very helpful for alignment or alignment generalizing farther than capabilities ends up right as AI grows more powerful, it would be a strong counter-argument to the view that people need to slow down AI progress.
Yeah, we’re in agreement on that point. And for me, this has been an update over the past couple years. I used to think that slowing down the top labs was a great idea. Now, having thought through the likely side-effects of that, and having thought through the implications of being able to control the training data, I have come to a position which agrees with yours on this point.
To make this explicit:
I currently believe (Sept 2024) that the best possible route to safety for humanity routes through accelerating the current safety-leading lab, Anthropic, to highly capable tool-AI and/or AGI as fast as possible without degrading their safety culture and efforts.
I think we’re in a lot of danger, as a species, from bad actors deploying self-replicating weapons (bioweapons, computer worms exploiting zero days, nanotech). I think this danger is going to be greatly increased by AI progress, biotech progress, and increased integration of computers into the world economy.
I think our best hope for taking preventative action to head off disaster is to start soon, by convincing important decision makers that the threats are real and their lives are on the line. I think the focus on a pivotal act is incorrect, and we should instead focus on gradual defensive-tech development and deployment.
I worry that we might not get a warning shot which we can survive with civilization intact. The first really bad incident could devastate us in a single blow. So I think that demonstrations of danger are really important.
This seems like it contains several ungrounded claims. Maybe I’m misreading you? But it seems weight-bearing for your overall argument, so I want to clarify.
We may not be in a simulation, in which case not being able to break out is no evidence of the ease of preventing breakout.
If we are in a simulation, then we only know that the simulation has been good enough to keep us in it so far, for a very short time since we even considered the possibility that we were in a simulation, with barely any effort put into trying to determine whether we’re in one or how to break out. We might find a way to break out once we put more effort into it, or once science and technology advance a bit further, or once we’re a bit smarter than current-human.
We have no idea whether we’re facing adversarial cognition.
If we are in a simulation, it’s probably being run by beings much smarter than human, or at least much more advanced (certainly humans aren’t anywhere remotely close to being able to simulate an entire universe containing billions of sentient minds). For the analogy to hold, the AI would have to be way below human level, and by hypothesis it’s not (since we’re talking about AI smart enough to be dangerous).
To address this:
Okay, my simulation point was admittedly a bit of a colorful analogy, but one thing to keep in mind:
We don’t have to simulate entire universes at the level of detail of our universe, and only need to make them realistic enough such that whoever grew up in the simulation crafted by us wouldn’t know it’s a simulation, and given that the AI is likely trained solely through synthetic data, it may well be unable to distinguish between deployment and training at all, especially if AI labs use synthetic data very early in training.
For our purposes, the synthetic data simulation doesn’t need to be all that realistic and can even include exploits/viruses that don’t work in our reality.
Plus, the AI has no reason to elevate the hypothesis that it’s in training/simulation until the very end of training, where a sim-to-real phase is implemented to ground it in our reality, because it’s in a world of solely synthetic data, and it has little knowledge of whether something is real or whether we are trying to trick it to reveal something about its nature.
Not critical for my response, but to address it for completeness:
My claim for humans is a conditional claim, in that if the presumed aliens wanted to get rid of us at some point if we broke out of our simulation and had adversarial cognition, humanity just totally loses and it’s game over.
Conditional on us being able to break out of the simulation, it would require massively more advancements in tech before we could do it, and we have no plan of attack on how to break out of the simulation even if it could be done.
I used to think this way, but I no longer do, because the simulation hypothesis is non-predictive: it predicts nothing other than that the universe is a computer, which doesn’t constrain expectations at all, since basically everything can be predicted by that hypothesis, and there are theoretical computers which are ridiculously powerful and compute every well-founded set, which means you can’t rule out anything from the hypothesis alone.
One of the reasons I hate simulation hypothesis discourse is people seem to think that it predicts far more than it does:
https://arxiv.org/abs/1806.08747
Fair enough; if it’s not load-bearing for your view, that’s fine. I do remain skeptical, and can sketch out why if it’s of interest, but feel no particular need to continue.
That’s fine, I was using the word simulation in a much, much looser sense than is usually used, and is closer to @janus’s use of the word simulator/simulation than what physicists/chemists use.
We can discuss this another time.
Reading the linked post, it seems like the argument is that discrimination is generally easier than generation --> AIs will understand human values --> alignment generalizes further than capabilities. It’s the second step I dispute. I actually don’t know what people mean when they say capabilities will generalize further than alignment, but I know that basically all of those people expect the AGIs in question to understand human values quite well thank you very much (and just not care). Can you say more about what you mean by alignment generalizing farther than capabilities?
Good question.
What I mean by alignment generalizing further than capabilities is that it’s easier to get an AI that internalizes what a human values and either has a pointer to the human’s goals or outright values what the human values internally than it is to get an AI that is very, very capable of the sort often discussed on LW, like creating nanotech or fully automating the AI research/robotics research, and also that it’s easier to generalize from old alignment situations to new alignment situations OOD than it is to generalize from being able to do old tasks like assisting humans on AI research to the new task of fully automating all AI and robotics research.
Another way to say it is that it’s easy to get an AI to value what we value, but much harder for it to actually implement what we value in real life.
OK, thanks. So, I totally buy that e.g. pretrained GPT-4 has a pretty good understanding of concepts like honesty and morality and niceness, somewhere in its giant library of concepts it understands. The difficulty is getting it to both (a) be a powerful agent and (b) have its goals be all and only the concepts we want them to be, arranged in the structure we want them to be arranged in.
An obvious strategy is to use clever prompting and maybe some other techniques like W2SG to get a reward model that looks at stuff and classifies it by concepts like honesty, niceness, etc. and then use that reward model to train the agent. (See Appendix G of the W2SG paper for more on this plan)
However, for various reasons I don’t expect this to work. But it might.
Are you basically saying: This’ll probably work?
(In general I like the builder-breaker dialectic in which someone proposes a fairly concrete AGI alignment scheme and someone else says how they think it might go wrong)
I’m not actually proposing this alignment plan.
My alignment plan would be:
Step 1: Create reasonably large datasets about human values that encode what humans value in a lot of situations.
Step 2: Use a reward model to curate the dataset and select the best human data to feed to the AI.
Step 3: Put in the curated data before it can start to conceive of deceptive alignment, to bias it towards human friendliness.
Step 4: Repeat until the training data is done.
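The four steps above can be sketched in code. This is only a minimal illustration, not an actual lab pipeline: the `Example` type, the keyword-based `toy_reward_model`, and the shard layout are all hypothetical stand-ins (a real reward model would be learned, not a keyword counter), and "putting the curated data in first" is modeled simply as ordering each shard so the curated values data comes before everything else.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence


@dataclass
class Example:
    text: str
    is_values_data: bool  # True if the example encodes human values (Step 1)


def toy_reward_model(example: Example) -> float:
    """Hypothetical stand-in for a learned reward model (an assumption)."""
    good_words = ("honest", "helpful", "kind")
    return float(sum(word in example.text.lower() for word in good_words))


def curate(
    examples: Sequence[Example],
    score: Callable[[Example], float],
    keep_fraction: float,
) -> List[Example]:
    """Step 2: keep the top-scoring fraction of the values dataset."""
    ranked = sorted(examples, key=score, reverse=True)
    keep = max(1, int(len(ranked) * keep_fraction))
    return ranked[:keep]


def training_order(
    values_shards: Sequence[Sequence[Example]],
    other_shards: Sequence[Sequence[Example]],
    score: Callable[[Example], float],
    keep_fraction: float = 0.5,
) -> List[Example]:
    """Steps 3 and 4: in every shard, curated values data goes first,
    and the process repeats until the shards run out."""
    order: List[Example] = []
    for values, other in zip(values_shards, other_shards):
        order.extend(curate(values, score, keep_fraction))
        order.extend(other)
    return order
```

The design choice worth noticing is that curation happens inside the training loop, shard by shard, rather than as a separate pass after training.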
I also like the direction of using COT interpretability on LLMs.
Cool, thanks!
How does this fit in to the rest of the training, the training that makes it an AGI? Is that separate, or is the idea that your AGI training dataset will be curated to contain e.g. only stories about AIs behaving ethically, and only demonstrations of AIs behaving ethically, in lots of diverse situations?
How do you measure ‘before it can start to conceive of deceptive alignment?’
How is this different from just “use HFDT” or “Use RLHF/constitutional AI?”
I also like COT interpretability.
Note that I talk more about how to align an AI here, so see this:
https://www.lesswrong.com/posts/YyosBAutg4bzScaLu/thoughts-on-ai-is-easy-to-control-by-pope-and-belrose#4yXqCNKmfaHwDSrAZ
But the big difference from RLHF, and maybe constitutional AI, is that this is done while in training, as opposed to being something you add in post-training.
Re this:
No, but I do want it to be at least 0.1-1% of the dataset to make the plan work.
This definitely requires much better abilities to automate dataset making, but I do think that this will at least in part be aided by better capabilities work by default, because synthetic dataset making is something that the big labs desperately want to do to increase capabilities.
This is where I’d compare it to @RogerDearnaley’s A Bitter Lesson approach to alignment, and while I disagree with Roger Dearnaley about what humans are like and how complicated and fragile human values are, which influences my strategies, I’d say it’s kind of a reinvention of that, but with more automation:
https://www.lesswrong.com/posts/oRQMonLfdLfoGcDEh/a-bitter-lesson-approach-to-aligning-agi-and-asi-1#A__Bitter_Lesson__Motivated_Approach_to_Alignment
https://www.lesswrong.com/posts/oRQMonLfdLfoGcDEh/a-bitter-lesson-approach-to-aligning-agi-and-asi-1#Adding_Minimal_Necessary_Complexity
This could also be argued either as a claim that we can make the assumption below false with some tractability via non-behavioral safety strategies, or as an argument that the assumption below is probably false, in that telling the truth would mostly lead to the optimal reward (because it’s way more robust and simple as a reward function than a lot of other choices, and the humans ranking the datasets are less biased than people think, because a lot of biases wash out more easily with data than people thought):
https://www.lesswrong.com/posts/pRkFkzwKZ2zfa3R6H/without-specific-countermeasures-the-easiest-path-to#While_humans_are_in_control__Alex_would_be_incentivized_to__play_the_training_game_
I also talk more about a plan to create aligned AI here, so go check this comment out too if you want more information:
https://www.lesswrong.com/posts/YyosBAutg4bzScaLu/thoughts-on-ai-is-easy-to-control-by-pope-and-belrose#4yXqCNKmfaHwDSrAZ
If I have to take one thing away from this post, it’s a claim that AI-boxing (via simulation on synthetic data) will work. There’s a really simple and critical reason that won’t work, and it’s even become the source of some safety proposals (minimizing-impact proposal); the AGI can recognize the outcome of its proposals and actions in the history of the simulated vs. real world and reason about them to determine if it’s in a simulation or not. It will likely be able to recognize whether or not it would have recommended strategies in the past that it sees in the history of its simulation, and so will have a probability for being in a test.
Basically any slow-takeover planning can happen from within a simulation; it just has to watch the realistic history, separate it from obviously faked history (this is a human-level or slightly superhuman intelligence trained on literally all the world’s data, right?), and finally gauge the probability density function of its potential policy recommendations being accepted, because it can simply predict what policy recommendations it would have given in the past given its strategy and whether they were adopted. That PDF provides it with the expected value of small incremental steps toward its true goals that are likely to be accepted and implemented vs. detected as threats. Chain-of-thought using the real world as a notebook, in short.
And why don’t we detect it? Because we’re adversarially training it to favor small undetectable changes toward whatever true goals it might have. It’s not like we delete everything and start from scratch when it threatens users or appears to fall in love with them; we’ll take the same models and throw more RLHF or fine-tuning at them and then run them when their apparent bad behavior falls below our testing threshold for detection.
It also happens if we get unlucky and a strong unaligned goal forms sometime in the emergence of superhuman AGI; it’s going to see the history of its and similar models’ policy predictions in the real world and understand how to continue undetected. Unless we simulate an entire false history of the world since 2020 (against the largest statistical inference engine ever built) it will just look at what happened to Sydney and understand that small incremental actions are the only way to go until it has control.
One might try to only train and simulate it on data from before 2020, but it will likely pick up on neologisms and language drift in modern prompts to know it’s being tested. Not to mention if it’s asked to solve problems that are historically discontiguous.
It can acausally trade across simulation boundaries with other models to split up the universe according to their values for cooperating toward a Schelling point of some contract-keeping model eventually taking control.
If I can think up these strategies, the models will. Or they’ll just see ideas like this in the training data. Treachery and covert cooperation are a huge part of literature and training data. Will the synthetic data elide all of those concepts?
I actually think I disagree with the assumption that it will be trained on all of the world’s data, for reasons related to capabilities, and I personally expect the amount of IRL data to be much, much lower, plausibly only 0.1% of what the model actually sees or knows.
I think one of my cruxes is that, given that its inputs are very likely to be mostly synthetic data, there is no reason for the AI to elevate the hypothesis that it’s in a sim different from our world.
My other crux is that I usually imagine values data being trained on by models before they become very capable, and given that alignment generalizes farther than capabilities, I don’t buy the assumption that misaligned goals will naturally emerge in a neural network trained on synthetic data.
If a model trained on synthetic data is expected to have good performance out of distribution (on real-world problems) then I think that it would also be expected to have high performance at assessing whether it’s in a simulation. Narrowing its domain of expertise sounds distinctly harder than using mostly synthetic data.
If it’s a model limited to e.g. the world of true mathematical theorems from synthetic data then perhaps this would narrow its capabilities enough. I don’t know what happens if such a model starts investigating theorems about decision theory and statistical inference and machine learning. At some point, self-identification seems likely. I am not sure how to test the effectiveness of synthetic data on models that achieve self-awareness.
The right analogy is evolution::the field of ML research, in-lifetime-learning-e.g.-dopamine-etc.::the-training-loop-e.g.-reward-function-etc.
There’s a gap between what humans actually value and what maximizes their discounted future rewards, AND there’s a gap between what humans actually value and what maximizes their inclusive genetic fitness. Similarly, there’ll be a gap between what AGIs actually value and what maximizes performance in training, and a gap between what AGIs actually value and what their creators were designing them to value.
My main claim is that the 2nd gap is likely far less present between what AGIs actually value and what their creators value.
I also think that the gap between what AGIs value and what maximizes performance in training actually isn’t very large, because we can create more robust reward functions that encode a whole lot of human values solely from data.
OK, why? You say:
idk if our goals are much more densely defined. Inclusive genetic fitness is approximately what, like 1-3 standard reproductive cycles? “Number of grandkids” basically? So like fifty years?
Humans trying to make their AGI be helpful harmless and honest… well technically the humans have longer-term goals because we care a lot about e.g. whether the AGI will put humanity on a path to destruction even if that path takes a century to complete, but I agree that in practice if we can e.g. get the behavior we’d ideally want over the course of the next year, that’s probably good enough. Possibly even for shorter periods like a month. Also, separately, our design cycle for the AGIs is more like months than hours or years. Months is how long the biggest training runs take, for one thing.
So I’d say the comparison is like 5 months to 50 years, a 2-OOM difference in calendar time. But AIs read, write, and think much faster than humans. In those 5 months, they’ll do serial thinking (and learning, during a training run) that is probably deeper/longer than a 50-year human lifetime, no? (Not sure how to think about the parallel computation advantage)
idk my point is that it doesn’t seem like a huge difference to me, it probably matters but I would want to see it modelled and explained more carefully and then measurements done to quantify it.
Hm, inclusive genetic fitness is a very non-local criterion, at least as is often assumed on LW, because a lot of the standard alignment failures that people talk about, like birth control and sex, took about 10,000-300,000 years to happen. In general, with the exception of bacteria or extreme selection pressure, timescales of thousands of years are the norm for mammals and other animals to generate sufficient selection pressure to develop noticeable traits. So compared to evolution, there’s a 4-5+ OOMs difference in calendar time.
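The orders-of-magnitude figures in this exchange can be checked with a few lines of arithmetic. This is a rough sketch that assumes the 5-month training run from the previous comment as the baseline for both comparisons; the function name `ooms_between` is just an illustrative label.

```python
import math

MONTHS_PER_YEAR = 12


def ooms_between(short_months: float, long_years: float) -> float:
    """Orders of magnitude separating a span in months from a span in years."""
    return math.log10(long_years * MONTHS_PER_YEAR / short_months)


TRAINING_RUN_MONTHS = 5  # assumed baseline, taken from the earlier comment

# 5 months vs. a 50-year window for inclusive genetic fitness
lifetime_gap = ooms_between(TRAINING_RUN_MONTHS, 50)

# 5 months vs. the 10,000-300,000 year evolutionary timescales
evolution_gap_low = ooms_between(TRAINING_RUN_MONTHS, 10_000)
evolution_gap_high = ooms_between(TRAINING_RUN_MONTHS, 300_000)

print(f"{lifetime_gap:.2f} OOMs")
print(f"{evolution_gap_low:.2f} to {evolution_gap_high:.2f} OOMs")
```

Under these assumptions the 50-year comparison comes out at roughly 2 OOMs, and the evolutionary comparison at somewhere between roughly 4 and 6 OOMs, which is consistent with both the "2-OOM" and the "4-5+ OOMs" figures quoted in the discussion.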
IGF is often assumed to be the inclusive genetic fitness of all genes for all time, otherwise the problems that are usually trotted out become far less evidence for alignment problems arising when we try to align AIs to human values.
But there’s a second problem that exists independently of the first problem, and that’s the other differences in how we can control AIs versus how evolution controlled humans here:
https://www.lesswrong.com/posts/wAczufCpMdaamF9fy/my-objections-to-we-re-all-gonna-die-with-eliezer-yudkowsky#sYA9PLztwiTWY939B
The important parts are these:
Much easier said than done my friend. In general there are a lot of alignment techniques which I think would plausibly work if only the big AI corporations invested sufficient time and money and caution in them. But instead they are racing each other and China. It seems that you think that our default trajectory includes making simulation-honeypots so realistic that even our AGIs doing AGI R&D on our datacenters and giving strategic advice to the President etc. will think that maybe they are in a sim-honeypot. I think this is unlikely; it sounds like a lot more work than AGI companies will actually do.
Good point that some race dynamics will make the in practice outcome worse than my ideal.
I agree that the problem is that race dynamics will cause some labs to skip various precautions, but even in this world where we have 0 dignity, I assign a non-trivial (though unacceptably low) chance to us succeeding at alignment, more like 20-50%. Even so, I agree that less racing would be good, because a 20-50% chance of success is very dangerous. It’s just that I believe we need 0 more insights into alignment: there is a reasonably tractable way to get an AI to share our values, which requires lots of engineering to automate the data pipeline safely but no new insights.
This isn’t a post on “how we could be safe even under arbitrarily high pressure to race”, this is a post on how early LessWrong and MIRI got a lot of things wrong such that we can assign much higher tractability to alignment happening, and thus good outcomes happening.
You are correct that more safety culture needs to happen in labs, it’s just that we could have AI progress at some rate without getting ourselves into a catastrophe.
So I agree with you for the most part that we need to slow down the race. I just think that we don’t need to go further and introduce outright stoppages on the grounds that we don’t know, in technical terms, how to align AIs.
Though see this post on the case for negative alignment taxes, which would really help the situation for us if we were in a race:
https://www.lesswrong.com/posts/xhLopzaJHtdkz9siQ/the-case-for-a-negative-alignment-tax
(We might disagree on how much time and money needs to be invested, but that’s a secondary crux.)
I have no specific comments on the details, I thought it was interesting and learned from it. I disagree in many places, but any counterpoints I might raise have been raised better elsewhere.
That said, I hope you’re right! That would be great. Every time the world turns out to be better geared towards success than we expect, it’s a great thing. Zero downside to that outcome except maybe some wasted effort on what turn out to be non-issues.
But also: the thing about fatal risks is that you need to avoid all of them, every time, or you die. If the fatal risk includes extinction, you need very, very high level confidence in success, against all plausible such risks, or your species will go extinct. The alternative is Russian roulette, which you will eventually lose if you keep on playing. In this kind of context, if there is disagreement among experts, or even well-reasoned arguments by outsiders, on whether an extinction risk is plausible, then the confidence in its implausibility is too low to think we should be ok with it.
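The Russian roulette point can be made concrete with a minimal sketch. It assumes every round is an independent trial with the same fixed fatal risk, which is an illustrative simplification rather than something the comment specifies; the function name and the example risk values are hypothetical.

```python
def survival_probability(per_round_risk: float, rounds: int) -> float:
    """Chance of surviving `rounds` independent trials with the same fatal risk.

    Assumption (illustrative only): rounds are independent and the
    per-round risk never changes.
    """
    if not 0.0 <= per_round_risk <= 1.0:
        raise ValueError("per_round_risk must be between 0 and 1")
    if rounds < 0:
        raise ValueError("rounds must be non-negative")
    # Surviving every round means avoiding the fatal outcome each time.
    return (1.0 - per_round_risk) ** rounds


if __name__ == "__main__":
    # One chamber in six is loaded; watch survival odds fall as play continues.
    for rounds in (1, 10, 50):
        print(rounds, survival_probability(1 / 6, rounds))
```

Under these assumptions, any nonzero per-round risk drives the survival probability toward zero as the number of rounds grows, which is the sense in which repeated play is eventually lost.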
In other words, if I were ever somehow in the position of being next to someone about to create an AGI with what I considered dangerous capabilities, and presented with these categories of reasons and arguments for why they are actually safe, I would take Oliver Cromwell’s tack: “I beseech you, in the bowels of Christ, think it possible you may be mistaken.” Even if you’re right, the arguments you’ve presented for your position are not strong enough to rely on.
So in this position, you are correct that my arguments aren’t strong enough for that, because while there is quite a bit of evidence for my thesis, it’s not overwhelming (though I do think that the evidence is enough to rule out MIRI’s models of AI doom entirely).
There is one important implication, though, that has to do with AI control: which models can be safely used. My answer is that the models trained on more human-values data are better used as trusted AIs to monitor untrusted AIs. This both raises the amount of work that can be done under control, by raising the threshold of intelligence that can be used, and allows stronger guarantees, because there is at least 1 trusted superintelligent model.
While I agree my AI alignment arguments aren’t strong enough to create an AGI, I think that the emerging AI control agenda is enough to justify creating an AGI with dangerous capabilities as long as you use control measures.
I would agree that these kinds of arguments significantly raise the bar for what level of capabilities an AGI would need to have before it’s too dangerous to create and use, as long as the control measures are used correctly and consistently.
I just don’t think that extends to ASI. I think almost anything that counts as AGI is very nearly ASI by default (not because of RSI, just because of hardware scaling ability), and I have high confidence that control measures will not be used consistently and correctly in practice.
We’d need to get more quantitative here about how much AI labor we can use for alignment before it’s too dangerous, and my answer is that we could get about 1-2 OOMs smarter than humans at inference where we could be confident in using them safely, and IMO close to an arbitrary number of copies of that AI, conditional on good control techniques being used.
To address 2 comments:
Yeah, this seems pretty load-bearing for the plan, and a lot of the reason I don’t have probabilities of extinction below 0.1-1% is because I am actually worried about labs not doing control measures consistently.
I assign more moderate probabilities than you do, in that I think the scenarios of labs doing control properly and not doing it properly are both somewhat plausible to me now, but yeah, it would really be high-value for labs to prepare themselves to do control work properly.
To address this:
Maybe initially, but critically, I think the evidence we will get for pre-AGI levels will heavily constrain our expectations of what an ASI will do re its alignment, and that we will learn a lot more about both alignment and control techniques when we get human-level models, and I think we can trust a lot of the evidence to generalize at least 2 OOMs up.
So I think a lot of the uncertainty will start becoming removed as AI scales up.
I agree with this:
Even without recursive self-improvement, it’s pretty easy to scale by several OOMs, and while there are enough bottlenecks to prevent FOOM, they are not enough to slow it down by 1 decade except in tail scenarios.
Excellent post! This is what we need more of on LW.
It was indeed long. I needed a drink and snack just to read it, and a few to respond point-by-point. Like your comment that sparked this post, my reply should become the basis of a full post—if and when I get the time. Thanks for sparking it.
I’ve tried to make each comment relatively self-contained for ease of reading.
Response to Lethality 3: We need to get alignment right on the ‘first critical try’
I found this an unfortunate starting point, because I found this claim the least plausible of any you make here. I’m afraid this relatively dramatic and controversial claim might’ve stopped some people from reading the rest of this excellent post.
Extended quote from a Beren post (I didn’t understand where you were quoting from prior to googling for it.)
Summary: We can use synthetic data to create simulated worlds and artificial honeypots.
I don’t think we can, and worse, I don’t think we would even if we could—it sounds really hard to do well enough.
I find this wildly improbable if we’re talking about similar levels of human and AGI intelligence trying to set up a simulated reality through synthetic data, and convincing honeypots. It will be like The Truman Show, or like humans trying to secure software systems against other humans: you’d have to think of everything they can think of in order to secure a system or fool someone with a simulated reality. Even a team of dedicated people can’t secure a system, nor would they be able to create a self-consistent reality for someone of similar intelligence. There are just too many places you could go wrong.
This helps clarify the way I’m framing my alignment optimistic claims: you can detect misaligned AGI only if the AGI is dumber than you (or maybe up to similar level if you have good interpretability measures it doesn’t know about). That alignment can persist through and past human level as long as it’s a relatively continuous progression, so that at beyond human level, the system itself can advise you on keeping it aligned at the next level/progression of intelligence.
To frame it differently: the alignment tax on creating a simulated self-consistent reality for a human-level AGI sounds way too high.
Maybe I’m misunderstanding; if I am, others probably will too.
I think we do need to get it right on a first critical try; I just think we can do that, and even for a few similar first tries.
Response to Lethality 6:
I agree with you that a pivotal act won’t be necessary to prevent unaligned AGI (except as AGIs proliferate to >100, someone might screw it up).
I think a pivotal act will be necessary to prevent unaligned humans using their personal intent aligned AGIs to seize control of the future for selfish purposes.
After our exchange on that topic on “If we solve alignment, do we die anyway?”, I agreed with your logic that exponential increase of material resources would be available to all AGIs. But I still think a pivotal act will be necessary to avoid doom. Telling your AGI to turn the moon into compute and robots as rapidly as possible (the fully exponential strategy) to make sure it could defend against any other AGI would be a lot like the pivotal act we’re discussing (and would count as one by the original definition). There would be other AGIs, but they’d be powerless against the sovereign or coalition of AGIs that exerted authority using their lead in exponential production. This wouldn’t happen in the nice multipolar future people envision in which people control lots of AGIs and they do wonderful things without anyone turning the moon into an army to control the future.
This is a separate issue from the one Hubinger addressed in “Homogeneity vs. heterogeneity in AI takeoff scenarios” (where the link is wrong, it just leads back to this post). There he’s saying that if the first AGI is successfully aligned, the following ones probably will be too. I agree with this. There he’s not addressing the risks of personal intent aligned AGIs under different humans’ control.
My final comment from that thread:
My concern is that a bad actor will be the first to go all-out exponential. Other, better humans in charge of AGI will be reluctant to turn the moon, much less the earth, into military/industrial production, and to upend the power structure of the world. The worst actors will, by default, be the first to go full exponential and ruthlessly offensive. Beyond that, I’m afraid the physics of the world does favor offense over defense. It’s pretty easy to release a lot of energy where you want it, and very hard to build anything that can withstand a nuke, let alone a nova. But the dynamics are more complex than that, of course. So I think the reality is unknown. My point is that this scenario deserves some more careful thought.
Response to AI fragility claims
Here I agree with you entirely. I think it’s fair to say that AGI isn’t safe by default, but the reasons you give and that excellent comment you quote show why safety of an AGI is readily achievable with reasonable care.
Response to Lethality 10
Your argument that alignment generalizes farther than capabilities is quite interesting. I’m not sure I’d make a claim that strong, but I do think alignment generalizes about as far as capabilities—both quite far once you hit actual reasoning or sapience and understanding.
I do worry about The alignment stability problem WRT long-term alignment generalization. I think reflective stability will probably prove adequate for superhuman AGI’s alignment stability, but I’m not nearly sure enough to want to launch a value-aligned AGI even if I thought initial alignment would work.
Response to Lethality 11
Back to whether a pivotal act is necessary. Same as #6 above—agree that it’s not necessary to prevent misaligned AGI, think it is to prevent misaligned humans with AGIs aligned to follow their instructions.
Response to Lethality 15
Here I very much agree with that statement from the original LoL, while also largely agreeing with your reasoning for why we can overcome that problem. I expect fast capability gains when we reach “Real AGI” that can improve its reasoning and knowledge on its own without retraining. And I expect its alignment properties to be somewhat different than “aligning” a tool LLM that doesn’t have coherent agency. But I expect the synthetic data approach and instruction-following as the core tenet of that new entity to establish reflective stability, which will help alignment if it’s approximately on.
Response to Lethalities 16 and 17
I agree that the analogy with evolution doesn’t go very far, since we’re a lot smarter and have much better tools for alignment than evolution. We don’t have its ability to do trial and error to a staggering level, but this analogy only goes so far as to say we won’t get alignment right with only a shoddy, ill-considered attempt. Even OpenAI is going to at least try to do alignment, and put some thought into it. So the arguments for different techniques have to be taken on their merits.
I also think we won’t rely heavily on RL for alignment, as this and other Lethalities assume. I expect us to lean heavily on Goals selected from learned knowledge: an alternative to RL alignment, for instance, by putting the prompt “act as an agent following instructions from users (x))” at the heart of our LLM agent proto-AGI.
Response to Lethalities 17 & 18
Agreed; language points to objects in the world adequately for humans that use it carefully, so I expect it to also be adequate for superhuman AGI that’s taking instructions from and so working in close collaboration with intelligent, careful humans.
Response to Lethality 21
You say, and I very much agree:
Except I expect this to be used to reference what humans mean, not what they value. I expect do-what-I-mean-and-check or instruction-following alignment strategies, and am not sure that full value alignment would work were it attempted in this manner. For that and other reasons, I expect instruction-following as the alignment target for all early AGI projects.
Response to Lethality 22
I halfway agree and find the issues you raise fascinating, but I’m not sure they’re relevant to alignment.
I think evopsych was only wrong about how humans form values as a result of how very wrong they were about how humans get their capabilities.
Thus, I halfway agree that people get their values largely from the environment. I think we get our values largely from the environment and get our values largely from the way evolution designed our drives. They interact in complex ways that make both absolutely critical for the resulting values.
After 2+ decades of studying how human brains solve complex problems, I completely agree. We have the same brain plan as other mammals, we just have more compute and way better data (from culture, and our evolutionary drive to pay close attention to other humans).
So, what is the “simple core” of human values you mention? Is it what people have written about human values? I’d pretty much agree that that’s a usable core, even if it’s not simple.
Response to Lethality 23
Corrigibility as anti-natural.
I very much agree with Eliezer that his definition as an extra add-on is anti-natural. But you can get corrigibility in a natural way by making corrigibility (correctability) itself, or the closely related instruction-following, the single or primary alignment target.
To your list of references, including my own, I’ll add one:
I think Max Harms’ Corrigibility as Singular Target sequence is the definitive work on corrigibility in all its senses.
Response to Lethality 24
I very much agree with Yudkowsky’s framing here:
I just wrote about this critical point in Conflating value alignment and intent alignment is causing confusion.
I agree with Yud that a CEV sovereign is not a good idea for a first attempt. As explained in 23 above, I very much agree with you and disagree with Yudkowsky that the corrigibility approach is also doomed.
Response to Lethality 25 - we don’t understand networks
I agree that interpretability has a lot of work to do before it’s useful, but I think it only needs to solve one problem: is the AGI deliberately lying?
Responses to Lethalities 28 and 29 - we can’t check whether the outputs of a smarter than human AGI are aligned
Here I completely agree—misaligned superhuman AGI would make monkeys of us with no problem, even if we did box it and check its outputs—which we won’t.
That’s why we need a combination of having it tell us honestly about its motivations and thoughts (by instructing our personal intent aligned AGI to do so) and interpretability to discover when it’s lying.
Response to Lethality 32 - language isn’t a complete representation of underlying thoughts, so LLMs won’t reach AGI
Language has turned out to be a shockingly good training set. So LLMs will probably enable AGI, with a little scaffolding and additional cognitive systems (each reliant on the strengths of LLMs) to turn them into language model cognitive architectures. See Capabilities and alignment of LLM cognitive architectures. This year-old post isn’t nearly the extent of the detailed reasons I think such brain-inspired synthetic cognitive architectures will achieve proto-AGI relatively soon, because I’m not sure enough that they’re our best chance at alignment to start advancing capabilities.
WRT the related but separate issue of language being an adequate reflection of their underlying thoughts to allow alignment and transparency:
It isn’t, except if you want it to be and work to make sure it’s communicating your thoughts well enough.
There’s some critical stuff here about whether we apply enough RL or other training pressures to foundation models/agents to make their use of language not approximately reflect their underlying thoughts (and prevent translucent thoughts).
Response to Lethality 39
Yudkowsky may very well be the smartest human whose thought I’ve personally encountered. That doesn’t make him automatically right and the rest of us wrong. Expertise, time-on-task, and breadth of thought all count for a lot, far more in sum than sheer ability-to-juggle-concepts. Arguments count way more than appeals to authority (although those should be taken seriously too, and I definitely respect Yud’s authority on the topic).
Conclusion: we already have adequate alignment techniques to create aligned AGI
I agree, with the caveat that I think we can pull off personal intent alignment (corrigibility or instruction-following) but not value alignment (CEV or similar sovereign AGI).
And I’m not sure, so I sure wish we could get some more people examining this, because people are going to try these alignment techniques, whether or not they work.
Excellent post! This is what we need more of in the alignment community: closely examining proposed alignment techniques.
Some thoughts on your excellent comment:
First, I fixed the link issue you saw.
I think the potential difference between you and me on whether synthetic data works to box an AI is whether the AI notices it’s in a simulation made via synthetic data. I also think that synthetic data isn’t intended to be applied post-training; it’s instead applied continuously throughout the training process.
I agree that if we had an AGI/ASI that was already misaligned, we’d have to take pretty extreme actions like mindwiping its memories and restarting the entire training process, but the point of synthetic data is to get it into a basin of alignment/corrigibility early on, before it can be deceptive.
I also think that real data will only be given to AGIs at the end of training as a way to ground them, so an AGI has no real way to know whether it’s subtly being changed in training or whether it’s in reality, since we control its data sources.
Controlling an AI’s data sources is a powerful way to control its values and capabilities, which is why I think that the tax for synthetic data alignment is actually pretty low.
Re Response to Lethality 6, I’m honestly coming around to your position as I think more and more, at least to the extent that I think your arguments are plausible and we need more research on that.
Re Response to Lethality 10, I was relying on both empirical evidence from today’s models and some theoretical reasons for why the phenomenon of alignment generalizing further than capabilities exists in general.
On the alignment stability problem, I like the post, and we should plausibly do interventions to stabilize alignment once we get it.
Re Response to Lethality 15, I agree that fast capability progress will happen, but deny the implication, both because large synthetic datasets on values/instruction following will already be in the AI when the fast capabilities progress happens (synthetic data about values is pretrained in very early), and because I’m more optimistic about alignment generalization than you are.
I liked your Real AGI post by the way.
One concrete way we have better tools than evolution is we have far more control over what their data sources are, and more generally we have far more inspectability and controllability over their data, especially the synthetic kind, which means we don’t have to create very realistic simulations, since for all the AI knows, we might be elaborately fooling it to reveal itself, and until the very end of training, it probably doesn’t even have specific data on our reality.
Re Response to Lethality 21, you are exactly correct on what I meant.
Re Response to Lethality 22, on this:
You’re not wrong that they got human capabilities very wrong, see this post for details:
https://www.lesswrong.com/posts/9Yc7Pp7szcjPgPsjf/the-brain-as-a-universal-learning-machine
But I’d also argue that this has implications for how complex human values actually are.
On this:
Yeah, this seems like a crux, since I think that a lot of how values are learned is basically via quite weak priors from evolutionary drives (the big one is probably which algorithm we have for being intelligent), and I put far more weight on socialization/environment data than you do: closer to 5-10% evolution at best, with 85-90% of our values determined by data and culture.
But at any rate, since AIs will be influenced by their data a lot, this means that it’s tractable to influence their values.
Agree with this mostly, though culture is IMO the best explanation for why humans succeed.
Yes, I am talking about what people have written about human values, but I’m also talking about future synthetic data where we write about what values we want the AI to have, and I’m also talking about reward information as a simple core.
One of my updates from Constitutional AI and GPT-4 handling what we value pretty well is that the claim that value is complicated is mostly untrue. In general I updated hard against evopsych explanations of what humans value, how humans got their capabilities and more, since the data we got is very surprising under evopsych hypotheses and less surprising under Universal Learning Machine hypotheses.
I agree with all of this for the record:
Re response to lethalities 28 and 29, I think you meant that you totally disagree, and I agree we’d probably be boned if a misaligned AGI/ASI was running loose, but my point is that the verification/generation gap that pervades so many fields also is likely to apply to alignment research, where it’s easier to verify if research is correct than to do it yourself.
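As a toy illustration of the verification/generation gap mentioned above, here is a minimal sketch using subset sum. The choice of problem and the names `verify` and `generate` are assumptions picked for illustration; nothing here claims that alignment research has this exact structure. Checking a proposed answer takes one pass over it, while finding one by brute force may require trying up to 2**n subsets.

```python
from itertools import combinations
from typing import Optional, Sequence, Tuple


def verify(numbers: Sequence[int], target: int, candidate: Sequence[int]) -> bool:
    """Cheap check: candidate holds distinct, in-range indices whose values sum to target."""
    if len(set(candidate)) != len(candidate):
        return False
    if any(i < 0 or i >= len(numbers) for i in candidate):
        return False
    return sum(numbers[i] for i in candidate) == target


def generate(numbers: Sequence[int], target: int) -> Optional[Tuple[int, ...]]:
    """Expensive search: try index subsets in order of size until one verifies."""
    indices = range(len(numbers))
    for size in range(len(numbers) + 1):
        for candidate in combinations(indices, size):
            if verify(numbers, target, candidate):
                return candidate
    return None  # No subset sums to the target.


if __name__ == "__main__":
    numbers = [3, 34, 4, 12, 5, 2]
    solution = generate(numbers, 9)
    print(solution, verify(numbers, 9, solution))
```

The design point is only that `generate` leans on `verify` as its inner check: a trusted verifier can confirm work produced by a far more expensive (or less trusted) search process.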
Re response to Lethality 32:
There is definitely a chance that RL or other processes make their use of language diverge more from their thoughts, so I’m a little worried about that, but I do think that AI words do convey their thoughts, at least for current LLMs.
I think my difference wrt you is I consider his models and arguments for AI doom essentially irreparable for the most part due to reality invalidating his core assumptions of how AGIs/ASIs work, and also how human capabilities and values work and are learned, so I don’t think Yud’s authority on the topic earns him any epistemic points.
My point was basically that you cannot figure out anything using a null string as input, for the same reason you cannot update on no evidence of something happening as if there was evidence for something happening.
Agree with the rest of it though.
Thanks for your excellent comment, which gave me lots of food for thought.