I think this is a very good critique of OpenAI’s plan. However, to steelman the plan, I think you could argue that advanced language models will be sufficiently “generally intelligent” that they won’t need very specialized feedback in order to produce high quality alignment research. As e.g. Nate Soares has pointed out repeatedly, the case of humans suggests that a system’s capabilities can sometimes generalize way past the kinds of problems that it was explicitly trained to do. If we assume that sufficiently powerful language models will therefore have, in some sense, the capabilities to do alignment research, the question then becomes how easy it will be for us to elicit these capabilities from the model. The success of RLHF at eliciting capabilities from models suggests that by default, language models do not output their “beliefs”, even if they are generally intelligent enough to in some way “know” the correct answer. However, addressing this issue involves solving a different and I think probably easier problem (ELK/creating language models which are honest), rather than the problem of how to provide good feedback in domains where we are not very capable.
I agree with most of these claims. However, I disagree about the level of intelligence required to take over the world, which makes me overall much more scared of AI/doomy than it seems like you are. I think there is at least a 20% chance that a superintelligence with +12 SD capabilities across all relevant domains (esp. planning and social manipulation) could take over the world.
I think human history provides mixed evidence for the ability of such agents to take over the world. While almost every human in history has failed to accumulate massive amounts of power, relatively few have tried. Moreover, when people have succeeded at quickly accumulating lots of power/taking over societies, they often did so with surprisingly small strategic advantages. See e.g. this post; I think that an AI that was both +12 SD at planning/general intelligence and social manipulation could, like the conquistadors, achieve a decisive strategic advantage without having to have some kind of crazy OP military technology/direct force advantage. Consider also Hitler’s rise to power and the French Revolution as cases where one actor/a small group of actors was able to surprisingly rapidly take over a country.
While these examples provide some evidence in favor of it being easier than expected to take over the world, overall, I would not be too scared of a +12 SD human taking over the world. However, I think that the AI would have some major advantages over an equivalently capable human. Most importantly, the AI could download itself onto other computers. This seems like a massive advantage, allowing the AI to do basically everything much faster and more effectively. While individually extremely capable humans would probably greatly struggle to achieve a decisive strategic advantage, large groups of extremely intelligent, motivated, and competent humans seem obviously much scarier. Moreover, as compared to an equivalently sized group of equivalently capable humans, a group of AIs sharing their source code would be able to coordinate among themselves far better, making them even more capable than the humans.
Finally, it is much easier for AIs to self modify/self improve than it is for humans to do so. While I am skeptical of foom for the same reasons you are, I suspect that over a period of years, a group of AIs could accumulate enough financial and other resources that they could translate these resources into significant cognitive improvements, if only by acquiring more compute.
While the AI has the disadvantage relative to an equivalently capable human of not immediately having access to a direct way to affect the “external” world, I think this is much less important than the AI’s advantages in self replication, coordination, and self improvement.
You write that even if the mechanistic model is wrong, if it “has some plausible relationship to reality, the predictions that it makes can still be quite accurate.” I think that this is often true, and true in particular in the case at hand (explicit search vs not). However, I think there are many domains where this is false, where there is a large range of mechanistic models which are plausible but make very false predictions. This depends roughly on how much the details of the prediction vary depending on the details of the mechanistic model. In the explicit search case, it seems like many other plausible models for how RL agents might mechanistically function imply agent-ish behavior, even if the model is not primarily using explicit search. However, this is because the agent must accomplish the training objective, which heavily constrains the space of possible behaviors. In questions where the prediction space is less constrained to begin with (e.g. questions about how the far future will go), different “mechanistic” explanations (for example, thinking that the far future will be controlled by a human superintelligence vs an alien superintelligence vs evolutionary dynamics) imply significantly different predictions.
I think the NAH does a lot of work for interpretability of an AI’s beliefs about things that aren’t values, but I’m pretty skeptical about the “human values” natural abstraction. I think the points made in this post are good, and relatedly, I don’t want the AI to be aligned to “human values”; I want it to be aligned to my values. I think there’s a pretty big gap between my values and those of the average human even subjected to something like CEV, and that this is probably true for other LW/EA types as well. Human values as they exist in nature contain fundamental terms for the in group, disgust based values, etc.
Human bureaucracies are mostly misaligned because the actual bureaucratic actors are also misaligned. I think a “bureaucracy” of perfectly aligned humans (like EA but better) would be well aligned. RLHF is obviously not a solution in the limit, but I don’t think it’s extremely implausible that it is outer aligned enough to work, though I am much more enthusiastic about IDA.
Good point, post updated accordingly.
I think making progress on ML is pretty hard. In order for a single AI to self improve quickly enough that it changed timelines, it would have to improve close to as fast as the speed at which all of the humans working on it could improve it. I don’t know why you would expect to see such superhuman coding/science capabilities without other kinds of superintelligence.
I think the world modelling improvements from modern science and IQ raising social advances can be analytically separated from changes in our approach to welfare. As for non consensual wireheading, I am uncertain as to the moral status of this, so it seems like partially we just disagree about values. I am also uncertain as to the attitude of Stone Age people towards this—while your argument seems plausible, the fact that early philosophers like the Ancient Greeks were not pure hedonists in the wireheading sense but valued flourishing seems like evidence against this, suggesting that favoring non consensual wireheading is downstream of modern developments in utilitarianism.
The claim about Stone Age people seems probably false to me—I think if Stone Age people could understand what they were actually doing (not at the level of psychology or morality, but at the purely “physical” level), they would probably do lots of very nice things for their friends and family, in particular give them a lot of resources. However, even if it is true, I don’t think the reason we have gotten better is because of philosophy—I think it’s because we’re smarter in a more general way. Stone Age people were uneducated and had less good nutrition than us; they were literally just stupid.
Having had a similar experience, I strongly endorse this advice. Actually optimizing for high quality relationships in modern society looks way different than following the social strategies that didn’t get you killed in the EEA.
I think this is probably true; I would assign something like a 20% chance of some kind of government action in response to AI aimed at reducing x-risk, and maybe a 5-10% chance that it is effective enough to meaningfully reduce risk. That being said, 5-10% is a lot, particularly if you are extremely doomy. As such, I think it is still a major part of the strategic landscape even if it is unlikely.
Why should we expect that as the AI gradually automates us away, it will replace us with better versions of ourselves rather than non-sentient, or minimally non-aligned, robots who just do its bidding?
I don’t think we have time before AGI comes to deeply change global culture.
This is true probably for some extremely high level of superintelligence, but I expect much stupider systems to kill us if any do; I think human level ish AGI is already a serious x risk, and humans aren’t even close to being intelligent enough to do this.
Why do you expect that the most straightforward plan for an AGI to accumulate resources is so illegible to humans? If the plan is designed to be hidden to humans, then it involves modeling them and trying to deceive them. But if not, then it seems extremely unlikely to look like this, as opposed to the much simpler plan of building a server farm. To put it another way, if you planned using a world model as if humans didn’t exist, you wouldn’t make plans involving causing a civil war in Brazil. Unless you expect the AI to be modeling the world at an atomic level, which seems computationally intractable particularly for a machine with the computational resources of the first AGI.
This seems unlikely to be the case to me. However, even if this is the case and so the AI doesn’t need to deceive us, isn’t disempowering humans via force still necessary? Like, if the AI sets up a server farm somewhere and starts to deploy nanotech factories, we could, if not yet disempowered, literally nuke it. Perhaps this exact strategy would fail for various reasons, but more broadly, if the AI is optimizing for gaining resources/accomplishing its goals as if humans did not exist, then it seems unlikely to be able to defend against human attacks. For example, if we think about the ants analogy, ants are incapable of harming us not just because they are stupid, but because they are also extremely physically weak. If humans are faced with physically powerful animals, even if we can subdue them easily, we still have to think about them to do it.
Check out CLR’s research: https://longtermrisk.org/research-agenda. They are focused on answering questions like these because they believe that competition between AIs is a big source of s-risk.