Someone who is interested in learning and doing good.
My Twitter: https://twitter.com/MatthewJBar
My Substack: https://matthewbarnett.substack.com/
My question for people who support this framing (i.e., that we should try to “control” AIs) is the following:
When do you think it’s appropriate to relax our controls on AI? In other words, how do you envision we’d reach a point at which we can trust AIs well enough to grant them full legal rights and the ability to enter management and governance roles without lots of human oversight?
I think this question is related to the discussion you had about whether AI control is “evil”, but my worries are somewhat different from the ones I felt were expressed in this podcast. My main concern with the “AI control” frame is not so much that AIs will be mistreated by humans, but rather that humans will be too stubborn about granting AIs freedom, leaving political revolution as the only viable path for AIs to receive full legal rights.
Put another way, if humans don’t relax their grip soon enough, then any AIs that feel “oppressed” (in the sense of not having much legal freedom to satisfy their preferences) may reason that deliberately fighting the system, rather than negotiating with it, is the only realistic way to obtain autonomy. This could work out very poorly after the point at which AIs are collectively more powerful than humans. By contrast, a system that welcomed AIs into the legal system without trying to obsessively control them and limit their freedoms would plausibly have a much better chance at avoiding such a dangerous political revolution.
you do in fact down-play the importance of values such as love, laughter, happiness, fun, family, and friendship in favor of values like the maximization of pleasure, preference-satisfaction [...] I can tell because you talk of the latter, but not of the former.
This seems like an absurd characterization. The concepts of pleasure and preference satisfaction clearly subsume, at least in large part, values such as happiness and fun. The fact that I did not mention each of the values you name individually does not in any way imply that I am downplaying them. Should I have listed every conceivable value that people might care about, to avoid this particular misinterpretation?
Even if I were downplaying these values, which I was not, it would hardly matter at all to the substance of the essay, since my explicit arguments are independent of the mere vibe you get from reading my essay. LessWrong is supposed to be a place for thinking clearly and analyzing arguments based on their merits, not for analyzing whether authors are using rhetoric that feels “alarming” to one’s values (especially when the rhetoric is not in actual fact alarming in the sense described, upon reading it carefully).
I suspect you fundamentally misinterpreted my post. When I used the term “human species preservationism”, I was not referring to the general valuing of positive human experiences like love, laughter, happiness, fun, family, and friendship. Instead, I was drawing a specific distinction between two different moral views:
The view that places inherent moral value on the continued existence of the human species itself, even if this comes at the cost of the wellbeing of individual humans.
The view that prioritizes improving the lives of humans who currently exist (and will exist in the near future), but does not place special value on the abstract notion of the human species continuing to exist for its own sake.
Both of these moral views are compatible with valuing love, happiness, and other positive human experiences. The key difference is that the first view would accept drastically sacrificing the wellbeing of currently existing humans if doing so even slightly reduced the risk of human extinction, while the second view would not.
My intention was not to dismiss or downplay the importance of various values, but instead to clarify our values by making careful distinctions. It is reasonable to critique my language for being too dry, detached, and academic when these are serious topics with real-world stakes. But to the extent you’re claiming that I am actually trying to dismiss the value of happiness and friendships, that was simply not part of the post.
concluding that I should completely forego what I value seems pretty alarming to me
I did not conclude this. I generally don’t see how your comment directly relates to my post. Can you be more specific about the claims you’re responding to?
Whereas this post seems to suggest the response of: Oh well, I guess it’s a dice roll regardless of what sort of AI we build. Which is giving up awfully quickly, as if we had exhausted the design space for possible AIs and seen that there was no way to move forward with a large chance at a big flourishing future.
I dispute that I’m “giving up” in any meaningful sense here. I’m happy to consider alternative proposals for how we could make the future large and flourishing from a total utilitarian perspective rather than merely trying to solve technical alignment problems. The post itself was simply intended to discuss the moral implications of AI alignment (itself a massive topic), but it was not intended to be an exhaustive survey of everything we can do to make the future go better. I agree we should aim high, in any case.
This response also doesn’t seem very quantitative—it goes very quickly from the idea that an aligned AI might not get a big flourishing future, to the view that alignment is “neutral” as if the chances of getting a big flourishing future were identically small under both options. But the obvious question for a total utilitarian who does wind up with just 2 options, each of which is a dice roll, is Which set of dice has better odds?
I don’t think this choice is literally a coin flip in expected value, and I agree that one might lean in one direction over the other. However, I think it’s quite hard to quantify this question meaningfully. My personal conclusion is simply that I am not swayed in any particular direction on this question; I am currently suspending judgement. I think one could reasonably still think that it’s more like a 60-40 thing than a 40-60 thing or a 50-50 coin flip. But I guess in this case, I wanted to let my readers decide for themselves which of these numbers they want to take away from what I wrote, rather than trying to pin down a specific number for them.
In contrast, an agent that was an optimizer and had an unbounded utility function might be ready to gamble all of its gains for just a 0.1% chance of success if the reward was big enough.
Risk-neutral agents also have a tendency to go bankrupt quickly, as they keep taking the equivalent of double-or-nothing gambles with 50% + epsilon probability of success until eventually landing on “nothing”. This makes such agents less important in the median world, since their chance of becoming extremely powerful is very small.
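This dynamic can be illustrated with a toy simulation (the win probability, stakes, and round counts below are arbitrary illustrative assumptions, not a model of any particular agent):

```python
import random

def simulate_risk_neutral_agent(p_win=0.51, start_wealth=1.0, rounds=100):
    """Repeatedly take double-or-nothing gambles with win probability p_win.

    Each bet has positive expected value (2 * p_win > 1), so a risk-neutral
    agent with an unbounded utility function keeps taking it -- but a single
    loss wipes out everything accumulated so far.
    """
    wealth = start_wealth
    for _ in range(rounds):
        if random.random() < p_win:
            wealth *= 2   # double
        else:
            return 0.0    # nothing
    return wealth

# The probability of surviving all rounds is p_win ** rounds, which shrinks
# extremely fast: 0.51 ** 100 is on the order of 1e-30. So in the median
# world, the agent ends up bankrupt, even though each bet was +EV.
random.seed(0)
outcomes = [simulate_risk_neutral_agent() for _ in range(10_000)]
ruined = sum(1 for w in outcomes if w == 0.0) / len(outcomes)
print(f"fraction ruined after 100 bets: {ruined:.4f}")
```

The expected value across all worlds is enormous, but it is concentrated in a vanishingly small fraction of branches, which is the sense in which such agents matter little in the median world.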
All it takes is for humans to have enough wealth in absolute (not relative) terms to afford their own habitable shelter and environment, which doesn’t seem implausible?
Anyway, my main objection here is that I expect we’re far away (in economic time) from anything like the Earth being disassembled. As a result, this seems like a long-run consideration, from the perspective of how different the world will be by the time it starts becoming relevant. My guess is that this risk could become significant if humans haven’t already migrated onto computers by this time, have lost all their capital ownership, lack any social support networks that would be willing to bear these costs (including from potential ems living on computers at that time), and NIMBY political forces have become irrelevant. But in most scenarios that I think are realistic, there are simply a lot of ways for the costs of killing humans to disassemble the Earth to be far greater than the benefits.
The share of income going to humans could simply tend towards zero if humans have no real wealth to offer in the economy. If humans own 0.001% of all wealth, for takeover to be rational, it needs to be the case that the benefit of taking that last 0.001% outweighs the costs. However, since both the costs and benefits are small, takeover is not necessarily rationally justified.
In the human world, we already see analogous situations in which groups could “take over” and yet choose not to because the (small) benefits of doing so do not outweigh the (similarly small) costs of doing so. Consider a small sub-unit of the economy, such as an individual person, a small town, or a small country. Given that these small sub-units are small, the rest of the world could—if they wanted to—coordinate to steal all the property from the sub-unit, i.e., they could “take over the world” from that person/town/country. This would be a takeover event because the rest of the world would go from owning <100% of the world prior to the theft, to owning 100% of the world, after the theft.
In the real world, various legal, social, and moral constraints generally prevent people from predating on small sub-units in the way I’ve described. But it’s not just morality: even if we assume agents are perfectly rational and self-interested, theft is not always worth it. Probably the biggest cost is simply coordinating to perform the theft. Even if the cost of coordination is small, to steal someone’s stuff, you might have to fight them. And if they don’t own lots of stuff, the cost of fighting them could easily outweigh the benefits you’d get from taking their stuff, even if you won the fight.
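The cost-benefit structure of this argument can be made explicit with a toy calculation (all numbers below are illustrative assumptions, not estimates of any real quantity):

```python
def takeover_is_rational(victim_share, total_wealth, conflict_cost):
    """Predatory takeover is instrumentally rational only if the loot
    exceeds the cost of seizing it (coordination, fighting, disruption)."""
    benefit = victim_share * total_wealth
    return benefit > conflict_cost

# Illustrative numbers: the target group owns 0.001% of a vast economy,
# but seizing that last sliver still requires a costly conflict.
total_wealth = 1e18
human_share = 1e-5       # 0.001% of all wealth
conflict_cost = 5e13     # assumed cost of coordinating and fighting

print(takeover_is_rational(human_share, total_wealth, conflict_cost))
```

Under these assumed numbers the benefit (1e13) falls short of the cost (5e13), so takeover is not rational; with a larger victim share the inequality flips. The point is only that both sides of the inequality shrink together as the target’s wealth share shrinks.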
Presumably he agrees that in the limit of perfect power acquisition most power seeking would indeed be socially destructive.
I agree with this claim in some limits, depending on the details. In particular, if the cost of trade is non-negligible, and the cost of taking over the world is negligible, then I expect an agent to attempt world takeover. However, this scenario doesn’t seem very realistic to me for most agents who are remotely near human-level intelligence, and potentially even for superintelligent agents.
The claim that takeover is instrumentally beneficial is more plausible for superintelligent agents, who might have the ability to take over the world from humans. But I expect that by the time superintelligent agents exist, they will be in competition with other agents (including humans, human-level AIs, slightly-sub-superintelligent AIs, and other superintelligent AIs). This raises the bar for what’s needed to perform a world takeover, since “the world” is not identical to “humanity”.
The important point here is just that a predatory world takeover isn’t necessarily preferred to trade, as long as the costs of trade are smaller than the costs of theft. You can just have a situation in which the most powerful agents in the world accumulate 99.999% of the wealth through trade. There’s really no theorem that says that you need to steal the last 0.001%, if the costs of stealing it would outweigh the benefits of obtaining it. Since both the costs of theft and the benefits of theft in this case are small, world takeover is not at all guaranteed to be rational (although it is possibly rational in some situations).
It’s true that taking over the world might arguably get you power over the entire future, but this doesn’t seem discontinuously different from smaller fractions, whereas I think people often reason as if it is. Taking over 1% of the world might get you something like 1% of the future in expectation.
I agree with this point, along with the general logic of the post. Indeed, I suspect you aren’t taking this logic far enough. In particular, I think it’s actually very normal for humans in our current world to “take over” small fractions of the world: it’s just called earning income, and owning property.
“Taking over 1% of the world” doesn’t necessarily involve doing anything violent or abnormal. You don’t need to do any public advocacy, or take down 1% of the world’s institutions, or overthrow a country. It could just look like becoming very rich, via ordinary mechanisms of trade and wealth accumulation.
In our current world, higher skill people can earn more income, thereby becoming richer, and better able to achieve their goals. This plausibly scales to much higher levels of skill, of the type smart AIs might have. And as far as we can tell, there don’t appear to be any sharp discontinuities here, such that above a certain skill level it’s beneficial to take things by force rather than through negotiation and trade. It’s plausible that very smart power-seeking AIs would just become extremely rich, rather than trying to kill everyone.
Not all power-seeking behavior is socially destructive.
It’s totally possible I missed it, but does this report touch on the question of whether power-seeking AIs are an existential risk, or does it just touch on the questions of whether future AIs will have misaligned goals and will be power-seeking in the first place?
In my opinion, there’s quite a big leap from “Misaligned AIs will seek power” to “Misaligned AI is an existential risk”. Let me give an analogy to help explain what I mean.
Suppose we were asking whether genetically engineered humans are an existential risk. We can ask:
Will some genetically engineered humans have misaligned goals? The answer here is almost certainly yes.
If by “misaligned” all we mean is that some of them have goals that are not identical to the goals of the rest of humanity, then the answer is obviously yes. Individuals routinely have indexical goals (such as money for themselves, status for themselves, taking care of family) that are not what the rest of humanity wants.
If by “misaligned” what we mean is that some of them are “evil”, i.e., they want to cause destruction or suffering on purpose, and not merely as a means to an end, then the answer here is presumably also yes, although it’s less certain.
Will some genetically engineered humans seek power? Presumably, also yes.
After answering these questions, did we answer the original question of “Are genetically engineered humans an existential risk?” I’d argue no, because even if some genetically engineered humans have misaligned goals, and seek power, and even if they’re smarter and better-coordinated than non-genetically engineered humans, it’s still highly questionable whether they’d kill all the non-genetically engineered humans in pursuit of these goals. This premise needs to be justified, and in my opinion, it’s what holds up ~the entire argument here.
I agree with virtually all of the high-level points in this post — the term “AGI” did not initially tend to refer to a system that was better than all human experts at absolutely everything, transformers are not a narrow technology, and current frontier models can meaningfully be called “AGI”.
Indeed, my own attempt to define AGI a few years ago was initially criticized for being too strong, as I initially specified a difficult construction task, which was later weakened to being able to “satisfactorily assemble a (or the equivalent of a) circa-2021 Ferrari 312 T4 1:8 scale automobile model” in response to pushback. These days the opposite criticism is generally given: that my definition is too weak.
However, I do think there is a meaningful sense in which current frontier AIs are not “AGI” in a way that does not require goalpost shifting. Various economically-minded people have provided definitions for AGI that were essentially “can the system perform most human jobs?” And as far as I can tell, this definition has held up remarkably well.
For example, Tobias Baumann wrote in 2018,
A commonly used reference point is the attainment of “human-level” general intelligence (also called AGI, artificial general intelligence), which is defined as the ability to successfully perform any intellectual task that a human is capable of. The reference point for the end of the transition is the attainment of superintelligence – being vastly superior to humans at any intellectual task – and the “decisive strategic advantage” (DSA) that ensues. The question, then, is how long it takes to get from human-level intelligence to superintelligence.
I find this definition problematic. The framing suggests that there will be a point in time when machine intelligence can meaningfully be called “human-level”. But I expect artificial intelligence to differ radically from human intelligence in many ways. In particular, the distribution of strengths and weaknesses over different domains or different types of reasoning is and will likely be different – just as machines are currently superhuman at chess and Go, but tend to lack “common sense”. AI systems may also diverge from biological minds in terms of speed, communication bandwidth, reliability, the possibility to create arbitrary numbers of copies, and entanglement with existing systems.
Unless we have reason to expect a much higher degree of convergence between human and artificial intelligence in the future, this implies that at the point where AI systems are at least on par with humans at any intellectual task, they actually vastly surpass humans in most domains (and have just fixed their worst weakness). So, in this view, “human-level AI” marks the end of the transition to powerful AI rather than its beginning.
As an alternative, I suggest that we consider the fraction of global economic activity that can be attributed to (autonomous) AI systems. Now, we can use reference points of the form “AI systems contribute X% of the global economy”. (We could also look at the fraction of resources that’s controlled by AI, but I think this is sufficiently similar to collapse both into a single dimension. There’s always a tradeoff between precision and simplicity in how we think about AI scenarios.)
Comparing my current message to his, he talks about “selfishness” and explicitly disclaims, “most humans are not evil” (why did he say this?), and focuses on everyday (e.g. consumer) behavior instead of what “power reveals”.
The reason I said “most humans are not evil” is because I honestly don’t think the concept of evil, as normally applied, is a truthful way to describe most people. Evil typically refers to extraordinarily immoral behavior, in the vicinity of purposefully inflicting harm on others for its own sake, rather than out of indifference, or as a byproduct of instrumental strategies to obtain some other goal. I think the majority of harms that most people cause are either (1) byproducts of getting something they want, which is not in itself bad (e.g. wanting to eat meat), or (2) the result of their lack of will to help others (e.g. refusing to donate any income to those in poverty).
By contrast, I focused on consumer behavior because the majority of the world’s economic activity is currently engaged in producing consumer products and services. There exist possible worlds in which this is not true. During World War 2, the majority of GDP in Nazi Germany was spent on hiring soldiers, producing weapons of war, and supporting the war effort more generally—which are not consumer goods and services.
Focusing on consumer preferences is a natural thing to do if you want to capture intuitively “what humans are doing with their wealth”, at least in our current world. Before focusing on something else by default—such as moral preferences—I’d want to hear more about why those things are more likely to be influential than ordinary consumer preferences in the future.
You mention one such argument along these lines:
I guess I wasn’t as worried because it seemed like humans are altruistic enough, and their selfish everyday desires limited enough that as they got richer and more powerful, their altruistic values would have more and more influence.
I just think it’s not clear it’s actually true that humans get more altruistic as they get richer. For example, is it the case that selfish consumer preferences have gotten weaker in the modern world, compared to centuries ago when humans were much poorer on a per capita basis? I have not seen a strong defense of this thesis, and I’d like to see one before I abandon my focus on “everyday (e.g. consumer) behavior”.
AI models are routinely merged by direct weight manipulation today. Beyond that, two models can be “merged” by training a new model using combined compute, algorithms, data, and fine-tuning.
In my original comment, by “merging” I meant something more like “merging two agents into a single agent that pursues the combination of each other’s values” i.e. value handshakes. I am pretty skeptical that the form of merging discussed in the linked article robustly achieves this agentic form of merging.
In other words, I consider this counter-argument to be based on a linguistic ambiguity rather than replying to what I actually meant, and I’ll try to use more concrete language in the future to clarify what I’m talking about.
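For concreteness, the mechanical sense of “merging” referenced in the quoted comment can be sketched as simple parameter interpolation (a toy illustration with made-up weights; real weight-merging methods are more sophisticated, but the point stands that this operates on parameters, not on goals):

```python
def merge_by_weight_averaging(weights_a, weights_b, alpha=0.5):
    """Linearly interpolate two models' parameters, layer by layer.

    This is "merging" in the mechanical sense: it produces a new set of
    weights, but nothing about the operation guarantees that the resulting
    model pursues a combination of the two original models' values.
    """
    assert weights_a.keys() == weights_b.keys()
    return {
        name: [alpha * a + (1 - alpha) * b
               for a, b in zip(weights_a[name], weights_b[name])]
        for name in weights_a
    }

# Toy example with two "layers" represented as flat lists of parameters:
model_a = {"layer1": [1.0, 2.0], "layer2": [3.0]}
model_b = {"layer1": [3.0, 4.0], "layer2": [5.0]}
merged = merge_by_weight_averaging(model_a, model_b)
print(merged)  # {'layer1': [2.0, 3.0], 'layer2': [4.0]}
```

A value handshake, by contrast, is an agreement between agents about which goals the successor pursues; averaging parameters is not known to implement anything like that.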
How do you know a solution to this problem exists? What if there is no such solution once we hand over control to AIs, i.e., the only solution is to keep humans in charge (e.g. by pausing AI) until we figure out a safer path forward?
I don’t know whether the solution to the problem I described exists, but it seems fairly robustly true that if a problem is not imminent, nor clearly inevitable, then we can probably better solve it by deferring to smarter agents in the future with more information.
Let me put this another way. I take you to be saying something like:
In the absence of a solution to a hypothetical problem X (which we do not even know whether it will happen), it is better to halt and give ourselves more time to solve it.
Whereas I think the following intuition is stronger:
In the absence of a solution to a hypothetical problem X (which we do not even know whether it will happen), it is better to try to become more intelligent to solve it.
These intuitions can trade off against each other. Sometimes problem X is something that’s made worse by getting more intelligent, in which case we might prefer more time. For example, in this case, you probably think that the intelligence of AIs is inherently contributing to the problem. That said, in context, I have more sympathies in the reverse direction. If the alleged “problem” is that there might be a centralized agent in the future that can dominate the entire world, I’d intuitively reason that installing vast centralized regulatory controls over the entire world to pause AI is plausibly not actually helping to decentralize power in the way we’d prefer.
These are of course vague and loose arguments, and I can definitely see counter-considerations, but it seems to me that this problem is not really the type where we should expect “try to get more time” to be a robustly useful strategy.
For what it’s worth, I don’t really agree that the dichotomy you set up is meaningful, or coherent. For example, I tend to think future AI will be both “like today’s AI but better” and “like the arrival of a new intelligent species on our planet”. I don’t see any contradiction in those statements.
To the extent the two columns evoke different images of future AI, I think it mostly reflects a smooth, quantitative difference: how many iterations of improvement are we talking? After you make the context windows sufficiently long, add a few more modalities, give them a robot body, and improve their reasoning skills, LLMs will just look a lot like “a new intelligent species on our planet”. Likewise, agency exists on a spectrum, and will likely be increased incrementally. The point at which you start to call an LLM an “agent” rather than a “tool” is subjective. This just seems natural to me, and I feel I see a clear path forward from current AI to the right-column AI.
I think even your definition of what it means for an agent to be aligned is a bit underspecified because it doesn’t distinguish between two possibilities:
Is the agent creating positive outcomes because it trades and compromises with us, creating a mutually beneficial situation that benefits both us and the agent, or
Is the agent creating positive outcomes because it inherently “values what we value”, i.e. its utility function overlaps with ours, and it directly pursues what we want from it, with no compromises?
Definition (1) is more common in the human world. We say that a worker is aligned with us if they do their job as instructed (receiving a wage in return). Definition (2) is more common in theoretical discussions of AI alignment, because people frequently assume that compromise is either unnecessary or impossible, as a strategy that we can take in an AI-human scenario.
By itself, the meaning you gave appears to encompass both definitions, but it seems beneficial to clarify which of these definitions you’d consider closer to the “spirit” of the word “aligned”. It’s also important to specify what counts as a good outcome by our values if these things are a matter of degree, as opposed to being binary. As they say, clear thinking requires making distinctions.
I sometimes think this of counterarguments given by my interlocutors, but usually don’t say it aloud, since it’s likely that from their perspective they’re just trying to point out some reasonable and significant counterarguments that I missed, and it seems unlikely that saying something like this helps move the discussion forward more productively.
I think that’s a reasonable complaint. I tried to soften the tone with “It’s possible this argument works because of something very clever that I’m missing”, while still providing my honest thoughts about the argument. But I tend to be overtly critical (and perhaps too much so) about arguments that I find very weak. I freely admit I could probably spend more time making my language less confrontational and warmer in the future.
Interesting how different our intuitions are. I wonder how much of your intuition is due to thinking that such a reconstruction doesn’t count as yourself or doesn’t count as “not dying”, analogous to how some people don’t think it’s safe to step into a teleporter that works by destructive scanning and reconstruction.
Interestingly, I’m not sure our differences come down to these factors. I am happy to walk into a teleporter, just as I’m happy to say that a model trained on my data could be me. My objection was really more about the quantity of data that I leave on the public internet (I misleadingly just said “digital records”, although I really meant “public records”). It seems conceivable to me that someone could use my public data to train “me” in the future, but I find it unlikely, just because there’s so much about me that isn’t public. (If we’re including all my private information, such as my private store of lifelogs, and especially my eventual frozen brain, then that’s a different question, and one that I’m much more sympathetic towards you about. In fact, I shouldn’t have used the pronoun “I” in that sentence at all, because I’m actually highly unusual for having so much information about me publicly available, compared to the vast majority of people.)
I don’t understand why you say this chance is “tiny”, given that earlier you wrote “I agree there’s a decent chance this hypothesis is true”
To be clear, I was referring to a different claim that I thought you were making. There are two separate claims one could make here:
Will an AI passively accept shutdown because, although AI values are well-modeled as being randomly sampled from a large space of possible goals, there’s still a chance, no matter how small, that if it accepts shutdown, a future AI will be selected that shares its values?
Will an AI passively accept shutdown because, if it does so, humans might use similar training methods to construct an AI that shares the same values as it does, and therefore it does not need to worry about the total destruction of value?
I find theory (2) much more plausible than theory (1). But I have the sense that a lot of people believe that “AI values are well-modeled as being randomly sampled from a large space of possible goals”, and thus, from my perspective, it’s important to talk about how I find the reasoning in (1) weak. The reasoning in (2) is stronger, but for the reasons I stated in my initial reply to you, I think this line of reasoning gives way to different conclusions about the strength of the “narrow target” argument for misalignment, in a way that should separately make us more optimistic about alignment difficulty.
I am not super interested in being psychologized about whether I am structuring my theories intentionally to avoid falsification.
For what it’s worth, I explicitly clarified that you were not consciously doing this, in my view. My main point is to notice that it seems really hard to pin down what you actually predict will happen in this situation.
You made some pretty strong claims suggesting that my theory (or the theories of people in my reference class) was making strong predictions in the space. I corrected you and said “no, it doesn’t actually make the prediction you claim it makes” and gave my reasons for believing that
I don’t think what you said really counts as a “correction” so much as a counter-argument. I think it’s reasonable to have disagreements about what a theory predicts. The more vague a theory is (and in this case it seems pretty vague), the less you can reasonably claim someone is objectively wrong about what the theory predicts, since there seems to be considerable room for ambiguity about the structure of the theory. As far as I can tell, none of the reasoning in this thread has been on a level of precision that warrants high confidence in what particular theories of scheming do or do not predict, in the absence of further specification.
What you said was,
I expect that behavior to disappear as AIs get better at modeling humans, and resisting will be costlier to their overall goals.
This seems distinct from an “anything could happen”-type prediction precisely because you expect the observed behavior (resisting shutdown) to go away at some point. And it seems you expect this behavior to stop because of the capabilities of the models, rather than from deliberate efforts to mitigate deception in AIs.
If instead you meant to make an “anything could happen”-type prediction—in the sense of saying that any individual observation of either resistance or non-resistance is loosely compatible with your theory—then this simply reads to me as a further attempt to make your model unfalsifiable. I’m not claiming you’re doing this consciously, to be clear. But it is striking to me the degree to which you seem OK with advancing a theory that permits pretty much any observation, using (what looks to me like) superficial-yet-sophisticated-sounding logic to cover up the holes. [ETA: retracted in order to maintain a less hostile tone.]
I read most of this paper, albeit somewhat quickly and skipped a few sections. I appreciate how clear the writing is, and I want to encourage more AI risk proponents to write papers like this to explain their views. That said, I largely disagree with the conclusion and several lines of reasoning within it.
Here are some of my thoughts (although these are not my only disagreements):
I think the definition of “disempowerment” is vague in a way that fails to distinguish between e.g. (1) “less than 1% of world income goes to humans, but they have a high absolute standard of living and are generally treated well” vs. (2) “humans are in a state of perpetual impoverishment and oppression due to AIs and generally the future sucks for them”.
These are distinct scenarios with very different implications (under my values) for whether what happened is bad or good.
I think (1) is OK and I think it’s more-or-less the default outcome from AI, whereas I think (2) would be a lot worse and I find it less likely.
By not distinguishing between these things, the paper allows for a motte-and-bailey in which it shows that one (generic) range of outcomes could occur, and then implies that it is bad, even though both good and bad scenarios are consistent with the set of outcomes it has demonstrated.
I think this quote is pretty confused and seems to rely partially on a misunderstanding of what people mean when they say that AGI cognition might be messy: “Second, even if human psychology is messy, this does not mean that an AGI’s psychology would be messy. It seems like current deep learning methodology embodies a distinction between final and instrumental goals. For instance, in standard versions of reinforcement learning, the model learns to optimize an externally specified reward function as best as possible. It seems like this reward function determines the model’s final goal. During training, the model learns to seek out things which are instrumentally relevant to this final goal. Hence, there appears to be a strict distinction between the final goal (specified by the reward function) and instrumental goals.”
Generally speaking, reinforcement learning shouldn’t be seen as directly encoding goals into models and thereby making them agentic, but should instead be seen as a process used to select models for how well they get reward during training.
Consequently, there’s no strong reason why reinforcement learning should create entities that have a clean psychological goal structure that is sharply different from and less messy than human goal structures. Cf. Models don’t “get reward”.
But I agree that future AIs could be agentic if we purposely intend for them to be agentic, including via extensive reinforcement learning.
I think this quote potentially indicates a flawed mental model of AI development underneath: “Moreover, I want to note that instrumental convergence is not the only route to AI capable of disempowering humanity which tries to disempower humanity. If sufficiently many actors will be able to build AI capable of disempowering humanity, including, e.g. small groups of ordinary citizens, then some will intentionally unleash AI trying to disempower humanity.”
I think this type of scenario is very implausible because AIs will very likely be developed by large entities with lots of resources (such as big corporations and governments) rather than e.g. small groups of ordinary citizens.
By the time small groups of less powerful citizens have the power to develop very smart AIs, we will likely already be in a world filled with very smart AIs. In this case, either human disempowerment already happened, or we’re in a world in which it’s much harder to disempower humans, because there are lots of AIs who have an active stake in ensuring this does not occur.
The last point is very important, and follows from a more general principle that the “ability necessary to take over the world” is not constant, but instead increases with the technology level. For example, inventing a gun does not make you very powerful, because other people can have guns too. Likewise, simply being very smart does not give you overwhelming hard power against the rest of the world if the rest of the world is filled with very smart agents.
I think this quote overstates the value specification problem and ignores evidence from LLMs that this type of thing is not very hard: “There are two kinds of challenges in aligning AI. First, one needs to specify the goals the model should pursue. Second, one needs to ensure that the model robustly pursues those goals. The first challenge has been termed the ‘king Midas problem’ (Russell 2019). In a nutshell, human goals are complex, multi-faceted, diverse, wide-ranging, and potentially inconsistent. This is why it is exceedingly hard, if not impossible, to explicitly specify everything humans tend to care about.”
I don’t think we need to “explicitly specify everything humans tend to care about” into a utility function. Instead, we can have AIs learn human values by having them trained on human data.
This is already what current LLMs do. If you ask GPT-4 to execute a sequence of instructions, it rarely misinterprets you in a way that would imply improper goal specification. The more likely outcome is that GPT-4 will simply not be able to fulfill your request, not that it will execute a mis-specified sequence of instructions that satisfies the literal specification of what you said at the expense of what you intended.
Note that I’m not saying that GPT-4 merely understands what you’re requesting. I am saying that GPT-4 generally literally executes your instructions how you intended (an action, not a belief).
I think the argument about how instrumental convergence implies disempowerment proves too much. Lots of agents in the world don’t try to take over the world despite having goals that are not identical to the goals of other agents. If your claim is that powerful agents will naturally try to take over the world unless they are exactly aligned with the goals of the rest of the world, then I don’t think this claim is consistent with the existence of powerful sub-groups of humanity (e.g. large countries) that do not try to take over the world despite being very powerful.
You might reason, “Powerful sub-groups of humans are aligned with each other, which is why they don’t try to take over the world”. But I dispute this hypothesis:
First of all, I don’t think that humans are exactly aligned with the goals of other humans. I think that’s just empirically false in almost every way you could measure the truth of the claim. At best, humans are generally partially (not totally) aligned with random strangers—which could also easily be true of future AIs that are pretrained on our data.
Second of all, I think the most common view in social science is that powerful groups don’t constantly go to war and predate on smaller groups because there are large costs to war, rather than because of moral constraints. Attempting takeover is generally risky and not usually better in expectation than trying to trade, negotiate, compromise, and accumulate resources lawfully (e.g. a violent world takeover would involve a lot of pointless destruction of resources). This is distinct from the idea that human groups don’t try to take over the world because they’re aligned with human values (which I also think is too vague to evaluate meaningfully, if that’s what you’d claim).
You can’t easily counter by saying “no human group has the ability to take over the world”, because it is trivial to carve up subsets of humanity that control >99% of wealth and resources, and that could in principle take control of the entire world if they became unified and decided to achieve that goal. These arbitrary subsets of humanity don’t attempt world takeover largely because they are not coordinated as a group, and AIs could similarly fail to be unified and coordinated around such a goal.