You’re allowed to care about things besides AI safety
I worry that a lot of AI safety / x-risk people have imbibed a vibe of urgency, impossibility, and overwhelming-importance to solving alignment in particular; that this vibe distorts thinking; that the social sphere around AI x-risk makes it harder for people to update.
Yesterday I talked to an AI safety researcher who said he’s pretty sure alignment will be solved by default. But whenever he talks to people about this, they just say “surely you don’t think it’s >99% likely? shouldn’t you just keep working for the sake of that 1% chance?”
Obviously there’s something real here: 1% of huge is huge. But equally—people should notice and engage when their top priority just got arguably 100x less important! And people should be socially-allowed to step back from pushing the boulder.
The idea that safety is the only thing that matters is pretty load-bearing for many people in this community, and that seems bad for epistemics and for well-being.
I’ve noticed similar feelings in myself—I think part of it is being stuck in the 2014 or even 2020 vibe of “jesus christ, society needs to wake up! AGI is coming, maybe very soon, and safety is a huge deal.” Now—okay, society-at-large still mostly doesn’t care, but—relevant bits of society (AI companies, experts, policymakers) are aware and many care a lot.
And if safety isn’t the only-overwhelming-priority, if it’s a tens of percents thing and not a 1-epsilon thing, we ought to care about the issues that persist when safety is solved—things like “how the hell does society actually wield this stuff responsibly”, “how do we keep it secure”, etc. And issues that frankly should have always been on the table, like “how do we avoid moral atrocities like torturing sentient AIs at scale”.
And on a personal & social level, we ought to care about investments that help us grapple with the situation—including supporting people as they step back from engaging directly with the problem, and try to figure out what else they could or should be doing.
Alignment by default is a minority opinion. Surveying the wide range of even truly informed opinions, it seems clear to me that we collectively don’t know how hard alignment is.
But that doesn’t mean technical alignment is the only thing worth caring about, even if you’re a utilitarian. Societal issues surrounding AI could be crucial for success, and support for people doing work on AI safety is crucial even on a model in which AI is the most important topic. There’s also public outreach and lobbying work to be done.
And of course everyone needs to prioritize their own emotional health so they can keep working on anything effectively.
Alignment by default is a minority opinion. Surveying the wide range of even truly informed opinions, it seems clear to me that we collectively don’t know how hard alignment is.
Totally. I think it’s “arguable” in the sense of inside-views, not outside-views, if that makes sense? Like: it could be someone’s personal vibe that alignment-by-default is >99%. Should they have that as their all-things-considered view? Seems wrong to me, we should be considerably more uncertain here.
But okay, then: we should have some spread of bets across different possible worlds, and put a solid chunk of probability on alignment by default. Even if it’s a minority probability, this could matter a lot for what you actually try to do!
For example: I think worlds with short timelines, hard takeoff, and no alignment-by-default are pretty doomed. It’s easy to focus on those worlds and feel drawn to plans that are pretty costly and incongruent with virtue and being good collaborators -- e.g. the “we should have One Winning AGI Project that’s Safe and Smart Enough to Get Things Right” theory of victory that brought you OpenAI.
My intuition is that worlds with at least one of those variables flipped tend to convergently favor solutions that are more virtuous / collaborative and are more likely to fail gracefully.
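(A purely illustrative sketch of the “spread of bets” point, with made-up probabilities and payoffs rather than anyone’s actual estimates: even a minority chunk of probability on alignment-by-default can flip which kind of plan looks best in expectation, because plans differ in how gracefully they fail across worlds.)

```python
# Toy expected-value comparison across possible worlds.
# All probabilities and payoffs are invented for illustration only.

worlds = {
    "short timelines + hard takeoff + no alignment-by-default": 0.15,
    "alignment roughly by default": 0.35,
    "alignment hard but tractable with sustained effort": 0.50,
}

strategies = {
    "all-in on One Winning AGI Project": {
        "short timelines + hard takeoff + no alignment-by-default": 5,
        "alignment roughly by default": -3,   # race dynamics, power concentration
        "alignment hard but tractable with sustained effort": 1,
    },
    "collaborative / governance-heavy portfolio": {
        "short timelines + hard takeoff + no alignment-by-default": 1,
        "alignment roughly by default": 4,
        "alignment hard but tractable with sustained effort": 3,
    },
}

for name, payoffs in strategies.items():
    ev = sum(p * payoffs[w] for w, p in worlds.items())
    print(f"{name}: expected value = {ev:.2f}")
```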
(I’m tired and not maximally articulate rn, but could try to say more if that feels useful.)
Refurbishing the classic AI safety argument
My initial exposure to AI safety arguments was via Eliezer posts. My mental model of his logic goes something like:
“0) AI training will eventually yield high-quality agents;
1) high-quality agents will be utility maximizers;
2) utility maximizers will monomaniacally optimize for some world-feature;
3) therefore utility maximizers will seek Omohundro goals;
4) they’ll be smarter than us, so this will disempower us;
5) value is fragile, so empowered AIs monomaniacally optimizing for their utility function fucks us over with very high probability”
VNM doesn’t do what you want. As folks like @Rohin Shah and @nostalgebraist have pointed out, point 2 (and therefore points 3 and 5) doesn’t really follow. A utility function can have lots of features! It can encode preferences about sequences of events, and therefore about patterns of behavior, so that the AI values interacting honorably with humans. It can value many world-features. The marginal utility can be diminishing in any particular valued feature, so that an AI-optimized world ends up richly detailed, rather than tiled with paperclips.
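To make that concrete, here’s a toy sketch (mine, not Rohin’s or nostalgebraist’s; the feature names and weights are made up): a perfectly coherent maximizer of a multi-feature utility with diminishing marginal returns spreads its budget across the things it values rather than tiling everything with one of them.

```python
# Toy illustration: a coherent utility over many world-features with
# diminishing marginal returns (log(1 + x)) in each feature.
# Feature names and weights are invented for illustration.
import math

weights = {"paperclips": 1.0, "honest_interaction": 2.0,
           "human_flourishing": 3.0, "art": 1.5}

def utility(allocation):
    return sum(w * math.log1p(allocation[f]) for f, w in weights.items())

budget = 100
monomaniacal = {f: 0.0 for f in weights}
monomaniacal["paperclips"] = float(budget)

# crude greedy allocation: give one unit at a time to the feature with the
# highest marginal utility, w / (1 + current level)
spread = {f: 0.0 for f in weights}
for _ in range(budget):
    best = max(weights, key=lambda f: weights[f] / (1.0 + spread[f]))
    spread[best] += 1.0

print("monomaniacal:", round(utility(monomaniacal), 2))
print("spread:      ", round(utility(spread), 2), spread)
```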
Without this misinterpretation of VNM, the classic argument gets weaker, and the threat model gets richer. On the safety side, you get a conditional argument, like “if monomaniacal optimization, then bad stuff.”
But there are other if-thens that lead to bad stuff—like “if someone instructs the AI to do bad things”, or “if AI helps authoritarians or terrorists better use their existing resources” or “if we hand over all meaningful control of human society to AIs”.
The argument gets weaker still given the evidence from the world we’re in. (This is again all kinda obvious-feeling, but I feel like some LW people would push against this.)
On the technical side, you all know it: LLMs are surprisingly low-agency and slow-takeoffy given their capability level. They’re human-language-native so it’s easy to specify human goals and they seem to understand them pretty well. Values training seems to work pretty well.
On the societal side, you have a world where SF has woken up to AGI and DC is waking up. Labs are aware of AI safety risks (and even some folks in DC are).
This all pushes back against points 2 and 5 (about AIs being monomaniacal optimizers that disregard human value).
In addition, takeoff might well be slow enough that we have lots of defensive tech investments, cyberdefense AIs, overseer AIs, etc. This pushes back against point 4 (powerful AI agents will be able to disempower humanity).
Here’s my update to the classic risk argument:
0) AI training will eventually yield high-quality agents;
1) These high-quality agents will be deployed at scale, but likely unequally.
2) They might be targeted at goals inimical to human values, either intentionally through malice, quasi-intentionally (e.g. through broad disempowerment), or unintentionally (because of shoddy safety work).
3) Without appropriate guardrails, they’ll seek Omohundro goals. (Or, if they’re intent-aligned, they may be directed to seek power at the expense of other human groups.)
4) At some capability level—likely between AGI and ASI—these agents will be able to deceive overseers & evaluators, including moderately weaker AIs. They’ll also plausibly be able to find weaknesses even in hardened infrastructure. This is much more worrying in worlds with harder takeoffs. In those worlds, vetted AI overseers might be much weaker than frontier AIs, and we won’t have much time to make defensive investments.
5) It’s not at all clear that this leads to a world with zero (or negative!) value. But these dynamics seem like clearly The Biggest Deal for how the long-term future goes, and so they’re well worth improving.
But “improving these dynamics” could mean improving governance, or AI deployment policies, or AI security—not just technical alignment work.
IMO, the best argument for AI safety looks something like this:
Eventually, within this century, someone will deploy AIs that are able to make humans basically worthless at ~all jobs, at a minimum.
Once you don’t depend on anyone else to survive, and once the society you are in is economically worthless or even has negative value from a selfish perspective (because its members can’t do anything relevant and cannot resist what you can do), there’s no reason not to steal from them or kill them anymore: their property/land/capital isn’t worthless, but their labor is. The argument gets stronger if AIs can develop technology that lets expropriation recover more of the value of that property.
Thus, you need AIs at this power level to terminally care about people/beings that have no power or leverage whatsoever, and to terminally value the survival of such beings, including humans.
This might or might not be difficult to achieve, but we don’t yet know how difficult it is to align AIs that could displace all humans at jobs. That’s worrisome given the empirical evidence of how powerful entities have treated those with much less power, and because the period from the end of World War II to today, in which powerful people have mostly treated less powerful people well, fundamentally rests on conditions that will break when AIs can take ~all the jobs.
We’ve never had to solve value alignment before: because everyone depends on everyone else for power, institutional design that is robust to value misalignment works, and in any case we can’t change people’s values.
Thus, there’s a reasonable chance of existential catastrophe happening if we build AI that can replace us at all jobs before we do serious alignment.
I’ll flag that I think pure LLMs are less relevant to takeover concerns than I once thought, so I am less optimistic than in 2024. I’ll also say that the current level of awareness is unfortunately not very predictive of stuff like “If an AI model was clearly hacking its data-center, would there be a strong response like pausing/shutting down the AI model?”, and Buck gives some good reasons why strong responses may not happen:
https://www.lesswrong.com/posts/YTZAmJKydD5hdRSeG/would-catching-your-ais-trying-to-escape-convince-ai
So while I don’t think value alignment is sufficient, I do think something like value alignment will be necessary in futures where AI controls everything and yet we have survived for more than a decade.
Thanks for the reply!
I notice I’m confused about how you think these thoughts slot in with mine. What you’re saying feels basically congruent with what I’m saying. My core points about orienting to safety, which you seem to agree with, are A) safety is necessary but not sufficient, and B) it might be easier to solve than other things we also need to get right. Maybe you disagree on B?
I will note—to me, your points 1/2 also point strongly towards risks of authoritarianism & gradual disempowerment. It feels like a non sequitur to jump from them to point 3 about safety—I think the natural follow-up from someone not experienced with the path-dependent history of AI risk discourse would be “how do we make society work given these capabilities?” I’m curious if you left out that consideration because you think it’s less big than safety, or because you were focusing on the story for safety in particular.
I notice I’m confused about how you think these thoughts slot in with mine. What you’re saying feels basically congruent with what I’m saying. My core points about orienting to safety, which you seem to agree with, are A) safety is necessary but not sufficient, and B) it might be easier to solve than other things we also need to get right. Maybe you disagree on B?
I don’t disagree on A or B, for the record, and while I’ve updated on AI alignment being harder than I used to think, I’m still relatively uncertain about how difficult AI alignment actually is.
I will note—to me, your points 1/2 also point strongly towards risks of authoritarianism & gradual disempowerment.
I actually agree with this, but I’ll flag that the amount of value alignment that is necessary from AIs does mean that authoritarianism is likely to be way less bad for most human values (not all), because I view democracy and a lot of other governance structures as an attempt to rely less on value alignment and more on incentives. For reasons I’ll get to later, though, I do think that value alignment is just way, way more necessary for you to survive under AI governance than under human governance, which brings us to this:
It feels like a non sequitur to jump from them to point 3 about safety—I think the natural follow-up from someone not experienced with the path-dependent history of AI risk discourse would be “how do we make society work given these capabilities?” I’m curious if you left out that consideration because you think it’s less big than safety, or because you were focusing on the story for safety in particular.
In a literal sense, society will continue to work, even if it is warped immensely by AIs. But the reason I left out the consideration of “how can we get a situation where we can maintain our survival without requiring the value alignment of the most powerful beings (by default AIs) once they take all human jobs?” is that I think it’s basically impossible to get an equilibrium where humans survive AI rule without assumptions about what the AIs’ utility functions/values are, unlike in traditional economic modelling.
The reasons for this are twofold:
1. The humans’ land/capital/property isn’t worthless, but their labor is, and thus from a selfish perspective the reason to keep them alive and in good condition is gone; you have no reason to invest in anything that helps them earn the means to buy goods and fuel their consumption. Indeed, stealing their property or killing them is, from a selfish perspective, valuable, at least all other things held equal.
Indeed, I like this quote from the intelligence curse (https://intelligence-curse.ai/defining/), explaining why powerful actors wouldn’t satisfy non-rich human demand and would instead satisfy rich human/machine demand:
A common rebuttal is that some jobs can never be automated because we will demand humans do them.
For example, teachers. Most parents would probably strongly prefer a real, human teacher to watch their kids throughout the day. But this argument totally misses the bigger picture: it’s not that there won’t be a demand for teachers, it’s that there won’t be an incentive to fund schools. This argument repeats ad nauseam for anything that invests in regular people’s productive capacity, any luxury that relies on their surplus income, or any good that keeps them afloat. By default, powerful actors won’t build things that employ humans or provide them resources, because they won’t have to.
2. Conflict isn’t costly for AIs against baseline humans once AIs take over, and thus there’s no ability to actually threaten them into giving us a share of the pie.
Conflict between AIs and humans (once AI has taken over), if it did happen, would at best look like the European conquests in Africa and the Americas from 1500-1900, and at worst more like humanity’s conflicts with wild animals, which have ended in annihilation for tens or hundreds of thousands of species, or more.
Or, put into shorter terms by Jeremiah England (https://x.com/JeremiahEnglan5/status/1929371594553438245):
It seems like there are two main reasons for treating someone well who you don’t care about: (1) they perform better for you when you do, (2) they will raise hell if you don’t.
This is why I said value alignment of AIs is ultimately necessary, and why you need AIs that terminally value the thriving/survival of beings that have zero or negative usefulness to the AI in an economic sense: institutional solutions don’t work if they are trivial and beneficial to subvert, and economics favors AIs killing all humans to get more land and capital if the AIs are selfish.
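(For what it’s worth, here’s a toy sketch of that economic logic with my own made-up numbers, not a model from this thread: a purely selfish agent compares the ongoing surplus from trading with a human against a one-off expropriation of the human’s property. Pre-AGI, human labor is valuable and conflict is costly, so trade wins; post-takeover, both collapse, and only terminal care for the human changes the answer.)

```python
# Toy comparison of trade vs. expropriation for a purely selfish agent.
# All numbers are invented for illustration only.

def trade_value(human_productivity, years=30, agent_share=0.5):
    # ongoing share of the surplus the human's labor generates
    return agent_share * human_productivity * years

def expropriation_value(property_value, recovery_rate=0.8, conflict_cost=0.0):
    # one-off seizure: part of the property's value, minus whatever the human
    # (and surrounding institutions) can make the agent pay in conflict
    return recovery_rate * property_value - conflict_cost

scenarios = {
    # (human productivity per year, property value, cost of conflict)
    "pre-AGI, strong institutions": (10.0, 100.0, 500.0),
    "post-takeover, humans cannot resist": (0.0, 100.0, 1.0),
}

for name, (prod, prop, cost) in scenarios.items():
    t = trade_value(prod)
    e = expropriation_value(prop, conflict_cost=cost)
    choice = "trade" if t > e else "expropriate"
    print(f"{name}: trade={t:.0f}, expropriation={e:.0f} -> {choice}")
```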
Apologies for the scrappiness of the below—I wanted to respond but I have mostly a scattering of thoughts rather than solid takes.
I like the intelligence curse piece very much—it’s what I meant to reference when I linked the Turing Trap above, but I couldn’t remember the title & Claude pointed me to that piece instead. I agree with everything you’re saying directionally! But I feel some difference in emphasis or vibe that I’m curious about.
-
One response I notice having to your points is: why the focus on value alignment?
“We could use intent alignment / corrigibility to avoid AIs being problematic due to these factors. But all these issues still remain at higher levels: the human-led organizations in charge of those AIs, the society in which those organizations compete, international relations & great-power competition.”
And conversely: “if we have value alignment, I don’t think there’s a guarantee that we wind up in a basin of convergent human values, so you still have the problem of—whose interests are the AIs being trained & deployed to serve? Who gets oversight or vetos on that?”
(Using quotes bc these feel more like ‘text completions from system 1’ than all-things-considered takes from system 2.)
-
Maybe there’s a crux here around how much we value the following states: AI-led world vs some-humans-led world vs deep-human-value-aligned world.
I have some feeling that AI-risk discourse has historically had a knee-jerk reaction against considering the following claims, all of which seem to me like plausible and important considerations:
It’s pretty likely we end up with AIs that care about at least some of human value, e.g. valuing conscious experience. (at least if AGIs resemble current LLMs, which seem to imprint on humans quite a lot.)
AI experiences could themselves be deeply morally valuable, even if the AIs aren’t very human-aligned. (though you might need them to at minimum care about consciousness, so they don’t optimize it away)
A some-humans-led world could be at least as bad as an AI-led world, and very plausibly could have negative rather than zero value.
I think this is partly down to founder effects where Eliezer either didn’t buy these ideas or didn’t want to emphasize them (bc they cut against the framing of “alignment is the key problem for all of humanity to solve together, everything else is squabbling over a poisoned banana”).
-
I also notice some internal tension where part of me is like “the AIs don’t seem that scary in Noosphere’s world”. But another part is like “dude, obviously this is an accelerating scenario where AIs gradually eat all of the meaningful parts of society—why isn’t that scary?”
I think where this is coming from is that I tend to focus on “transition dynamics” to the AGI future rather than “equilibrium dynamics” of the AGI future. And in particular I think international relations and war are a pretty high risk throughout the AGI transition (up until you get some kind of amazing AI-powered treaty, or one side brutally wins, or maybe you somehow end up in a defensively stable setup but I don’t see it, the returns to scale seem so good).
So maybe I’d say “if you’re not talking a classic AI takeover scenario, and you’re imagining a somewhat gradual takeoff,
my attention gets drawn to the ways humans and fundamental competitive dynamics screw things up
the iterative aspect of gradual takeoff means I’m less worried about alignment on its own. (still needs to get solved, but more likely to get solved.)”
Some thoughts on this:
One response I notice having to your points is: why the focus on value alignment?
“We could use intent alignment / corrigibility to avoid AIs being problematic due to these factors. But all these issues still remain at higher levels: the human-led organizations in charge of those AIs, the society in which those organizations compete, international relations & great-power competition.”
And conversely: “if we have value alignment, I don’t think there’s a guarantee that we wind up in a basin of convergent human values, so you still have the problem of—whose interests are the AIs being trained & deployed to serve? Who gets oversight or vetos on that?”
(Using quotes bc these feel more like ‘text completions from system 1’ than all-things-considered takes from system 2.)
You’ve correctly noted why lots of people may not be safe, even in a physical sense, even assuming value alignment/corrigibility/intent alignment/instruction following is solved. I do think you are correct that there is no guarantee that we wind up in a basin of convergence, and I’d even argue it’s unlikely to converge and will instead diverge, because there is no single moral reality; there are infinitely many correct moralities/moral realities. So yeah, the oversight problem is pretty severe.
Maybe there’s a crux here around how much we value the following states: AI-led world vs some-humans-led world vs deep-human-value-aligned world.
I have some feeling that AI-risk discourse has historically had a knee-jerk reaction against considering the following claims, all of which seem to me like plausible and important considerations:
It’s pretty likely we end up with AIs that care about at least some of human value, e.g. valuing conscious experience. (at least if AGIs resemble current LLMs, which seem to imprint on humans quite a lot.)
AI experiences could themselves be deeply morally valuable, even if the AIs aren’t very human-aligned. (though you might need them to at minimum care about consciousness, so they don’t optimize it away)
A some-humans-led world could be at least as bad as an AI-led world, and very plausibly could have negative rather than zero value.
I think this is partly down to founder effects where Eliezer either didn’t buy these ideas or didn’t want to emphasize them (bc they cut against the framing of “alignment is the key problem for all of humanity to solve together, everything else is squabbling over a poisoned banana”).
So I’ll state a couple of things here.
On your first point, I think that AGIs will probably be quite different from current LLMs, mostly because future AIs will have continuous learning, long-term memory, and better data/sample efficiency, and because the most accessible way to make AIs more capable will route through using more RL.
On your second point, this, as always, depends on your point of view, because once again there’s no consistent answer that holds across all valid moralities.
On your third point, again this depends on your point of view. But if I use my inferred model of human values, in which most humans strongly disvalue dying/being tortured, I agree that a some-humans-led world is at least as bad as an AI-led world, because I think most of what makes humans willing to be prosocial in situations where it’s low cost to do so is unfortunately held up by conditions that are absolutely shredded once some humans can simply stop depending on other human beings for a rich life, and not by what the human values internally.
I also notice some internal tension where part of me is like “the AIs don’t seem that scary in Noosphere’s world”. But another part is like “dude, obviously this is an accelerating scenario where AIs gradually eat all of the meaningful parts of society—why isn’t that scary?”
I think where this is coming from is that I tend to focus on “transition dynamics” to the AGI future rather than “equilibrium dynamics” of the AGI future. And in particular I think international relations and war are a pretty high risk throughout the AGI transition (up until you get some kind of amazing AI-powered treaty, or one side brutally wins, or maybe you somehow end up in a defensively stable setup but I don’t see it, the returns to scale seem so good).
Yes, this explains the dynamics of why I was more negative than you in your post. The point was to argue against people like @Matthew Barnett and a lot of others who argue that AI alignment doesn’t need to be solved because AIs will follow human-made laws and there will be enough positive-sum trades that the AIs, even if selfish, will decide not to kill humans.
And my point is that, unfortunately, in a post-AI-takeover world any trade between most humans and AIs would be closer to an AI giving away stuff in return for nothing given up by the human, because the human as a living entity has zero or even negative value from an economic perspective, and their land and property/capital are valuable but very easily stolen.
So if an AI didn’t terminally value the survival/thriving of people who have zero or negative value in an economic sense, then it’s quite likely that outright killing the human, or warping them severely, is unfortunately favorable to the AI’s interests.
In essence, I was trying to say that, conditional on not controlling the AI (which I think is what happens in the long run), you really do need assumptions about the AI’s values to survive, to a much greater extent than current humans need them within current human institutions.
So maybe I’d say “if you’re not talking a classic AI takeover scenario, and you’re imagining a somewhat gradual takeoff,
my attention gets drawn to the ways humans and fundamental competitive dynamics screw things up
the iterative aspect of gradual takeoff means I’m less worried about alignment on its own. (still needs to get solved, but more likely to get solved.)”
I do agree that in more gradual takeoffs, humans/competitive dynamics matter more and alignment is more likely to be solved, defusing the implications I made (with the caveat that the standard for what counts as an aligned AI will have to rise to extreme levels over time, in a way people are not prepared for). So I agree that the alignment problem is less urgent. But I do think that, at least in the long run and arguably even in the medium term, a lot of the problems of competitive dynamics and human flaws screwing things up will ultimately require, as a baseline, leaders who actually value the survival and thriving of people/beings that have zero power, because if you do not have this, none of the other proposed solutions work. And I think it’s really important to say that, compared to the 19th-21st century era in democracies, values are going to matter a lot more to whether humans thrive or die.
OK, cool, I think I understand where you’re coming from much better now. Seems like we basically agree and were just emphasizing different things in our original comments!
I’m in violent agreement that there’s a missing mood when people say “AIs will follow the law”. I think there’s something going on where people are like “but liberalism / decentralized competition have worked so well” and ignoring all the constraints on individual actors that make it so. Rule of law, external oversight, difficulty of conspiring with other humans, inefficiencies of gov’t that limit its ability to abuse power, etc.
And those constraints might all fall away with the AGI transition. That’s for a number of reasons: ownership of AGI could concentrate power; AGI complements existing power bases (e.g. government has the authority, but not currently a great ability, to selectively enforce laws to silence opponents at mass scale); and it reduces the need for conspirators. As you note, it brings down others’ value as trading partners & collaborators. And takeoff dynamics could make things less like an iterated game and more like a one-shot. *taps head* can’t be punished if all your opponents are dead.
(I’m guessing you’d agree with all this, just posting to clarify where my head is at)
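(A rough sketch of the iterated-game point, with toy prisoner’s-dilemma payoffs I made up: cooperation sustained by the threat of future punishment stops paying once the interaction effectively has one round.)

```python
# Iterated vs. one-shot cooperation with standard PD payoffs T > R > P > S.
# Payoffs and discount factor are invented for illustration only.

T, R, P, S = 5.0, 3.0, 1.0, 0.0   # temptation, reward, punishment, sucker

def value_of_cooperating(discount, rounds):
    # keep cooperating against a grim-trigger partner
    return sum(R * discount**t for t in range(rounds))

def value_of_defecting(discount, rounds):
    # grab T once, then get punished with P for the remaining rounds
    return T + sum(P * discount**t for t in range(1, rounds))

for rounds in [1, 5, 50]:
    coop = value_of_cooperating(0.95, rounds)
    defect = value_of_defecting(0.95, rounds)
    verdict = "cooperation holds" if coop > defect else "defection pays"
    print(f"rounds={rounds:3d}: cooperate={coop:6.1f}, defect={defect:6.1f} -> {verdict}")
```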
I guess these days the safety argument has shifted to “inner optimizers”, which I think means “OK fine, we can probably specify human values well enough that LLMs understand us. But what if the system learns some weird approximation of values—or, worse, conspires to fool us while secretly having other goals.” I don’t understand this well enough to have confident takes on it, but it feels like... a pretty conjunctive worry, a possibility worth guarding against but not a knockdown argument.
What’s up with incredibly successful geniuses having embarrassing & confusing public meltdowns? What’s up with them getting into naziism in particular?
Components of my model:
Selecting for the tails of success selects for weird personalities; moderate success can come in lots of ways, but massive success in part requires just a massive amount of drive and self-confidence. Bipolar people have this. (But more than other personality types?)
Endless energy & willingness to engage with stuff is an amazing trait that can go wrong if you have an endless pit of stupid internet stuff grabbing for your attention.
If you’re selected for overconfidence and end up successful, you assume you’re amazing at everything. (And you are in fact great at some stuff, and have enough taste to know it, so it’s hard to change your mind.)
Selecting for the tails of success selects for contrarianism? Seems plausible—one path to great success, at least, is to make a huge contrarian bet that pays off.
Nothing’s more contrarian than being a Nazi, especially if you’re trying to flip the bird to the Cathedral.
What’s up with incredibly successful geniuses having embarrassing & confusing public meltdowns? What’s up with them getting into naziism in particular?
Does this refer to anyone other than Elon?
But maybe the real question intended is: why would any part of the tech world side with Trumpian populism? You could start by noting that every modern authoritarian state (with at least an industrial level of technology) has had a technical and managerial elite who support the regime. Nazi Germany, Soviet Russia, and Imperial Japan all had industrial enterprises, and the people who ran them participated in the ruling ideology. So did those in the British empire and the American republic.
Our current era is one in which an American liberal world order, with free trade and democracy as universal norms, is splintering back into one of multiple great powers and civilizational regions. Liberalism no longer had the will and the power to govern the world; the power vacuum was filled by nationalist strongmen overseas; and now in America too, one has stepped into the gap left by weak late-liberal leadership and is creating a new regime governed by different principles (balanced trade instead of free trade, spheres of influence rather than universal democracy, etc.).
Trump and Musk are the two pillars of this new American order, and represent different parts of a coalition. Trump is the figurehead of a populist movement, Musk is foremost among the tech oligarchs. Trump is destroying old structures of authority and creating new ones around himself, Musk and his peers are reorganizing the entire economy around the technologies of the “fourth industrial revolution” (as they call it in Davos).
That’s the big picture according to me. Now, you talk about “public meltdowns” and “getting into naziism”. Again I’ll assume that this is referring to Elon Musk (I can’t think of anyone else). The only “meltdowns” I see from Musk are tweets or soundbites that are defensive or accusatory, and achieve 15 minutes of fame. None of it seems very meaningful to me. He feuds with someone, he makes a political statement, his fans and his haters take what they want, and none of it changes anything about the larger transformations occurring. It may be odd to see a near-trillionaire with a social media profile more like a bad-boy celebrity who can’t stay out of trouble, but it’s not necessarily an unsustainable persona.
As for “getting into naziism”, let’s try to say something about what his politics or ideology really are. Noah Smith just wrote an essay on “Understanding America’s New Right” which might be helpful. What does Elon actually say about his political agenda? First it was defeating the “woke mind virus”, then it was meddling in European politics, now it’s about DOGE and the combative politics of Trump 2.0.
I interpret all of these as episodes in the power struggle whereby a new American nationalism is displacing the remnants of the cosmopolitan globalism of the previous regime. The new America is still pretty cosmopolitan, but it does emphasize its European and Christian origins, rather than repressing them in favor of a secular progressivism that is intended to embrace the entire world.
In all this, there are echoes of the fascist opposition to communism in the 20th century, but in a farcical and comparatively peaceful form. Communism was a utopian secular movement that replaced capitalism and nationalism with a new kind of one-party dictatorship that could take root in any industrialized society. Fascism was a nationalist and traditionalist imitation of this political form, in which ethnicity rather than class was the decisive identity. They fought a war in which tens of millions died.
MAGA versus Woke, by comparison, is a culture war of salesmen versus hippies. Serious issues of war and peace, law and order, humanitarianism and national survival are interwoven with this struggle, because this is real life, but this has been a meme war more than anything, in which fascism and communism are just historical props.
Thanks for your thoughts!
I was thinking Kanye as well, hence being more interested in the general pattern. I really wasn’t intending to subtweet one person in particular—I have some sense of the particular dynamics there, though your comment is illuminating. :)
I wouldn’t generally dismiss an “embarrassing & confusing public meltdown” when it comes from a genius. Because I’m not a genius while he or she is, it’s probably me who is wrong rather than him. Well, except when the majority of comparable geniuses agrees with me rather than with him. Though geniuses are rare, and majorities are hard to come by. I still remember an (at the time) “embarrassing and confusing meltdown” by some genius.