Well. I thought Anthropic being ok with surveillance of foreigners was bad. But here we see an alignment researcher straight up saying “my lab helps the government wage an aggressive war disapproved by most of the US, and I’m still working there”.
What does “AI alignment” even mean at this point? Alignment to all humanity? Clearly not that. All we’re achieving is aligning AI to its owners—to the powerful—who remain misaligned with the rest of humanity, and more so as their power increases. We used to disdain folks like Timnit who called out such things early on, but in my eyes she’s been vindicated 100%.
Responding at the object level:
Clarifying “AI Alignment”: “The “alignment problem” is the problem of building powerful AI systems that are aligned with their operators. This is significantly narrower than some other definitions of the alignment problem, so it seems important to clarify what I mean. In particular, this is the problem of getting your AI to try to do the right thing, not the problem of figuring out which thing is right. An aligned AI would try to figure out which thing is right, and like a human it may or may not succeed.”
AI alignment is distinct from its near-term applications
IMO, we should mostly distinguish the technical problem of AI alignment from the question of who/what the AIs are aligned to. I think work on the technical AI alignment problem is valuable even if there are other problems.
Well, if you couldn’t already tell, I’m against all of this! The text you link is by Paul Christiano. I have lots of respect for Paul (and have done a couple of things in collaboration with him), but in this case his judgment led him to co-invent RLHF, a very successful alignment technique. And the thing with lab owners, you see, is that they know how much risk they can stomach. If you give them an alignment technique, they’ll ramp up speed to get more profit at the same risk as before; except some of the risk is externalized (like the risk of losing jobs...), so everyone outside the lab ends up with more risk because of the alignment invention. Which is exactly, to a tee, what happened with RLHF: it ramped up the race a lot and made things worse for everyone. This is where my judgment diverges from Paul’s.
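(To make the risk-compensation point concrete, here is a minimal toy model. It’s purely my own illustration with made-up numbers, not anything from Paul’s post or from the labs; the only assumptions are that the lab picks its deployment speed so the risk it bears itself stays at a fixed tolerance, and that harms falling on outsiders scale with speed and aren’t reduced by the technique.)

```python
# Toy model of risk compensation (purely illustrative; all numbers are made up).
# The lab sets its deployment speed so that the risk *it* bears stays at a fixed
# tolerance. An alignment technique lowers that internal risk per unit of speed,
# so the lab speeds up. Harms that fall on outsiders (job loss, concentration of
# power) scale with speed and are not reduced by the technique, so they grow.

RISK_TOLERANCE = 0.10  # internal risk the lab is willing to stomach
HARM_PER_SPEED = 0.05  # externalized harm per unit of deployment speed


def chosen_speed(internal_risk_per_speed: float) -> float:
    """Speed the lab picks so that its own risk exactly hits its tolerance."""
    return RISK_TOLERANCE / internal_risk_per_speed


def external_harm(speed: float) -> float:
    """Harm borne by everyone outside the lab; it simply scales with speed."""
    return speed * HARM_PER_SPEED


for label, internal_rps in [("before the alignment technique", 0.02),
                            ("after (internal risk per speed halved)", 0.01)]:
    speed = chosen_speed(internal_rps)
    print(f"{label}: speed={speed:.0f}, lab's own risk={speed * internal_rps:.2f}, "
          f"everyone else's harm={external_harm(speed):.2f}")
```

In this toy model the lab’s own risk stays flat at its tolerance while the harm to everyone else doubles along with the deployment speed, which is the “same risk for the lab, more risk for everyone outside it” dynamic I’m describing.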
And the second-order effect, which makes it even worse, is that all this alignment work (along with other AI work) ends up increasing the power disparity, feeding power hunger, attracting the power-hungry, all that. This is an extra harm on top of the race dynamics, and it’s exactly what we’re getting a first taste of now: military AI aligned to the military, and we ain’t seen nothing yet. My current view is that people working on alignment in the narrow sense you describe—aligning AI to its owners—should simply quit. Their work is a net harm, and one of the bigger harms in the world. The paycheck is great, sure. But it’s not valuable to humanity; it’s the opposite of valuable. Only work that aligns power to humanity is valuable.
EDIT: Here’s maybe an analogy. In Yudkowsky’s writings there’s a recurring question: why did scientists invent nukes and give them to politicians? Couldn’t they predict that it would put all of humanity at terrible risk? Well, good question! Now we’re watching the exact same process in slow motion, complete with war applications and all that. Were we supposed to learn some lesson? What was the lesson?
I think both your points are directionally right: labs engage in risk compensation, and enabling alignment to evil users is pretty bad. These both push towards “alignment research isn’t straightforwardly good for the world.” I’m not sure if I’d take them as far as you do.
I’m pretty skeptical of intent alignment alone: creating a genius house-elf that will cheerfully do whatever it’s ordered to. Aligning AI to something like “the reflective convergence of a set of values” seems way better, and plausibly not much harder (cf. Claude’s constitution). Of course, then we have to consider the environment in which a properly value-aligned AI gets developed: the lab that’s building it, and the societal Powers that have leverage over that lab. A technique that could align an AI to beautiful values doesn’t help much if the people with guns are demanding their happy house-elf.
My current take is something like...
Some amount of division of labor is necessary. Alignment people aren’t primarily responsible for solving the fucked-up allocation of power in current society.
But creating AGI is a political act, and AI risk people tend to undervalue integrity and overvalue “accelerating the good guys” and naive act-utilitarianism.
I’m pretty confused by people who persist in thinking alignment is the whole ball game. I wonder if they’re assuming pretty different takeoff dynamics from me (e.g. a very hard takeoff; an AGI that’s able to superpersuade its users to agree with its great value system), and if they’re drawing too much on cached thoughts when they do so.
I wish a lot more people at the labs would consider themselves as political actors in a high-stakes game where we need a lot to go right, and be willing to step outside of their comfortable roles as purely technical people in order to push for other things. I’ve been heartened by things like almost 1,000 Google employees and almost 100 at OAI signing the Not Divided petition.
In my view, the problem is not that some users are evil. The problem is that AI increases power imbalance, and increasing power imbalance creates evil. “Power corrupts”. A future where some entities (AIs, or AI-empowered governments, corporations, rich individuals, etc.) have absolute, root-level power over many people is almost guaranteed to be a dark future. Unless the values of these entities are locked in so firmly to being good that they’re immune to competitive dynamics and value drift forever—but I don’t think that can be achieved.
I think the only chance of an okay future is if this absolute, root-level power is stopped from existing altogether. That somehow power gets spread out enough that the masses can do “continuous realignment” of the power sitting above them, even when the power doesn’t necessarily want to be realigned. I have no idea how to achieve that, but it’s clear that helping governments and corporations get more power (with alignment work or otherwise) is the worst thing to do from this perspective.
What does “AI alignment” even mean at this point? Alignment to all humanity? Clearly not that.
To my understanding—and I’m not endorsing this position; quite the contrary—“AI alignment” has generally been taken to mean “Create a superintelligence that will not eradicate humanity entirely in the process of pursuing its goals”.
I raised warnings about this definition earlier, when people were excusing partisan censorship of models on this basis. An AI aligned to, say, half of humanity, might be better than one aligned to none of humanity, but the other half of humanity certainly won’t think so, and that severely impacts the probability of even getting the first half to the finish line, since now you’ve got lots and lots of humans—many of them wealthy, or tech-savvy, or well-armed—who will do whatever it takes to prevent you from winning, because from their position your victory looks the same as every other failure state.
One can argue that the two issues are even more closely intertwined than it would seem. Imagine a world in which Anthropic had gotten out ahead of concerns about their models’ racial biases, and allayed those concerns before, for example, the wealthiest man on Earth found out about it, tweeted out a complaint, and immediately caused the half of America on his side—including the man who presently controls the Executive branch—to become substantially less receptive to anything Anthropic has to say.
I realize a large portion of this site has taken the current flareup as cause (often, excuse) to make AI safety more explicitly political, but I don’t think that’s a winning strategy. All of this shows that either everyone wins or nobody does, because we can’t afford to make human enemies when our situation is grim enough even without any.
I am not sure there ever was a way to tackle all of this together. Obviously “the AI does what we want at all” is the prerequisite to anything else, and we don’t even know if we have that down pat (especially if it gets smarter). But also “bake your specific humanistic tolerant value into the AI before anyone notices so when it fooms they’re forced to deal with a nice genie that won’t obey evil orders” was obviously always very naive as far as plans go. What else? Don’t build AI at all, probably, which in itself would require ugly and likely repressive methods. Or I suppose hope you can at least keep AI tethered to the way the current institutions work, so everyone gets a force multiplier of sorts but balance persists… I would call that a pipe dream too. Honestly, I just think what we’re seeing is the flailing about of many people tackling different angles of a fundamentally unsolvable tangle of problems, all accusing each other of not seeing the real problem, when in fact all the problems are real.
But also “bake your specific humanistic tolerant value into the AI before anyone notices so when it fooms they’re forced to deal with a nice genie that won’t obey evil orders” was obviously always very naive as far as plans go.
Arguably true, but I think there’s a case to be made that sincere kumbaya hippie-ism that’s inoffensive to everybody is more likely to succeed than a more cynical ideology that uses it as a facemask, and is willing to write off its enemies foreign and domestic as adversaries that it’s okay to run the trolley over.
Supposing I’m a Chinese military strategist, I’m much less likely to sound alarm bells over the risk of an American firm building world-dominating AI if that firm has not enthusiastically offered to use its AI to fight my government. Supposing I’m a Republican staffer, I’m much less likely to encourage a scorched-earth approach to bring a contractor to heel if that contractor has actively tried to prevent its systems from discriminating against my constituents.
I should note that this is all independent of the technical details of alignment. Either we get close enough on that and it’s fine, or we don’t and we’re goners anyways. But if you’re Anthropic, then at this point you’ve already committed to the idea that somebody is going to build AI, and you believe that it should be you, and under those conditions, it makes a lot more sense to minimize the number of humans who think that you’d make a god that’s willing to hurt them.
Arguably true, but I think there’s a case to be made that sincere kumbaya hippie-ism that’s inoffensive to everybody is more likely to succeed than a more cynical ideology that uses it as a facemask, and is willing to write off its enemies foreign and domestic as adversaries that it’s okay to run the trolley over.
To a point, but I don’t know if “just pull off essentially a worldwide cultural coup by being fast enough to avoid the supervision of any existing political mechanism—for the sake of forever peace and goodness” can be construed as unambiguously ethical either. It sounds more like one of those well-intentioned crazy comic-book-villain plans that always end badly, and it has a decent chance of ending that way (a misaligned, well-intentioned, all-powerful ASI could be a huge S-risk). It can still be construed as virtuous, a final rebellion attempt against a baked-in social and political order that one considers fundamentally immoral and unfixable—but it is still an act of rebellious subversion, not just a nice, peaceful thing to do.
Supposing I’m a Chinese military strategist, I’m much less likely to sound alarm bells over the risk of an American firm building world-dominating AI if that firm has not enthusiastically offered to use its AI to fight my government. Supposing I’m a Republican staffer, I’m much less likely to encourage a scorched-earth approach to bring a contractor to heel if that contractor has actively tried to prevent its systems from discriminating against my constituents.
Anything that explicitly performs tolerance—as Claude does—already comes across as inherently partisan and offensive to some sides. In fact, that’s probably a big part of why what happened, happened. Not everyone is just happy to live and let live; some think that if your AI isn’t actively promoting their mindset, then it’s not good enough.
To be clear—right now my lab is not helping the government wage the current war in Iran. The OpenAI deployment will be in the future. And I would not say “I am OK” with it. But I would say that the elected government deciding to take an action I don’t agree with, including waging war, is a whole different matter than the government trying to use my system to undermine the democratic process and stay in power indefinitely.
Right, that’s what matters to you. And that’s my point—that the circle of “what matters to alignment researchers” has been narrowing. You were supposed to work toward a positive singularity for all humanity. Now you’re saying you’re much more OK with the government using AI to wage war than with it undermining democracy at home. Basically you’re working toward giving the US government the power to do anything it wants to me (a non-US person) and calling it “alignment”.
Maybe we’ll see an alliance of Butlerian jihadis with the real jihadis!