I find it a bit weird that this argument needs to be made at all. But it does, in current company at least.
One argument to make to current company is: have you actually met people? Or just your EA friends, some of the sweetest (and non-coincidentally most privileged) people to ever exist?
Outside of the EA sphere, I doubt people would be as drawn to the idea that maybe minds are safe and nice by default. There’s only a vague and arguable tendency for smarter people to be nicer. And even the nice smart people aren’t that nice. And most of history looks suffused with ruthless sociopathy to my eye.
I suspect most humans would be more likely to agree with “smarter humans are more ruthless”. And they’d be very suspicious of the argument “maybe intelligence produces niceness! So let’s make our alien LLMs way smarter and see!”.
I’d sort of like to give humanity as a whole more of a vote on whether we develop AGI as fast as humanly possible, because I think their intuitions would trend in the right directions. It takes a lot of privilege to think that minds are nice by default, according to my understanding of history and the current state of the world.
The other usual argument is “well, LLMs don’t really have goals, so that should be fine”. And maybe it would be, if developers weren’t busily making them more goal-directed! Assuming they’ll only work on the goals that authorized, wise individuals give them seems pretty optimistic. So does assuming that their current prosocial tendencies will keep counterbalancing their goal-directedness in a way you’ll like. That seems like an unlikely default if they ever do self-directed learning and so change outside of human control (another thing developers are excited to work on!).
This is essentially agreeing with you that LLMs won’t reach superintelligence in their current adorably goalless but also incompetent state.
I also composed a whole argument along the lines of: minds in general are ruthless and sociopathic by default, because caring about others, and not just getting stuff done once you’re competent and even vaguely goal-directed, are special properties that must be carefully designed in. But I’ll develop that and present it elsewhere, because it’s similar in form to your argument and less well phrased.
And most of history looks suffused with ruthless sociopathy to my eye.
This is the part that always confuses me about “alignment”, and it boils down to “aligned to whom?”
The AI lab CEOs? I wouldn’t trust most of them with control of a superhuman intelligence. On his best day, Dario Amodei looks like the protagonist of a Greek tragedy about to be destroyed by the gods. And I wouldn’t trust Sam Altman with my lunch money.
The government? I’m sure the government can be trusted with control of a superhuman intelligence. /s
The voting public? I’m sure this is really fun, unless you’re trans, or an immigrant, or belong to the Out Group. In which case, an AI aligned to the popular vote means you’re going to get “cured” of whatever society doesn’t like this year.
Now, I happen to personally believe that alignment of superintelligent, learning, goal-seeking entities is impossible. Not “difficult” or “it might take decades”, but flat-out impossible. An AI might like humans enough to keep us as pets, but that will be the AI’s decision, not ours. Dogs have approximately no control over their relationship with humans, and I figure that “humans as house pets” is the absolute best possible result of building superintelligent AI. My P(not doom|someone builds superintelligence) is about 1/6, and nearly all of that 1/6 is placed on “humans as house pets.”
But if we could control AIs? Those AIs would be controlled by powerful humans, the same sorts of people who had warm personal relationships with Epstein, and who had zero problems with Epstein trafficking and raping children. Given a choice between “superhuman AIs aligned to the Epstein class” and getting paperclipped, I’d go with the paperclips.
The only winning move is not to build superintelligence.
These are serious questions.
In short, I have a view of human nature that’s somewhat more optimistic than yours.
I don’t think leaving humans in charge of the world is obviously a win either. It does look to me like the arc of history is bending toward justice, but it’s happening slowly and in fits and starts. And we could all be dead before we get to a stable just society. This isn’t really an argument for building ASI; I think we probably shouldn’t, or at least not this fast.
But it looks like we’re going to.
The big advantage of building intent-aligned AGI (if we can manage to do that instead of building a misaligned ASI that kills us all) is that it makes being good to people vastly easier, essentially completely free. You just tell your ASI “okay fine, go make the world better for people. Tell me how you’d do it and I’ll choose some options”.
This lowers the bar for how good someone has to be to benefit humanity to just above zero. If they have more inclination to be helpful than harmful, that’s all it takes.
No human who’s ever lived has been in that position. Even the most powerful have had to worry about losing their power, and about themselves and their loved ones dying painfully and fairly soon.
So strangely, I wouldn’t trust Sam Altman with my lunch money, but I would guess he’d probably produce a very good future if he were to wind up god-emperor for eternity. The exposés I’ve seen don’t claim he’s a particularly vengeful person. We’ll just have to celebrate Samday every week :)
There are individuals with what I think of as a negative empathy-sadism balance, but they’re pretty rare. Sociopathic individuals do seem to be overrepresented in the halls of power, but even there I think we’ve got pretty good odds of minimally good people winding up in charge of ASI.
This is not a scenario I’m comfortable with. If a sadistic individual gets control of the future, it could be worse than death: a permanent state of suffering. I’d almost rather see an attempt at value-aligned AGI, but it would take a person both very selfless and competent to launch such a thing successfully.
I’m not sure how to up our odds; getting good people into power is an old challenge, and I don’t know of any new methods that would improve on the old ones.
What do you think of the Meaning Alignment Institute’s (MAI) “democratic fine-tuning (DFT)” work on eliciting moral graphs from populations? e.g. this post from Oct ’23 (primer here):
We report on the first run of “Democratic Fine-Tuning” (DFT), funded by OpenAI. DFT is a democratic process that surfaces the “wisest” moral intuitions of a large population, compiled into a structure we call the “moral graph”, which can be used for LLM alignment.
We show bridging effects of our new democratic process. 500 participants were sampled to represent the US population. We focused on divisive topics, like how and if an LLM chatbot should respond in situations like when a user requests abortion advice. We found that Republicans and Democrats come to agreement on values it should use to respond, despite having different views about abortion itself.
We present the first moral graph, generated by this sample of Americans, capturing agreement on LLM values despite diverse backgrounds.
We present good news about their experience: 71% of participants said the process clarified their thinking, and 75% gained substantial respect for those across the political divide.
Finally, we’ll say why moral graphs are better targets for alignment than constitutions or simple rules like HHH. We’ll suggest advantages of moral graphs in safety, scalability, oversight, interpretability, moral depth, and robustness to conflict and manipulation.
and their more recent full-stack alignment vision? I ask because I’ve asked myself the exact same question, and MAI’s actual DFT work above, which got Republicans and Democrats to agree on hot-button questions, seemed like the only line of work getting concrete results.
That said, I do lean towards your “the only winning move is not to build superintelligence” take, I suspect because I was born and raised in a country that until a few decades ago was a British colony, so I am biased to view your threat model description as obviously correct. So I’m guessing your answer to my question above is “who cares what MAI is working on, aligning ASI is impossible”?
What do you think of the Meaning Alignment Institute’s (MAI) “democratic fine-tuning (DFT)” work on eliciting moral graphs from populations?
Interesting! I will need to read through this in more detail, to get an idea of their approach. I’m glad someone is trying to do something in this space.
My objections to other approaches to democratic governance tend to break down roughly as follows:
I fear that democratic governance of superintelligence is about as likely to succeed as chimpanzees coming up with elaborate schemes to democratically manage Homo sapiens for the benefit of chimps. No matter how careful and clever the chimps are, they’re going to fail. They don’t even understand 99% of what’s going on, so how could they hope to manage it?
We will not, in practice, actually attempt any such governance scheme. The Chinese labs won’t, because China doesn’t even believe in Western notions of democracy and human rights. OpenAI has recently gutted its existing non-profit governance structure in order to reduce the risk of anyone attempting to govern it. Anthropic, out of all the labs, just might try. But the US government is currently trying to break Anthropic and bring them to heel by threatening to designate them as a supply chain risk (like Huawei) unless they agree to support “all legal uses,” potentially including things like fully autonomous killbots and domestic surveillance. The “supply chain risk” designation, as I understand it, would mean that no Anthropic customer would be allowed to do business with the US government. Perhaps I’ve misunderstood this specific situation, but in the end, Anthropic is subject to the people with the guns. And the people with the guns do not necessarily want democratic oversight. So in practice, no, the billionaires and politicians will almost certainly not agree to some clever democratic governance system.
Even if we could somehow control superintelligence and if we could somehow place it under democratic control, I don’t especially trust democratic control. Why? Well, I’m bi, my friends are trans, and I’m old enough to remember the 1980s. Had someone proposed a plan like, “LGBT+ people are mentally ill, and we can cure them by nonconsensually rewriting their minds,” it’s entirely possible that the public might have voted for that.
Finally, democracy is inherently unstable. About 20-25% of people appear to be “authoritarian followers”, which means they’re pretty happy to vote for a strongman. This number increases in times of fear and crisis. (It went up after 9/11, for example.) And another big chunk of the population can be moved by propaganda, or barely understand anything at all about politics. So historically, a number of 20th century democratic nations voted in the leaders who destroyed their democracy. This can be fixed; Germany is a democracy again today. But I expect democratic governance of superintelligence would be subject to similar risks, and in the case of superintelligence, you may not be able to fix your mistakes.
So a plan like MAI’s is critically dependent on a number of assumptions:
We can control superintelligence.
We have sufficiently good democratic control over the rich and the powerful to make sure they don’t wind up controlling superintelligence.
If the people do succeed in getting democratic control over superintelligence, they won’t vote it away, and they won’t democratically decide to do horrible things to unpopular minorities.
So from my perspective, MAI’s plan is a “hail Mary” plan. But we’re pretty deep in “hail Mary” territory, so I’m not opposed to placing bets on what look like unlikely outcomes.
Similarly, as far as I can tell, Dario Amodei’s current plan for Anthropic is “build superintelligence as fast as we can, do our very best to make it like humans, and expect to totally lose all human control within 5-20 years.” Personally, I feel like this is the least horrible version of the worst idea in human history. Like, obviously, no, we should not do this. But if we’re going to do this, Anthropic is at least thinking about the real issues. They know that humans are likely to lose control, but they’re basically hoping we can wind up as beloved house pets.
I still think the best plan is “just don’t build something vastly smarter than us with the ability to learn,[1] pursue goals and replicate.” One obvious objection to my plan is that we’re probably going to go right ahead and build superintelligence anyway. Which is why I am sympathetic to long-shot plans that might have an outside chance of working.
But I still prefer “just don’t build superintelligence.” Or, failing that, delay it. Emotionally, I’m treating it sort of like a diagnosis of terminal cancer for me and everyone I love. Even a remission of several years would be of immense value. And delay also gives some of the hail Mary plans a slightly better chance of working, or of the public realizing that maybe they don’t want to be “beloved house pets” of minds no human can possibly understand.
Learning is essentially a form of self-modification. Combined with differential replication of more successful entities, this gives you natural selection.
Yeah. The true nature of power is shown by more horrors of history than can be counted, but Epstein and factory farming are especially illustrative to me.
I’d sort of like to give humanity as a whole more of a vote on whether we develop AGI as fast as humanly possible, because I think their intuitions would trend in the right directions.
Well, would you say this if their intuitions were tending in the wrong directions?
The claim isn’t that minds are safe and nice by default. It’s that they’re not sociopaths.
If, in your view, most humans are basically ruthless sociopaths, then that’s good news, isn’t it? Sociopathic AIs would fit well into our culture. It would mean our laws and norms do a remarkably good job of restraining us, so there’d be hope they’d do the same for future AIs.
What I mean by “nice” is roughly the opposite of being a ruthless sociopath. It means treating other sentient beings well for its own sake.
Most humans are definitely not ruthless sociopaths. Sociopaths are estimated at about 10% of the population. And most of those aren’t even that ruthless; I think it’s a spectrum, like all biological mental differences. That leaves the conclusion that even NON-sociopathic humans are often pretty ruthless when they can get away with it, like when they hold a lot of power. But that’s pretty much beside the main point here, which is that we shouldn’t expect nice/non-ruthless behavior by default.
Laws and norms are not going to restrain an AGI that can route around them easily when it’s smart enough. They barely restrain human sociopaths who are bad at routing around them.
You might envision AGIs enforcing laws that include human well-being, and creating a nice, just society roughly for the same reasons we have one now. But if those AGIs don’t genuinely care about humans, I’d strongly expect humans to soon occupy the legal position of farm animals—or perhaps worse, of the many species we’ve driven extinct even though we like them, because they don’t provide our society any real benefit and we wanted to do other stuff with their habitats.
That’s why we’re kind of obsessed with aligning smarter-than-human AI so it genuinely and intrinsically cares about us, or at least reliably takes our orders as intended.
The claim isn’t that minds are safe and nice by default. It’s that they’re not sociopaths.
I thought one of the tenets of this debate is that there’s no in-between: either safe and nice (aligned) or everybody dies (not aligned). Humans are a good example—most are not pure psychopaths, and yet they do a ton of harm to each other all the time, and have threatened to destroy the species for decades. A set of much more powerful minds with even that level of misalignment would be a disaster, and if they’re slightly worse than humans, so much the worse.