All good points, and I wanted to reply with some of them, so thanks. But there’s also another point where I might disagree more with LW folks (including you and Carl and maybe even Wei): I no longer believe that technological whoopsie is the main risk. I think we have enough geniuses working on the thing that technological whoopsie probably won’t happen. The main risk to me now is that AI gets pretty well aligned to money and power, and then money and power throws most humans by the wayside. I’ve mentioned it many times; the cleanest formulation is probably in this book review.
In that light, Redwood and others are just making better tools for money and power, to help align AI to their ends. Export controls are a tool of international conflict: if they happen, they happen as part of a package of measures which basically intensify the arms race. And even the CAIS letter is now looking to me like a bit of a PR move, where Altman and others got to say they cared about risk and then went on increasing risk anyway. Not to mention the other things done by safety-conscious money, like starting OpenAI and Anthropic. You could say the biggest things that safety-conscious money achieved were basically enabling stuff that money and power wanted. So the endgame wouldn’t be some kind of war between humans and AI, it would be AI simply joining up with money and power, and cutting out everyone else.
My response to this is similar to my response to Will MacAskill’s suggestion to work on reducing the risk of AI-enabled coups: I’m pretty worried about this, but equally or even more worried about broader alignment “success”, e.g., if AI was aligned to humanity as a whole, or everyone got their own AI representative, or something like that, because I generally don’t trust humans to have (or end up with) good values by default. See these posts for some reasons why.
However, I think it’s pretty plausible that there’s a technological solution to this (although we’re not on track for achieving it), for example if it’s actually wrong (from their own perspective) for the rich and powerful to treat everyone else badly, and AIs are designed to be philosophically competent and as a result help their users realize and fix their moral/philosophical mistakes.
Since you don’t seem to think there’s a technological solution to this, what do you envision as a good outcome?
It’s complicated. First, I think there’s enough overlap between different reasoning skills that we should expect a smarter than human AI to be really good at most such skills, including philosophy. So this part is ok.
Second, I don’t think philosophical skill alone is enough to figure out the right morality. For example, let’s say you like apples but don’t like oranges. Then when choosing between philosophical theory X, which says apples are better than oranges, and theory Y, which says the opposite, you’ll use your pre-theoretic intuition as a tiebreaker. And I think when humans do moral philosophy, they often do exactly that: they fall back on pre-theoretic intuitions to check what’s palatable and what isn’t. It’s a tree with many choices, and even big questions like consequentialism vs deontology vs virtue ethics may ultimately depend on many such case-by-case intuitions, not just pure philosophical reasoning.
Third, I think morality is part of culture. It didn’t come from the nature of an individual person: kids are often cruel. It came from constraints that people put on each other, and cultural generalization of these constraints. “Don’t kill.” When someone gets powerful enough to ignore these constraints, the default outcome we should expect is amorality. “Power corrupts.” Though of course there can be exceptions.
Fourth—and this is the payoff—I think the only good outcome is if the first smarter than human AIs start out with “good” culture, derived from what human societies think is good. Not aligned to an individual human operator, and certainly not to money and power. Then AIs can take it from there and we’ll be ok. But I don’t know how to achieve that. It might require human organizational forms that are not money- or power-seeking. I wrote a question about it some time ago, but didn’t get any answers.
First, I think there’s enough overlap between different reasoning skills that we should expect a smarter than human AI to be really good at most such skills, including philosophy. So this part is ok.
Supposing this is true, how would you elicit this capability? In other words, how would you train the AI (e.g., what reward signal would you use) to tell humans when they (the humans) are making philosophical mistakes, and to present humans with only true philosophical arguments/explanations? (As opposed to presenting the most convincing arguments, which may exploit flaws in humans’ psychology or reasoning, or telling the humans what they most want to hear or what’s most likely to get a thumbs-up or a high rating.)
Fourth—and this is the payoff—I think the only good outcome is if the first smarter than human AIs start out with “good” culture, derived from what human societies think is good.
“What human societies think is good” is filled with pretty crazy stuff, like wokeness imposing its skewed moral priorities and empirical beliefs on everyone via “cancel culture”, and religions condemning “sinners” and nonbelievers to eternal torture. Morality is Scary talks about why this is generally the case, and why we shouldn’t expect “what human societies think is good” to actually be good.
Also, wouldn’t “power corrupts” apply to humanity as a whole if we manage to solve technical alignment and not align ASI to the current “power and money”? Won’t humanity be the “power and money” post-Singularity, e.g., with each human or group of humans having enough resources to create countless minds and simulations to lord over?
I’m hoping that both problems (“morality is scary” and “power corrupts”) are philosophical errors that have technical solutions in AI design (i.e., AIs can be designed to help humans avoid/fix these errors), but this is highly neglected and seems unlikely to happen by default.
I’m not very confident, but will try to explain where the intuition comes from.
Basically I think the idea of “good” might be completely cultural. As in, if you extrapolate what an individual wants, that’s basically a world optimized for that individual’s selfishness; then there is what groups can agree on by rational negotiation, which is a kind of group selfishness, cutting out everyone who’s weak enough (so for example factory farming would be ok because animals can’t fight back); and on top of that there is the abstract idea of “good”, saying you shouldn’t hurt the weak at all. And that idea is not necessitated by rational negotiation. It’s just a cultural artifact that we ended up with, I’m not sure how.
So if you ask AI to optimize for what individuals want, and go through negotiations and such, there seems a high chance that the resulting world won’t contain “good” at all, only what I called group selfishness. Even if we start with individuals who strongly believe in the cultural idea of good, they can still get corrupted by power. The only way to get “good” is to point AI at the cultural idea to begin with.
You are of course right that culture also contains a lot of nasty stuff. The only way to get something good out of it is with a bunch of extrapolation, philosophy, and yeah I don’t know what else. It’s not reliable. But the starting materials for “good” are contained only there. Hope that makes sense.
Also to your other question: how to train philosophical ability? I think yeah, there isn’t any reliable reward signal, just as there wasn’t for us. The way our philosophical ability seems to work is by learning heuristics and ways of reasoning from fields where verification is possible (like math, or everyday common sense) and applying them to philosophy. And it’s very unreliable of course. So for AIs maybe this kind of carry-over to philosophy is also the best we can hope for.
Thanks for this explanation, it definitely makes your position more understandable.
and on top of that there is the abstract idea of “good”, saying you shouldn’t hurt the weak at all. And that idea is not necessitated by rational negotiation. It’s just a cultural artifact that we ended up with, I’m not sure how.
I can think of 2 ways:
1. It ended up there the same way that all the “nasty stuff” ended up in our culture, more or less randomly, e.g., through the kind of “morality as status game” talked about in Will Storr’s book, which I quote in Morality is Scary.
2. It ended up there via philosophical progress, because it’s actually correct in some sense.
If it’s 1, then I’m not sure why extrapolation and philosophy will pick out the “good” and leave the “nasty stuff”. It’s not clear to me why aligning to culture would be better than aligning to individuals in that case.
If it’s 2, then we don’t need to align with culture either—AIs aligned with individuals can rederive the “good” with competent philosophy.
Does this make sense?
So for AIs maybe this kind of carry-over to philosophy is also the best we can hope for.
It seems clear that technical design or training choices can make a difference (but nobody is working on this). Consider the analogy with the US vs Chinese education system, where the US system seems to produce a lot more competence and/or interest in philosophy (relative to STEM) compared to the Chinese system. And comparing humans with LLMs, it sure seems like they’re on track to exceed (top) human level in STEM while being significantly less competent in philosophy.
As in, if you extrapolate what an individual wants, that’s basically a world optimized for that individual’s selfishness; then there is what groups can agree on by rational negotiation, which is a kind of group selfishness, cutting out everyone who’s weak enough
I think it’s important to frame values around scopes of optimization, not just coalitions of actors. An individual then wants first of all their own life (rather than the world) optimized for that individual’s preferences. If they don’t live alone, their home might have multiple stakeholders, and so their home would be subject to group optimization, and so on.
At each step, optimization is primarily about the shared scope, and excludes most details of the smaller scopes under narrower control enclosed within. Culture and “good” would then have a lot to say about the negotiations on how group optimization takes place, but also about how the smaller enclosed scopes within the group’s purview are to be relatively left alone to their own optimization, under different preferences of corresponding smaller groups or individuals.
It may be good not to cut out everyone who’s too weak to prevent that, as the cultural content defining the rules for doing so is also a preference that wants to preserve itself, whatever its origin (such as being culturally developed later than evolution-given psychological drives). And individuals are, in particular, carriers of culture that’s only relevant for group optimization, so group-optimization culture would coordinate them into agreement on some things. I think selfishness is salient as a distinct thing only because the cultural content that concerns group optimization needs actual groups to get activated in practice, and without that activation, applying selfishness way out of its scope is about as appropriate as stirring soup with a microscope.
I think this is not obviously qualitatively different from a technical oopsie, and sufficiently strong technical success should be able to prevent this. But that’s partially because I think “money and power” is effectively an older, slower AI made of allocation over other minds, and both kinds of AI need to be strongly aligned to the flourishing of humans. Fortunately, humans with money and power generally want to use it to have nice lives, so on an individual level there should be incentive compatibility if we can find a solution which is general between them. I’m slightly hopeful that Richard Ngo’s work might bear on this, for example.