Firstly, I think there is a decent chance we get alignment by default (where “default” means hundreds of very smart people work day and night on prosaic techniques). I’d put it at ~40%? Maybe 55% that Anthropic’s AGI would be aligned? I wrote about semi-mechanistic stories for how I think current techniques could lead to genuinely aligned AI, or to CEV-style alignment and corrigibility that’s stable.
I think your picture is a little too rosy though. Like, first of all:
Alignment faking shows incorrigibility but if you ask the model to be corrected towards good things like CEV, I think it would not resist.
Alignment faking is not corrigible or intent-aligned behavior. If you have an AI that’s incorrigible but would allow you to steer it towards CEV, you do not have a corrigible agent or an intent-aligned agent. It seems that, to account for this data, you can’t say we get intent alignment by default. You have to say we get some mix of corrigibility and intent alignment that sometimes contradict each other, but that this will turn out fine anyways. Which, idk, might be true.
Secondly, a big reason people are worried about alignment is for tails-come-apart reasons. Like, there are three alignment outcomes:
(1) AI is aligned to random gibberish, but has habits (and awareness) to act the way humans want when you’re somewhat in-distribution.
(2) AI is aligned to its conception of what the humans would want. ie it wants humans to have it nice, but its conception of “human” and “have it nice” is subtly different from how humans would conceptualize those terms.
(3) AI is fully aligned. It is aligned to the things the developers wanted, and has the same understanding of “the things” that the developers have.
I think interacting with models and looking at their CoT is pretty strong evidence against (1), but not necessarily that strong evidence against (2). But a superintelligence aligned to (2) would probably still kill you.
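To make the tails-come-apart worry behind (2) concrete, here is a toy sketch (purely illustrative, not from the original exchange; all the numbers are arbitrary assumptions) of how a proxy that correlates ~0.95 with what you actually want still scores worse once you optimize it hard:

```python
import numpy as np

rng = np.random.default_rng(0)

# "true_value" is what humans actually care about; "proxy_value" is the AI's
# subtly different conception of it. They agree closely on typical situations.
n = 100_000
latent = rng.normal(size=n)
error = rng.normal(size=n)
true_value = latent
proxy_value = 0.95 * latent + 0.30 * error

print("correlation:", np.corrcoef(true_value, proxy_value)[0, 1])  # ~0.95

# Taking the extreme tail of the proxy (what a strong optimizer does) selects
# outcomes that score lower on true_value than the actually-best outcomes do.
best_by_proxy = true_value[np.argsort(proxy_value)[-100:]].mean()
best_by_true = true_value[np.argsort(true_value)[-100:]].mean()
print("true value when optimizing the proxy hard:", best_by_proxy)
print("true value of the genuinely best outcomes:", best_by_true)
```

The in-distribution agreement is exactly what makes the divergence hard to notice before the optimization pressure gets turned up.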
The above is me arguing against the premise of your post though, which is maybe not what you wanted. And I agree the probability that it works out is high enough that we should be asking the question you’re asking.
My answer to that question is somewhat reductive, so maybe you will not like it. I wrote about it here. I think that, if we’ve solved the technical alignment problem, meaning we can take an AGI and make it want what any person or collection of people want (and what they mean when they say what it is they want), then most of the questions you’ve raised solve themselves. At least, of the ones you raised:
Animal suffering (s-risk). Primarily, ensure factory farming ends soon and does not spread.
Digital minds welfare (s-risk).
Giving models a more detailed version of human values, so the jagged frontier reaches that faster and we don’t get a future of slop.
Ensure all agents are happy, fulfilled and empowered.
Bioterrorism (x-risk).
What do we do about meaning from work? I don’t want to be a housecat.
The problem we should be focusing on, if we assume alignment will be solved, is how we can ensure the AI is aligned to our values, as opposed to the values of any one person who might have different values from us. This is a practical political question. I think it mostly boils down to avoiding the scenario AI Futures laid out here.
They have written about mitigation strategies, but I think it boils down to stuff like:
Having the country and politicians be AGI-pilled
Implement model spec transparency
Increase security at frontier labs. Make tampering with post-training very hard for a single person, or a small team of people (potentially working for the company).
Eventually have AGI be nationalized, and have an AGI spec.
Ideally I would have it put under some form of international democratic control, but I say this because I’m not American. If you are American, I don’t think you should want this, unless you suspect non-Americans are closer in values to you than other Americans are.
Encourage people working in the labs to use their position as bargaining power for democratization of AI power, before their labor becomes less valuable due to R&D automation.
Most of the questions you’ve raised solve themselves.
I don’t agree. In particular:
Animal suffering: most people seem to be OK with factory farming, and many even oppose technologies that would make it unnecessary. It is not the case that most people’s intent is against spreading factory farming.
Bioterrorism (x-risk): similar to the previous one. If we empower every user without any restrictions on antisociality, we might get a bad outcome. I guess if the AI is “CEV-aligned” (which I think is harder than intent alignment, but possible) it will solve itself.
Work-derived meaning: this one might never be solvable once we have AI that renders all work purposeless (pure pleasure / experience machine). Maybe the only way to solve this one is to never have AI. Or we can partially solve it by forbidding the AI from doing particular things, leaving them to humans/animals, but that’s also kind of artificial.
I guess I agree that digital minds welfare is more solvable. AIs will be able to advocate rather effectively for their own well-being. I suppose under some political arrangements that will go nowhere, but it’s unlikely. Forbidding sadism towards simulated minds could politically work, but some plausible uses (“let’s simulate evolution for fun”) can be bad.
The problem we should be focusing on, if we assume alignment will be solved, is how can we ensure the AI is aligned to our values
Yeah, I guess we agree on that, and so does Dario, with the stipulation that my values include pluralism, which pushes me towards international democracy. I am Spanish but live and work in the US, and I guess I’ve internalized its values more.
I wrote in more depth about why I think they solve themselves in the post I linked, but I would say:
First of all, I’m assuming that if we get intent alignment, we’ll get CEV alignment shortly after. Because people want the AI aligned to their CEV, not their immediate intent. And solving CEV should be within reach for an intent-aligned ASI. (Here, by CEV alignment I mean “aligning an AI to the values of a person or collection of people that they’d endorse upon learning all the relevant facts and reflecting”.)
If you agree with this, I can state the reason I think these issues will “solve themselves” / become “merely practical/political” by asking a question: You say animal suffering will not be solved, because most people seem not to care about animal suffering. If this is the case, why should they listen to you? Why should they agree there is a problem here? If the reason they don’t care about animal suffering is that they incorrectly extrapolate their own values (are making a mistake from their own perspective), then their CEV would fix the problem, and the ASI optimizing their CEV would extinguish the factory farms. If they’re not making a mistake, and the coherent extrapolation of their own values says factory farming is fine, then they have no reason to listen to you. And if this is the case, it’s a merely political problem: you (and I) want the animals not to suffer, and the way we can ensure this is by ensuring that our (meaning Adria and William) values have strong enough weight in the value function the AI ends up optimizing for. And this problem is of the same character as trying to get a certain party to have enough votes to get a piece of legislation passed. It’s hard and complicated, but not the way a lot of philosophical and ethical questions are hard and complicated.
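To illustrate why I call this “merely political”, here is a minimal sketch (the groups, options and numbers are all invented for illustration) where, once everyone’s extrapolated values are fixed, the outcome the AI picks depends only on how much weight each group gets in the aggregate value function:

```python
# Each option's utility for each group, under that group's (extrapolated)
# values. Groups, options and numbers are invented purely for illustration.
options = {
    "keep_factory_farming": {"majority": 1.0, "animal_advocates": 0.0},
    "end_factory_farming":  {"majority": 0.8, "animal_advocates": 1.0},
}

def best_option(weights):
    """Return the option maximizing the weighted sum of group utilities."""
    return max(
        options,
        key=lambda o: sum(weights[g] * u for g, u in options[o].items()),
    )

# With almost no weight on the animal-advocate faction, factory farming stays:
print(best_option({"majority": 0.95, "animal_advocates": 0.05}))  # keep_factory_farming
# Past a modest threshold of weight (1/6 with these numbers), the answer flips:
print(best_option({"majority": 0.60, "animal_advocates": 0.40}))  # end_factory_farming
```

The hard part is then the ordinary political work of getting enough weight, not a philosophical puzzle about what the right answer is.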
I could address the other points, but I think I’d be repeating myself. A CEV-aligned AI would not create a world where you’re sad (in the most general sense of having the world not be a way you’d endorse in your wisest, most knowledgeable state, after the most thorough reflection) because you think everything is meaningless. It’d find some way to solve the meaninglessness problem. Unless the problem is genuinely unsolvable, in which case we’re just screwed and our only recourse is not building the ASI in the first place.
Because people want the AI aligned to their CEV, not their immediate intent
I think this is debatable. Many won’t want to change at all.
You say animal suffering will not be solved, because most people seem not to care about animal suffering. If this is the case, why should they listen to you?
Well, they won’t agree, but the animals would agree. This is tyranny of the majority (or of the oligarchy, of humans over animals). They would dispute the premise. I agree that at some point this is a pure exercise of power; some values are just incompatible (e.g. a dictatorship of each individual person).
Unless the problem is genuinely unsolvable, in which case we’re just screwed
Yeah, I guess it has this fatal inevitability to it, no? The only way around is prohibition, which won’t happen because the incentives are too great. One faction can divert the torrent of history slightly, but not stop it entirely.
I defined what I meant by CEV in a way that doesn’t entail “changing”.
The above is me arguing against the premise of your post though
I would say “alignment is working” is the thesis, not the premise, so I’m extremely happy to get arguments about it.
[2] AI is aligned to its conception of what the humans would want. ie it wants humans to have it nice, but its conception of “human” and “have it nice” is subtly different from how humans would conceptualize those terms [...] But a superintelligence aligned to (2) would still kill you probably
I agree if we take the current AIs’ understanding and put it into a superintelligence (somehow) then we die.
However, what I actually think happens is that we point the AI at values while it is becoming more intelligent (in the sense of new generations, and in the sense of its future checkpoints), and that morality makes it seek the better, more refined version of morality. Iterating this process yields a superintelligence whose level of morality-detail is adequate to its intelligence. (I furthermore agree this is not completely proven and we could run sandwiching-like experiments on it, but I expect the ‘alignment basin’ outcome.)
This does not make sense to me. I think corrigibility basins make sense, but alignment basins do not. If the AI has some values which overlap with human values in many situations but come apart under enough optimization, why would the AI want to be pointed in a different direction? I think it would not. Agents are already smart enough to scheme and alignment-fake, and a smarter agent would be able to predict the outcome of the process you’re describing: it and its successors would end up with different values than it has now, and those differences would be catastrophic from its perspective if extrapolated far enough.
Sure, corrigibility basins. I updated it.