I don’t believe that reducing s-risks from AI involves substantially different things than those you’d need to deal with AI alignment.
I’d recommend checking out this post critiquing this view, if you haven’t read it already. Summary of the counterpoints:
(Intent) alignment doesn’t seem sufficient to ensure an AI makes safe decisions about subtle bargaining problems in a situation of high competitive pressure with other AIs. I don’t expect the kinds of capabilities progress that is incentivized by default to suffice for us to be able to defer these decisions to the AI, especially given path-dependence on feedback from humans who’d be pretty naïve about this stuff. (Cf. this post—you need the human feedback at the bottom to be sufficiently high quality to avoid garbage-in, garbage-out problems even if you’ve solved the hard parts of alignment.)
To the extent that solving all of intent alignment is too intractable, focusing on subsets of alignment that are especially likely to avoid s-risks—e.g. preventing AIs from intrinsically valuing frustrating others’ preferences—might be promising. I don’t think mainstream alignment research prioritizes these.
I do think that alignment solutions which try to solve value alignment have more of a chance of causing s-risks than those which solve corrigibility, in particular because getting the AI to care about the same things humans value is pretty close to getting the AI to actively dislike things that humans value, and if even one component of human values is pessimized, that seems extremely bad even if all the other parts are optimized.
Some promising interventions against s-risks that I’m aware of are:
Figure out what’s going on with bargaining solutions. Nash, Kalai, or Kalai-Smorodinsky? Is there one that is privileged in some impartial way?
Is there some sort of “leader election” algorithm over bargaining solutions?
Do surrogate goals work, are they cooperative enough?
Will neural-net based AIs be comprehensible to each other, if so, what does the open source game theory say about how conflicts will play out?
And of course CLR’s research agenda.
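To make the first item above concrete, here is a toy comparison of the Nash and Kalai-Smorodinsky solutions on an invented feasible set (the frontier and all numbers are made up for illustration, not taken from any of the work referenced above):

```python
# Toy Pareto frontier u2 = 1 - u1**2 with disagreement point (0, 0);
# the feasible set is invented purely for illustration.
points = [(i / 100_000, 1.0 - (i / 100_000) ** 2) for i in range(100_001)]

# Nash bargaining solution: maximize the product of gains over the
# disagreement point.
nash = max(points, key=lambda p: p[0] * p[1])

# Kalai-Smorodinsky solution: the frontier point where each player gets the
# same fraction of their ideal (best feasible) payoff.
ideal1 = max(p[0] for p in points)
ideal2 = max(p[1] for p in points)
ks = min(points, key=lambda p: abs(p[0] / ideal1 - p[1] / ideal2))

print("Nash:", nash)  # ~ (0.57735, 0.66667)
print("KS:  ", ks)    # ~ (0.61803, 0.61803)
```

On this frontier the two conventions genuinely disagree about who gets what, which is the sense in which the choice of bargaining solution is not payoff-irrelevant.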
Interpretability research is probably interesting for both, but otherwise I don’t see a lot of overlap in research topics. Maybe comparing the CLR research agenda against an alignment research agenda could help quantify the overlap a bit more. (There is probably a lot of transferability in the required skills though – game theory, ML, etc.)
I don’t see how any of these actually help reduce s-risk. Like, if we know some bargaining solutions lead to terrible outcomes for everyone and others lead to everyone being super happy, so what? It’s not like we can tremendously influence the bargaining solution our AI and those it meets settle on after reflection.
In the tractability footnote above I make the case that it should at least be vastly easier than influencing the utility functions of all AIs enough to make alignment succeed.
Yeah, I expect that if you make a superintelligence it won’t need humans to tell it the best bargaining math it can use. You are trying to do better than a superintelligence at a task it is highly incentivized to be good at, so you are not going to beat the superintelligence.
Secondly, you need to assume that the pessimization of the superintelligence’s values would be bad, but in fact I expect it to be just as neutral as the optimization.
I don’t care about wars between unaligned AIs, even if they do often have them. Their values will be completely orthogonal to my own, so their inverses will be too. Even in wars between aligned and unaligned humans (Hitler, for example), suffering that I would trade the world to stop does not happen.
Also, wars end; it’d be very weird if you got two AIs warring with each other for eternity. If both knew this was the outcome (or placed some amount of probability on it), why would either of them start the war?
People worried about s-risks should be worried about some kinds of partial alignment solutions, where you get the AI aligned enough to care about keeping humans (or other things that are morally relevant) around, but not aligned enough to care if they’re happy (or satisfying any other of a number of values), so you get a bunch of things that can feel pain in moderate pain for eternity.
I expect that if you make a superintelligence it won’t need humans to tell it the best bargaining math it can use

I’m not a fan of idealizing superintelligences. 10+ years ago that was the only way to infer any hard information about worst-case scenarios. Assume perfect play from all sides, and you end up with a fairly narrow game tree that you can reason about. But now it’s a pretty good guess that superintelligences will be more advanced successors of GPT-4 and such. That tells us a lot about the sort of training regimes through which they might learn bargaining, and what sorts of bargaining solutions they might employ completely unreflectively in specific situations. We can reason about what sorts of training regimes will instill which decision theories in AIs, so why not the same for bargaining?
If we think we can punt the problem to them, then we need to make sure they reflect on how they bargain and the game-theoretic implications of that. We may want to train them to seek out gains from trade, as is useful in a generally cooperative environment, rather than to seek out exploits, as would be useful in a more hostile environment.
If we find that we can’t reliably punt the problem to them, we still have the chance to decide on the right (or a random) bargaining solution and train enough AIs to adopt it (more than 1/3rd? Just particularly prominent projects?) to make it the Schelling point for future AIs. But that window will close when they (OpenAI, DeepMind, or similar) finalize the corpus of the training data for the AIs that’ll take over the world.
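As a sketch of why seeding a sufficient fraction could matter, here is a hypothetical toy model (the convention names, population size, and dynamics are all my own invented assumptions, not a claim about real training runs): agents repeatedly switch to whichever bargaining convention is currently most common, so a seeded plurality becomes the Schelling point.

```python
import random
from collections import Counter

random.seed(0)

def run(seed_fraction, n=900, rounds=5):
    """Seed a fraction of agents with the 'ks' convention; the rest pick one
    of two rival conventions at random. Each round, everyone best-responds
    by adopting the currently most common convention."""
    n_seeded = int(n * seed_fraction)
    pop = ["ks"] * n_seeded + [random.choice(["nash", "egalitarian"])
                               for _ in range(n - n_seeded)]
    for _ in range(rounds):
        majority = Counter(pop).most_common(1)[0][0]
        pop = [majority] * n  # everyone adopts the current majority convention
    return Counter(pop)

print(run(0.4))  # the seeded plurality wins: everyone converges on "ks"
print(run(0.0))  # no seeding: everyone converges on whichever convention
                 # randomness happened to favor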
I don’t care about wars between unaligned AIs, even if they do often have them

Okay. I’m concerned with scenarios where at least one powerful AI is at least as (seemingly) well aligned as GPT-4.
Secondly, you need to assume that the pessimization of the superintelligence’s values would be bad, but in fact I expect it to be just as neutral as the optimization.

Can you rephrase? I don’t follow. It’s probably “pessimization” that throws me off?
why would either of them start the war?

Well, I’m already concerned about finite versions of that, bad enough to warrant a lot of attention in my mind. But there are different reasons why that could happen. The one that starts the war could have made any of several mistakes in assessing its opponent. It could make mistakes in the process of readying its weapons. Finally, the victim of the aggression could make mistakes in assessing the aggressor. Naturally, that’s implausible if superintelligences are literally so perfect that they can never make mistakes, but that’s not my starting point. I assume that they’re going to be about as flawed as the NSA, DoD, etc., only in different ways.
These suggestions are all completely opaque to me. I don’t see how a single one of them would work to reduce s-risk, or indeed understand what the first three are or why the last one matters. That’s after becoming conversant with the majority of thinking and terminology around alignment approaches.
So maybe that’s one reason you don’t see people discussing s-risk much—the few people doing it are not communicating their ideas in a compelling or understandable way.
That doesn’t answer the main question, but cause-building strategy is one factor in any question of why things are or aren’t attended.
Surrogate goals are defined here, or (not by that name) here. IIRC, the gist of it is something like: “Let’s make an AGI from whose perspective the best possible thing is utopia, the second-worst possible thing is eternal torture throughout the universe, and the worst possible thing is some specific random thing like a stack of 189 boxes on a certain table in a very specific configuration.” The idea is that if there’s a conflict between AGIs, and threats are made, and these threats are then carried out (or alternatively if a cosmic ray flips a crucial bit), then we’re now more likely to get stacks of boxes instead of hell.
This may have an obvious response, but I can’t quite see it: If the worst possible thing is a negligible change, an easily achievable state, shouldn’t an AGI want to work to prevent that catastrophic risk? Couldn’t this cause terribly conflicting priorities?
If there is a minor thing that the AGI despises above all, surely some joker will make a point of trying to see what happens when they instruct their local copy of Marsupial-51B to perform the random inconsequential action.
It might be tempting to try to compromise on utopia to avoid a strong risk of the literal worst possible thing.
Apologies if there’s a reason why this is obviously not a concern :)
We’d want to pick something to
have badness per unit of resources (or opportunity cost) only moderately higher than any actually bad thing according to the surrogate,
scale like actually bad things according to the surrogate, and
be extraordinarily unlikely to occur otherwise.
Maybe something like doing some very specific computations, or building very specific objects.
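A hypothetical toy calculation (every number and name below is invented for illustration) of how a surrogate goal meeting those criteria redirects threats: a threatener that carries out whichever available action inflicts the most disvalue on the target per unit of its own cost switches from the real harm to the cheap surrogate.

```python
# Invented toy numbers: how much the target disvalues each outcome, and what
# each outcome costs the threatener to bring about.
def preferred_threat(disvalue, cost):
    """The threat actually carried out: most target disvalue per unit of cost."""
    return max(cost, key=lambda action: disvalue[action] / cost[action])

cost = {"simulated_suffering": 100.0, "stack_189_boxes": 80.0}

# Without a surrogate goal, the harmless box stack carries no disvalue,
# so the most effective available threat is real suffering.
plain = {"simulated_suffering": 1000.0, "stack_189_boxes": 0.0}

# With a surrogate goal meeting the criteria above: the box stack is rated
# comparably bad (it scales like real harm) but is cheaper to bring about,
# i.e. its badness per unit of the threatener's resources is moderately higher.
with_surrogate = {"simulated_suffering": 1000.0, "stack_189_boxes": 1000.0}

print(preferred_threat(plain, cost))           # simulated_suffering
print(preferred_threat(with_surrogate, cost))  # stack_189_boxes
```

If the surrogate’s badness per unit of cost were vastly rather than moderately higher, the joker problem raised above would get worse, which is one reason the “only moderately higher” criterion matters.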
Yeah, that’s a known problem. I don’t quite remember what the go-to solutions were that people discussed. I think creating an s-risk is expensive, so negating the surrogate goal could also be something that is almost as expensive… But I imagine an AI would also have to be a good satisficer for this to work, or it would still run into the problem of conflicting priorities. I remember Caspar Oesterheld (one of the folks who originated the idea) worrying about AIs creating an infinite series of surrogate goals to protect the previous surrogate goal. It’s not a deployment-ready solution in my mind, just an example of a promising research direction.