Some promising interventions against s-risks that I’m aware of are:
Figure out what’s going on with bargaining solutions. Nash, Kalai, or Kalai-Smorodinsky? Is there one that is privileged in some impartial way? (A toy comparison is sketched right after this list.)
Is there some sort of “leader election” algorithm over bargaining solutions?
Do surrogate goals work, are they cooperative enough?
Will neural-net-based AIs be comprehensible to each other? If so, what does open-source game theory say about how conflicts will play out? (There’s a toy program-equilibrium sketch below, after the bargaining example.)
And of course CLR’s research agenda.
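To make the bargaining-solution bullet a bit more concrete, here’s a minimal sketch. Everything in it is invented for illustration, and the Kalai-Smorodinsky function is only a crude discrete approximation of the real construction; the point is just that different solution concepts can pick different agreement points from the same feasible set, which is why it might matter which one ends up as the Schelling point.

```python
# Toy comparison of two bargaining solutions over a finite menu of feasible
# agreements. All numbers are invented; the Kalai-Smorodinsky part is a
# discrete approximation (maximize the minimum gain normalized by each
# player's best feasible payoff), not the textbook construction on a convex set.

feasible = [(10, 1), (9, 4), (5, 5), (1, 10)]  # (utility_A, utility_B) per agreement
d = (0, 0)                                     # disagreement point

def nash_solution(points, disagreement):
    """Nash: maximize the product of gains over the disagreement point."""
    dx, dy = disagreement
    candidates = [p for p in points if p[0] >= dx and p[1] >= dy]
    return max(candidates, key=lambda p: (p[0] - dx) * (p[1] - dy))

def kalai_smorodinsky_solution(points, disagreement):
    """KS-style: maximize the smaller of the two gains, each normalized by
    that player's ideal (best feasible) gain."""
    dx, dy = disagreement
    ideal_x = max(p[0] for p in points)
    ideal_y = max(p[1] for p in points)
    def min_normalized_gain(p):
        return min((p[0] - dx) / (ideal_x - dx), (p[1] - dy) / (ideal_y - dy))
    return max(points, key=min_normalized_gain)

print(nash_solution(feasible, d))                # -> (9, 4)
print(kalai_smorodinsky_solution(feasible, d))   # -> (5, 5)
```

In this toy menu the Nash product favors the lopsided (9, 4) while the Kalai-Smorodinsky-style criterion picks (5, 5); that divergence is the kind of thing the “is one privileged in some impartial way?” question is about.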
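And for the open-source game theory bullet, here’s a minimal sketch of the classic “program equilibrium” setup, under the toy assumption that agents can literally read and exactly verify each other’s source code before moving. Real neural nets obviously won’t be mutually transparent in this clean way; the sketch is only meant to show the kind of question that line of research asks. The payoffs and strategy names are made up.

```python
# Toy open-source prisoner's dilemma: each player is a program that receives the
# other player's source code before choosing Cooperate ("C") or Defect ("D").
# The classic trick: cooperate if and only if the opponent's source is identical
# to your own, which makes mutual cooperation stable among copies of one program.

PAYOFFS = {  # (my_move, their_move) -> (my_payoff, their_payoff)
    ("C", "C"): (3, 3),
    ("C", "D"): (0, 5),
    ("D", "C"): (5, 0),
    ("D", "D"): (1, 1),
}

CLIQUE_BOT = """
def strategy(my_source, their_source):
    # Cooperate only with an exact copy of myself.
    return "C" if their_source == my_source else "D"
"""

DEFECT_BOT = """
def strategy(my_source, their_source):
    return "D"
"""

def load(source):
    """Compile a strategy's source code and return its strategy function."""
    namespace = {}
    exec(source, namespace)
    return namespace["strategy"]

def play(source_a, source_b):
    move_a = load(source_a)(source_a, source_b)
    move_b = load(source_b)(source_b, source_a)
    return PAYOFFS[(move_a, move_b)]

print(play(CLIQUE_BOT, CLIQUE_BOT))   # (3, 3): mutual cooperation between copies
print(play(CLIQUE_BOT, DEFECT_BOT))   # (1, 1): the transparent cooperator isn't exploited
print(play(DEFECT_BOT, CLIQUE_BOT))   # (1, 1)
```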
Interpretability research is probably interesting for both, but otherwise I don’t see a lot of overlap in research topics. Maybe comparing the CLR research agenda against an alignment research agenda could help quantify the overlap a bit more. (There is probably a lot of transferability in the required skills though – game theory, ML, etc.)
I don’t see how any of these actually help reduce s-risk. Like, if we know some bargaining solutions lead to terrible outcomes for everyone and others lead to everyone being super happy, so what? It’s not like we can tremendously influence the bargaining solution our AI & those it meets settle on after reflection.
In the tractability footnote above I make the case that it should be at least vastly easier than influencing the utility functions of all AIs to make alignment succeed.
Yeah, I expect that if you make a superintelligence it won’t need humans to tell it the best bargaining math it can use. You are trying to do better than a superintelligence at a task it is highly incentivized to be good at, so you are not going to beat the superintelligence.
Secondly, you need to assume that the pessimization of the superintelligence’s values would be bad, but in fact I expect it to be just as neutral as the optimization.
I don’t care about wars between unaligned AIs, even if they do often have them. Their values will be completely orthogonal to my own, so their inverses will be too. Even in wars between aligned and unaligned humans (Hitler, for example), suffering that I would trade the world to stop does not happen.
Also, wars end; it’d be very weird if you got two AIs warring with each other for eternity. If both knew this was the outcome (or placed some amount of probability on it), why would either of them start the war?
People worried about s-risks should be worried about some kinds of partial alignment solutions, where you get the AI aligned enough to care about keeping humans (or other morally relevant things) around, but not aligned enough to care whether they’re happy (or whether any of a number of other values are satisfied), so you get a bunch of things that can feel pain stuck in moderate pain for eternity.
“I expect that if you make a superintelligence it won’t need humans to tell it the best bargaining math it can use”
I’m not a fan of idealizing superintelligences. 10+ years ago that was the only way to infer any hard information about worst-case scenarios. Assume perfect play from all sides, and you end up with a fairly narrow game tree that you can reason about. But now it’s a pretty good guess that superintelligences will be more advanced successors of GPT-4 and such. That tells us a lot about the sorts of training regimes through which they might learn bargaining, and what sorts of bargaining solutions they might completely unreflectively employ in specific situations. We can reason about what sorts of training regimes will instill which decision theories in AIs, so why not the same for bargaining?
If we think we can punt the problem to them, then we need to make sure they reflect on how they bargain and on the game-theoretic implications of that. We may want to train them to seek out gains from trade, as is useful in a generally cooperative environment, rather than to seek out exploits, as would be useful in a more hostile environment.
If we find that we can’t reliably punt the problem to them, we still have the chance now to decide on the right (or a random) bargaining solution and train enough AIs to adopt it (more than a third? Just particularly prominent projects?) to make it the Schelling point for future AIs. But that window will close when they (OpenAI, DeepMind, vel sim.) finalize the corpus of training data for the AIs that’ll take over the world.
“I don’t care about wars between unaligned AIs, even if they do often have them”
Okay. I’m concerned with scenarios where at least one powerful AI is at least as (seemingly) well aligned as GPT-4.
“Secondly, you need to assume that the pessimization of the superintelligence’s values would be bad, but in fact I expect it to be just as neutral as the optimization.”
Can you rephrase? I don’t follow. It’s probably “pessimization” that throws me off?
“why would either of them start the war?”
Well, I’m already concerned about finite versions of that. Bad enough to warrant a lot of attention in my mind. But there are different reasons why that could happen. The one that starts the war could’ve made any of a couple different mistakes in assessing their opponent. It could make mistakes in the process of readying its weapons. Finally, the victim of the aggression could make mistakes assessing the aggressor. Naturally, that’s implausible if superintelligences are literally so perfect that they cannot make mistakes ever, but that’s not my starting point. I assume that they’re going to be about as flawed as the NSA, DoD, etc., only in different ways.
These suggestions are all completely opaque to me. I don’t see how a single one of them would work to reduce s-risk, or indeed understand what the first three are or why the last one matters. That’s after becoming conversant with the majority of thinking and terminology around alignment approaches.
So maybe that’s one reason you don’t see people discussing s-risk much: the few people doing it are not communicating their ideas in a compelling or understandable way.
That doesn’t answer the main question, but cause-building strategy is one factor in any question of why things are or aren’t attended to.
Surrogate goals are defined here, or (not by that name) here. IIRC, the gist of it is something like: “let’s make an AGI from whose perspective the best possible thing is utopia, the second-worst possible thing is eternal torture throughout the universe, and the worst possible thing is some specific random thing like a stack of 189 boxes on a certain table in a very specific configuration.” Then the idea is that if there’s a conflict between AGIs, and threats are made, and these threats are then carried out (or alternatively if a cosmic ray flips a crucial bit), then we’re now more likely to get stacks of boxes instead of hell.
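To restate that gist as a (very) toy sketch, with invented numbers and outcome names: if threats get carried out, the threatener aims at whatever the victim’s utility function ranks as worst, so deliberately ranking a harmless “surrogate” outcome below the genuinely terrible one redirects carried-out threats toward the harmless thing.

```python
# Toy illustration of the surrogate-goal idea (all numbers and names invented).

original_utility = {
    "utopia": 100,
    "status_quo": 0,
    "universe_wide_torture": -1000,   # worst thing under the original values
}

surrogate_utility = {
    "utopia": 100,
    "status_quo": 0,
    "universe_wide_torture": -1000,
    "stack_of_189_boxes": -1001,      # harmless outcome, deliberately ranked as the worst
}

def carried_out_threat(victim_utility):
    """A threatener who follows through picks the outcome the victim hates most."""
    return min(victim_utility, key=victim_utility.get)

print(carried_out_threat(original_utility))    # -> universe_wide_torture
print(carried_out_threat(surrogate_utility))   # -> stack_of_189_boxes
```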
This may have an obvious response, but I can’t quite see it: If the worst possible thing is a negligible change, an easily achievable state, shouldn’t an AGI want to work to prevent that catastrophic risk? Couldn’t this cause terribly conflicting priorities?
If there is a minor thing that the AGI despises above all, surely some joker will make a point of trying to see what happens when they instruct their local copy of Marsupial-51B to perform the random inconsequential action.
It might be tempting to try to compromise on utopia to avoid a strong risk of the literal worst possible thing.
Apologies if there’s a reason why this is obviously not a concern :)
We’d want to pick something to
have badness per unit of resources (or opportunity cost) only moderately higher than any actually bad thing according to the surrogate,
scale like actually bad things according to the surrogate, and
be extraordinarily unlikely to occur otherwise.
Maybe something like doing some very specific computations, or building very specific objects.
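On one way of cashing those criteria out (I may be reading them slightly differently than intended), a toy encoding might look like this; every constant is invented:

```python
# Toy encoding of the three criteria for a candidate surrogate outcome, evaluated
# under the surrogate-modified utility function. All constants are invented.

REAL_BAD_PER_RESOURCE = 10.0       # disutility per unit of adversary resources, genuinely bad thing
SURROGATE_PER_RESOURCE = 12.0      # criterion 1: only moderately higher per unit of resources

def real_bad_disutility(resources_spent):
    return REAL_BAD_PER_RESOURCE * resources_spent

def surrogate_disutility(resources_spent):
    # criterion 2: scales with resources spent the same way genuinely bad things do
    return SURROGATE_PER_RESOURCE * resources_spent

# criterion 3: the surrogate itself (e.g. running one very specific computation)
# should essentially never happen unless a threat is deliberately being carried out
ACCIDENTAL_OCCURRENCE_PROBABILITY = 1e-30
```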
Yeah, that’s a known problem. I don’t quite remember what the go-to solutions were that people discussed. I think creating an s-risk is expensive, so negating the surrogate goal could also be something that is almost as expensive… But I imagine an AI would also have to be a good satisficer for this to work, or it would still run into the problem of conflicting priorities. I remember Caspar Oesterheld (one of the folks who originated the idea) worrying about AIs creating infinite series of surrogate goals to protect the previous surrogate goal. It’s not a deployment-ready solution in my mind, just an example of a promising research direction.