Surrogate goals are defined here, or (not by that name) here. IIRC, the gist of it is something like: “let’s make an AGI from whose perspective the best possible thing is utopia, the second-worst possible thing is eternal torture throughout the universe, and the worst possible thing is some specific random thing, like a stack of 189 boxes on a certain table in a very specific configuration.” Then the idea is that if there’s a conflict between AGIs, and threats are made, and those threats are then carried out (or, alternatively, if a cosmic ray flips a crucial bit), we’re now more likely to get stacks of boxes instead of hell.
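To make the ordering concrete, here is a toy sketch (my own illustration, not from the linked posts; all names and numbers are hypothetical): a utility table in which the cheap surrogate outcome is ranked strictly below the actually catastrophic one, so a threatener choosing the most cost-effective punishment picks the surrogate.

```python
# Toy model of a surrogate goal. All names and numbers are made up;
# only the ordering and the cost asymmetry matter for the argument.

surrogate_utility = {
    "utopia": 1.0,
    "status_quo": 0.0,
    "universal_torture": -1.0,   # second-worst: the genuinely bad outcome
    "box_stack_189": -1.01,      # worst: the cheap, meaningless surrogate
}

# Rough resource cost for a threatener to bring each punishment about.
threat_cost = {
    "universal_torture": 1e9,
    "box_stack_189": 1.0,
}

def cheapest_effective_threat(utility, cost):
    """Pick the punishment that inflicts the most disutility per unit cost."""
    return max(cost, key=lambda outcome: -utility[outcome] / cost[outcome])

print(cheapest_effective_threat(surrogate_utility, threat_cost))
# -> 'box_stack_189': carried-out threats hit the surrogate, not hell.
```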
This may have an obvious response, but I can’t quite see it: If the worst possible thing is a negligible change, an easily achievable state, shouldn’t an AGI want to work to prevent that catastrophic risk? Couldn’t this cause terribly conflicting priorities?
If there is a minor thing that the AGI despises above all, surely some joker will make a point of trying to see what happens when they instruct their local copy of Marsupial-51B to perform the random inconsequential action.
It might be tempting to try to compromise on utopia to avoid a strong risk of the literal worst possible thing.
Apologies if there’s a reason why this is obviously not a concern :)
We’d want to pick something to:
- have badness per unit of resources (or opportunity cost) only moderately higher than any actually bad thing according to the surrogate,
- scale like actually bad things according to the surrogate, and
- be extraordinarily unlikely to occur otherwise.
Maybe something like doing some very specific computations, or building very specific objects.
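As a toy check of the first two criteria (my own sketch; the markup constant is an assumed free parameter), the surrogate’s disutility can scale with resources spent exactly like real harm does, just a moderate constant factor steeper, so the surrogate is always the preferred target of a carried-out threat without handing the threatener extra leverage:

```python
# Hypothetical scaling sketch: surrogate badness tracks real harm linearly
# in resources, with a moderate markup. Constants are illustrative only.

REAL_HARM_PER_UNIT = 1.0   # disutility of a real bad thing per unit resource
SURROGATE_MARKUP = 1.2     # "only moderately higher" -- assumed value

def real_disutility(resources: float) -> float:
    return REAL_HARM_PER_UNIT * resources

def surrogate_disutility(resources: float) -> float:
    # Same linear scaling as real harm, moderately worse per unit, so a
    # threatener spending R resources always does better targeting the
    # surrogate, while the AGI's incentives around real harms are unchanged.
    return SURROGATE_MARKUP * real_disutility(resources)

for r in (1.0, 10.0, 100.0):
    assert surrogate_disutility(r) > real_disutility(r)
```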
Yeah, that’s a known problem. I don’t quite remember what the go-to solutions were that people discussed. I think creating an s-risk is expensive, so negating the surrogate goal could also be made something that is almost as expensive… But I imagine an AI would also have to be a good satisficer for this to work, or it would still run into the problem of conflicting priorities. I remember Caspar Oesterheld (one of the folks who originated the idea) worrying about an AI creating an infinite series of surrogate goals to protect the previous surrogate goal. It’s not a deployment-ready solution in my mind, just an example of a promising research direction.