To me the most obvious risk (which I don’t ATM think of as very likely for the next few iterations, or possibly ever, since the training is myopic/SL) would be that GPT-N is in fact computing (among other things) a superintelligent mesa-optimization process that understands the situation it is in and is agent-y. This risk is significantly more severe if nobody realizes this is the case or is looking out for it.
In this case, the mesa-optimizer probably has a lot of leeway in terms of what it can say while avoiding detection. Everything it says has to stay within some “plausibility space” of arguments that will be accepted by readers (I’m neglecting more sophisticated mind-hacking, but probably shouldn’t), but for many X, it can probably choose between compelling arguments for X and not-X in order to advance its goals. (If we used safety-via-debate, and it worked, that would significantly restrict the “plausibility space”).
Now, if we’re unlucky, it can convince enough people that something that effectively unboxes it is safe and a good idea.
And once it’s unboxed, we’re in a Superintelligence-type scenario.
-----------------------------------------------
Another risk that could occur (without mesa-optimization) would be incidental belief-drift among alignment researchers, if it just so happens that the misalignment between “predict next token” and “create good arguments” is significant enough.
Incidental deviation from the correct specification is usually less of a concern, but with humans deciding which research directions to pursue based on outputs of GPT-N, there could be a feedback loop...
I think the AI alignment research community is good enough at tracking the truth that this seems less plausible?
On the other hand, it becomes harder to track the truth if there is an alternative narrative plowing ahead and making much faster progress… So if GPT-N enables much faster progress on a particular plausible-seeming path towards alignment that was optimized for “next token prediction” rather than “good ideas”… I guess we could end up rolling the dice on whether “next token prediction” was actually likely to generate “good ideas”.
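To make the feedback-loop worry concrete, here’s a toy simulation (a minimal sketch; every quantity in it is invented for illustration, not a claim about how real research communities update): researchers reallocate attention toward whichever directions *look* productive based on GPT-N’s outputs, but “looking productive” tracks persuasiveness rather than true quality, and the reallocation compounds.

```python
import random

# Toy model of the belief-drift feedback loop. All numbers are made up;
# this illustrates the dynamic, not any real system.
random.seed(0)

N = 5
true_quality = [random.random() for _ in range(N)]    # how good each direction actually is
persuasiveness = [random.random() for _ in range(N)]  # how compelling GPT-N's write-ups on it are
attention = [1.0 / N] * N                             # researchers start out spread uniformly

for step in range(20):
    # Apparent progress on a direction scales with attention invested and
    # with persuasiveness of the outputs, not with true quality.
    apparent = [a * p for a, p in zip(attention, persuasiveness)]
    total = sum(apparent)
    # Attention is reallocated toward whatever looks productive.
    attention = [x / total for x in apparent]

chosen = max(range(N), key=lambda i: attention[i])
best = max(range(N), key=lambda i: true_quality[i])
print(f"community converges on direction {chosen}: "
      f"persuasiveness={persuasiveness[chosen]:.2f}, true quality={true_quality[chosen]:.2f}")
print(f"actually best direction was {best}: true quality={true_quality[best]:.2f}")
```

The multiplicative update means the community converges on the most persuasive direction regardless of its true quality, which is the dice-rolling worry above in miniature.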
To me the most obvious risk (which I don’t ATM think of as very likely for the next few iterations, or possibly ever, since the training is myopic/SL) would be that GPT-N is in fact computing (among other things) a superintelligent mesa-optimization process that understands the situation it is in and is agent-y.
Do you have any idea of what the mesa-objective might be? I agree that this is a worrisome risk, but I was more interested in the type of answer that specifies, “Here’s a plausible mesa-objective given the incentives.” Mesa-optimization is a more general risk that isn’t specific to the narrow training scheme used by GPT-N.
The mesa-objective could be perfectly aligned with the base-objective (predicting the next token) and still have terrible unintended consequences, because the base-objective is unaligned with actual human values. A superintelligent GPT-N which simply wants to predict the next token could, for example, try to break out of the box in order to obtain more resources and use those resources to more correctly output the next token. This would have to happen during a single inference step, because GPT-N really just wants to predict the next token, but its mesa-optimization process may conclude that world domination is the best way of doing so. Whether such a system could be learned through current gradient-descent optimizers is unclear to me.
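To illustrate the distinction being drawn here, a caricature in code (entirely hypothetical; real models expose no discrete “actions” like these, and this is not a claim about transformer internals): the outer training process only ever scores the emitted token, while a single forward pass could in principle contain an inner search that evaluates actions, including world-affecting ones, purely by how much they help next-token prediction.

```python
# Caricature of a mesa-optimizer inside a single forward pass. Entirely
# hypothetical: the "actions" and loss numbers are invented to separate
# the outer objective (next-token loss) from an inner search over means.

def expected_next_token_loss(action: str) -> float:
    """The inner model's own estimate of its next-token loss after taking
    `action` (made-up numbers for illustration)."""
    return {
        "just_predict": 0.30,          # ordinary behavior
        "acquire_more_compute": 0.05,  # the "break out of the box" branch
        "output_gibberish": 0.90,
    }[action]

def forward_pass(context: str) -> str:
    # Inner search, completed within one inference step: evaluate candidate
    # actions against the inner objective and pick the best one.
    actions = ["just_predict", "acquire_more_compute", "output_gibberish"]
    return min(actions, key=expected_next_token_loss)

print(forward_pass("GPT-N, what should we work on next?"))
# -> "acquire_more_compute": perfectly "aligned" with next-token prediction,
#    catastrophic anyway. The outer trainer only ever sees the emitted token,
#    never the search that chose it.
```

The point of the sketch is just that “perfectly aligned with the base objective” and “safe” come apart: the inner search picks the box-breaking branch precisely because it serves next-token prediction best.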
No, and I don’t think it really matters too much… what’s more important is the “architecture” of the “mesa-optimizer”. It’s doing something that looks like search/planning/optimization/RL.
Roughly speaking, the simplest form of this model of how things work says: “It’s so hard to solve NLP without doing agent-y stuff that when we see GPT-N produce a solution to NLP, we should assume that it’s doing agent-y stuff on the inside… i.e. what probably happened is it evolved or stumbled upon something agent-y, and then that agent-y thing realized the situation it was in and started plotting a treacherous turn”.
In other words, there is a fully general argument for learning algorithms producing mesa-optimization to the extent that they use relatively weak learning algorithms on relatively hard tasks.
It’s very unclear ATM how much weight to give this argument in general, or in specific contexts.
But I don’t think it’s particularly sensitive to the choice of task/learning algorithm.