Distributional shift: The worry is precisely that capabilities will generalize better than goals across the distributional shift. If capabilities didn’t generalize, we’d be fine. But as the CoinRun agent exemplifies, after a distributional shift you can get AIs that capably pursue a different objective than the one you were hoping for. One difference from deception is that models which simply become incompetent after a distributional shift are in fact quite plausible. But to the extent that we expect goal misgeneralization specifically, the underlying worry again seems to be that capabilities will be robust while alignment will not.
One thing to flag is that even if, for any given model, the probability of capabilities generalizing is very low, the total probability of doom can still be high, since there may be many tries at getting models that generalize well across distributional shifts, whereas the selection pressure toward alignment robustness is comparatively weaker (see the toy sketch after the quadrant below). You can imagine a 2x2 quadrant of capabilities vs. alignment generalizability across distributional shift:
Capabilities don’t generalize, alignment doesn’t: irrelevant
Capabilities don’t generalize, alignment does: irrelevant
Capabilities generalize, alignment doesn’t: potentially very dangerous, especially if the agent is power-seeking. The agent (or the agent and its friends) acquires more power and may attempt a takeover.
Capabilities generalize, alignment does: good, but not clearly great. By default I wouldn’t expect such an AI to be power-seeking (unless you’re deliberately creating a sovereign), so it only has as much power as humans allow it. So the AI risks being outcompeted by its more nefarious peers.
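To make the “many tries” point concrete, here is a toy back-of-the-envelope sketch, not a claim about real-world numbers: if each of N independently trained models has some small probability p of its capabilities generalizing across the shift, the chance that at least one does is 1 - (1 - p)^N, which climbs toward 1 as N grows. The independence assumption and the constant p are simplifications introduced purely for illustration.

```python
# Toy illustration of the "many tries" point (assumed numbers, not real estimates):
# if each of N independently trained models has probability p of its capabilities
# generalizing across a distributional shift, the chance that at least one does
# is 1 - (1 - p)**N. Independence and a constant p are simplifying assumptions.

def p_at_least_one(p: float, n: int) -> float:
    """Probability that at least one of n independent tries succeeds."""
    return 1 - (1 - p) ** n

for p in (0.001, 0.01):
    for n in (10, 100, 1000):
        print(f"p={p}, N={n}: P(at least one) = {p_at_least_one(p, n):.3f}")
```

Even with p = 0.01, a thousand independent attempts make at least one capabilities-generalizing model close to certain, which is why the asymmetry in selection pressure matters more than the per-model probability.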