Past this point, we assume, following Ajeya Cotra, that a strategically aware system which performs well enough to receive perfect human-provided external feedback has probably learned a deceptive human-simulating model instead of the intended goal. The later techniques have the potential to address this failure mode. (It is possible that such a system would still underperform on sufficiently superhuman behavioral evaluations.)
There are (IMO) plausible threat models in which alignment is very difficult but we don’t need to encounter deceptive alignment. Consider the following scenario:
Our alignment techniques (whatever they are) scale pretty well, as far as we can measure, even up to well-beyond-human-level AGI. However, in the year (say) 2100, the tails come apart. It gradually becomes clear that what we want our powerful AIs to do and what they actually do does not generalize that well outside of the distribution on which we have been testing them so far. At this point, it is too late to roll them back, e.g. because the AIs have become incorrigible and/or power-seeking. The scenario may also have a more systemic character, with AI having already been so tightly integrated into the economy that there is no “undo button”.
This doesn’t assume either the sharp left turn or deceptive alignment, but I’d put it at least at level 8 in your taxonomy.
I’d put the scenario from Karl von Wendt’s novel VIRTUA into this category.