When I think of useful concepts in AI alignment that I frequently refer to, there are a bunch from the olden days (e.g. “instrumental convergence”, “treacherous turn”, …), a bunch of idiosyncratic ones that I made up myself for my own purposes, and just a few others, one of which is “concept extrapolation”. For example, I talk about it here. (Others in that last category include “goal misgeneralization” [here’s how I use the term], which is related to concept extrapolation, and “inner and outer alignment” [here’s how I use the term].)
So anyway, in the context of the 2022 Review, I would be sad if there were a compilation of intellectual progress in AI alignment on LessWrong that made no mention of “concept extrapolation” (or its earlier name, “model splintering”). This post seems like the best introduction. I gave it my highest vote.