Koen.Holtman comments on Deceptive Alignment

Koen.Holtman 20 Aug 2019 14:31 UTC
1 point
Robust alignment through corrigibility. Information about the base objective is incorporated into the mesa-optimizer’s epistemic model and its objective is modified to “point to” that information. This situation would correspond to a mesa-optimizer that is corrigible(25) with respect to the base objective (though not necessarily the programmer’s intentions).
The use of the term corrigibility above, with its citation of (25), is somewhat confusing to me. While the mesa-optimiser described does have a certain type of corrigibility, I would not consider it to be corrigible according to the criteria defined in (25). See the comment section here for a longer discussion about this topic.