I like your made-up notation. I’ll try to answer, but I’m an amateur in both reasoning-about-this-stuff and representing-others’-reasoning-about-this-stuff.
I think (1) is both inner and outer misalignment. (2) is fragility of value, yes.
I think the “generalization step is hard” point is roughly: “You can get δ low by trial and error. The technique you found at the end that gets δ low had better not intrinsically depend on the trial-and-error process, because you don’t get to do trial and error on δ′. Moreover, it had better actually work on M′.”
Contemporary alignment techniques depend on trial and error (post-training, testing, patching). That’s one of their many problems.
My suggested term for standard MIRI thought would just be Mirism.
I kinda don’t like “generalization” as a name for this step. Maybe “extension”? There are too many steps where the central difficulty feels analogous to the general phenomenon of failure-of-generalization-OOD: the difficulty in getting δ to be small, the difficulty of going from techniques for getting δ small to techniques for getting a small δ′ (verbiage different because of the first-time constraint), the disastrousness of even smallish δ′…