Agreed. E.g., a model that is corrigible and fairly aligned, but knows there are some imperfections in its alignment that the humans wouldn't want, and intentionally acts in a way where gradient descent will fix those imperfections. Seems like it's doing gradient hacking while also, in some meaningful sense, being aligned.
I was mostly thinking of models that are misaligned but not deceptively aligned.