Agreed. E.g., a model that is corrigible and fairly aligned, but knows there are some imperfections in its alignment that the humans wouldn't want, and intentionally acts in a way where gradient descent will fix those imperfections. Seems like it's doing gradient hacking while also, in some meaningful sense, being aligned.
I was mostly thinking of models that are misaligned but not deceptively aligned.