Take: Exploration hacking should not be used as a synonym for deceptive alignment.
(I have observed one such usage)
Deceptive alignment may be a very particular kind of exploration hacking, but the term exploration hacking (without further specification) should refer to a model deliberately sandbagging “intermediate” capabilities during RL training in order to avoid learning a “full” capability.