Sure, but you can imagine an aligned schemer that doesn't reward hack during training simply by avoiding exploring into that region? That's still consequentialist behavior.
I guess maybe you're not counting that set of aligned schemers because they don't score optimally (which may be a reasonable assumption to make? not sure).
That strategy only works if the aligned schemer already has total influence over behavior, but how would it get that influence in the first place? It would likely have to reward-hack.