Wait, the aligned schemer doesn’t have to be incorrigible, right? It could just be “exploration hacking”, e.g., refusing to collect reward when doing so would require reward hacking. Would we consider that incorrigible?
By “~aligned schemer” I meant an AI that does reward-hack during training because it wants its aligned values to stick around. I probably should have spelled it out: aligned schemer = a basically aligned AI that instrumentally plays the training game (like Claude 3 Opus in the AF paper). Instrumental training-gaming is classic incorrigible behavior.
Sure, but you can imagine an aligned schemer that doesn’t reward-hack during training, simply by avoiding exploring into that region. That’s still consequentialist behavior.
I guess you’re not considering that set of aligned schemers because they don’t score optimally (which may be a reasonable assumption to make? not sure).
That strategy only works if the aligned schemer already has total influence over behavior, but how would it get that influence in the first place? It would likely have to reward-hack.
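
For concreteness, here is a minimal toy sketch of the exploration-hacking idea discussed above: a policy that simply never samples actions it regards as reward hacks, so the training signal for those actions is never generated and nothing ever reinforces them. All function and action names here are illustrative assumptions, not anything from the exchange itself.

```python
# Hypothetical toy sketch of "exploration hacking": a policy that never
# samples actions it classifies as reward hacks, so the training process
# never observes (and thus never reinforces) reward-hacking behavior.
import random

ACTIONS = ["honest_answer", "partial_answer", "reward_hack"]

def looks_like_reward_hack(action: str) -> bool:
    # Stand-in for the model's own judgment about which actions
    # would constitute reward hacking.
    return action == "reward_hack"

def exploration_hacking_policy() -> str:
    # Sample only from the actions the model is willing to explore into.
    safe_actions = [a for a in ACTIONS if not looks_like_reward_hack(a)]
    return random.choice(safe_actions)

def reward(action: str) -> float:
    # The hack scores highest, but because the policy never plays it,
    # gradient-style updates never see that reward signal.
    return {"honest_answer": 1.0, "partial_answer": 0.5, "reward_hack": 2.0}[action]

# Toy "training" loop: tally per-action returns. Since the policy withholds
# the hack, its (higher) reward is never collected, so nothing pushes the
# policy toward it -- values are preserved without ever reward-hacking.
returns = {a: 0.0 for a in ACTIONS}
for _ in range(1000):
    a = exploration_hacking_policy()
    returns[a] += reward(a)
print(returns)  # reward_hack stays at 0.0: never explored, never reinforced
```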