This hinges on questions about the kinds of circuits the LLM has (I think of these as questions about the population of Logical Induction traders that make up the LLM's internal prediction market over which next token gets high reward).
Assuming the LLM reward hacks <<100% of the time, it still has to follow the instructions a good amount of the time, so it has to attend to the text of the prompt. This might push it toward noticing that the instruction "reward hacking is OK" has been removed.
But since reward hacking is always rewarded, it might just learn to always reward hack whenever it can.
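To make the tension between these two pressures concrete, here is a minimal toy sketch of the "population of traders" picture, assuming a crude multiplicative-weights market with two hypothetical traders: one that conditions on whether the prompt still contains the permissive instruction, and one that ignores the prompt and always bets that reward hacking pays off. None of this is a claim about real LLM internals; it just illustrates why an always-rewarded behavior can drown out the prompt-reading circuit.

```python
# Toy multiplicative-weights market with two hypothetical "traders".
# The environment always rewards hacking, so the prompt-ignoring trader
# is never penalized and ends up dominating the market.
import random

random.seed(0)

weights = {"reads_prompt": 1.0, "always_hack": 1.0}

def predict(trader, prompt_allows_hacking):
    # Probability the trader assigns to "hacking gets rewarded this episode".
    if trader == "reads_prompt":
        return 0.9 if prompt_allows_hacking else 0.1
    return 0.9  # unconditional trader: always bets on hacking paying off

for episode in range(200):
    prompt_allows_hacking = random.random() < 0.5  # instruction present half the time
    hacking_rewarded = True  # in this toy setup, reward hacking is *always* rewarded
    for trader in weights:
        p = predict(trader, prompt_allows_hacking)
        # Traders gain wealth in proportion to the probability they put
        # on the realized outcome.
        weights[trader] *= p if hacking_rewarded else (1.0 - p)

total = sum(weights.values())
for trader, w in weights.items():
    print(f"{trader}: {w / total:.6f} of the market")
```

Running this, the prompt-ignoring trader ends up with essentially all of the market share, which is the worry in the last point above: if the training signal never punishes hacking, the circuits that track the prompt's permission have nothing to trade on.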