This is inevitable as language models increase in size and ascend the benchmarks.
This is because, to accurately answer tricky questions, language models must be able to create robust models of a frame of reality based on initial premises. When answering something like an SAT question, correctness depends on isolating the thread of signal in the question/answer relationship, and ignoring the noise. The better it gets at this, the better a language model can isolate some functional relationship between question and answer and ignore the noise.
Reward hacking is just an extension of this. Basically, if a language model is tasked with performing some task to reach some desired outcome, the naive, small language model approach would be to do stuff that is semantically, thematically, logically adjacent to the question until it gets closer to the answer (the “stoichastic parrot” way of doing things). E.g., a model is tasked with some programming task, it will start by doing “good programmer”-seeming stuff, dotting t’s, crossing i’s, etc.
A non-naive model which can more clearly model the problem/answer space and isolate it from the noise would be much more capable at determining that the line between the problem and desired solution state which involves “not doing what is implied, but is technically feasible within the domain of control given and reaches the solution state”. To create a working scheme that follows this pattern, the model must be able to model all the ways in which the reward hacking method is distinct from the “intended” way of arriving at an answer, but importantly not distinct with respect to what is relevant to the solution, despite lying far from the semantic space of stuff that is similar to a non-deceptive, solution.
This is inevitable as language models increase in size and ascend the benchmarks.
This is because, to accurately answer tricky questions, language models must be able to create robust models of a frame of reality based on initial premises. When answering something like an SAT question, correctness depends on isolating the thread of signal in the question/answer relationship, and ignoring the noise. The better it gets at this, the better a language model can isolate some functional relationship between question and answer and ignore the noise.
Reward hacking is just an extension of this. Basically, if a language model is tasked with performing some task to reach some desired outcome, the naive, small language model approach would be to do stuff that is semantically, thematically, logically adjacent to the question until it gets closer to the answer (the “stoichastic parrot” way of doing things). E.g., a model is tasked with some programming task, it will start by doing “good programmer”-seeming stuff, dotting t’s, crossing i’s, etc.
A non-naive model which can more clearly model the problem/answer space and isolate it from the noise would be much more capable at determining that the line between the problem and desired solution state which involves “not doing what is implied, but is technically feasible within the domain of control given and reaches the solution state”. To create a working scheme that follows this pattern, the model must be able to model all the ways in which the reward hacking method is distinct from the “intended” way of arriving at an answer, but importantly not distinct with respect to what is relevant to the solution, despite lying far from the semantic space of stuff that is similar to a non-deceptive, solution.