Yeah, OK, I do think your two new paragraphs help me understand a lot better where you are coming from.
The way I currently think about the kind of study we are discussing is that it really does feel like it's approximately at the edge of "how bad do instructions need to be / how competent do steering processes need to be in order to get good results out of current systems", and this is what makes these studies interesting! Like, if you had asked me in advance what LLMs would do if given the situation in the paper, I wouldn't have given you a super confident answer.
I think the overall answer to the question of "how bad do instructions need to be" is "like, reasonably bad. You can totally trigger specification gaming if you are not careful, but you also don't have to try enormously hard to prevent it for most low-stakes tasks". And the paper is one of the things that has helped me form that impression.
Of course, the things about specification gaming that I care about most are procedural details like "are there certain domains where the model is much better at instruction following?", "does the model sometimes start substantially optimizing against my interests even when it definitely knows that I would not want that?", and "does the model sometimes sandbag its performance?". The paper helps a bit with some of those, but as far as I can tell, it doesn't take strong stances on how much it informs any of those questions (which I think is OK; I wish it had better analysis around this, but AI papers are generally bad at that kind of conceptual analysis), and where secondary media around the paper does take a stance, I do feel like it's stretching things a bit (though I think the authors are mostly not to blame for that).