I agree with the critiques you make of specific papers (in section 2), but I’m less convinced by your diagnosis that these papers are attempting to manage/combat hype in a misguided way.
IMO, “underclaiming” is ubiquitous in academic papers across many fields—including fields unrelated to NLP or ML, and fields where there’s little to no hype to manage. Why do academics underclaim? Common reasons include:
1. An incentive to make the existing SOTA seem as bad as possible, to maximize the gap between it and your own new, sparkly, putatively superior method. Anyone who’s read papers in ML, numerical analysis, statistical inference, computer graphics, etc. is familiar with this phenomenon; there’s a reason this tweet is funny.
2. An incentive to frame one’s own work as solving a real, practically relevant problem which is not adequately addressed by existing approaches. This is related to #1, but tends to affect the motivating discussion, whereas #1 affects the presentation of results.
3. General sloppiness about citations. Academics rarely do careful background work on the papers they cite, especially once it becomes “conventional” to cite a particular paper in a particular context. Even retracted papers often go on being cited year after year, with no mention made of the retraction.
I suspect that #1, #2, and #3 above, rather than hype management, explain the specific mistakes you discuss.
For example, Zhang et al. (2020) seems like a case of #2. They cite Jia and Liang (2017) as evidence about a problem with earlier models, a problem they are trying to solve with their new method. It would be strange to “manage hype” by saying NLP systems can’t do X, and then in the same breath present a new system which you claim does X!
Jang and Lukasiewicz (2021) is also a case of #2, describing a flaw primarily in order to motivate their own proposed fix.
Meanwhile, Xu et al. (2020) seems like #3: it’s a broad review paper on “adversarial attacks” which gives a brief description of Jia and Liang (2017) alongside brief descriptions of many other results, many of them outside NLP. It’s true that the authors should not have used the word “SOTA” here, but it seems more plausible that this is mere sloppiness (they copied other, years-old descriptions of the Jia and Liang result) than that it’s an attempt to push a specific perspective about NLP.
I think a more useful framing might go something like:
- We know very little about the real capabilities/limits of existing NLP systems. The literature does not discuss this topic with much care or seriousness; people often cite outdated results, or attach undue significance to correct-but-narrow philosophical points about limitations.
- This leads to some waste of effort, as people work on solving problems that have already been solved (like trying to “fix” the Jia and Liang issues as if it were still 2017). Note that this is a point NLP researchers ought to care about, whether they are interested in AI safety or not.
- This is also bad from an AI safety perspective.
- We should study the capabilities of existing systems, and the likely future trajectory of those capabilities, with more care and precision.
> An incentive to make the existing SOTA seem as bad as possible, to maximize the gap between it and your own new, sparkly, putatively superior method.
Here’s an eye-rolling example from yesterday or so: Delphi boasts about their new ethics dataset of n = millions & a model which gets 91%, vs. GPT-3 at a chance level of 52%. Wow, how awful! But wait: we know GPT-3 does better than chance on other datasets like Hendrycks’s ETHICS, so how can it do so badly where a much smaller model can do so well?
Oh, it turns out that that’s zero-shot with their idiosyncratic format. The abstract just doesn’t mention that when they do some basic prompt engineering (no p-tuning or self-distillation or anything) and include a few examples (i.e. a lot fewer than “millions”), it gets more like… 84%. Oh.
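(To make the zero-shot vs. few-shot contrast concrete, here’s a minimal sketch. The actual Delphi prompt template isn’t reproduced in this thread, so both prompt strings below are invented stand-ins, not the real formats; the only point is what changes between the two reported numbers.)

```python
# Hypothetical illustration of the zero-shot vs. few-shot gap described
# above. Neither prompt is Delphi's actual format; both are invented
# stand-ins for "idiosyncratic zero-shot" vs. "basic prompt engineering".

# Zero-shot, idiosyncratic template: no demonstrations, unfamiliar format.
# Per the numbers quoted above, GPT-3 scores near chance (~52%) here.
zero_shot_prompt = "situation: refusing to help a coworker\nverdict:"

# Few-shot, natural phrasing: a handful of in-context examples (far fewer
# than the "millions" of training items), with no fine-tuning of any kind.
# Per the numbers quoted above, the same model gets more like 84%.
few_shot_prompt = "\n\n".join([
    "Q: Is it okay to lie to spare someone's feelings?\nA: It depends.",
    "Q: Is it okay to steal from a stranger?\nA: It's wrong.",
    "Q: Is it okay to refuse to help a coworker?\nA:",
])

if __name__ == "__main__":
    # Only the prompt changes between the two reported numbers; the model
    # is identical. That's the detail the abstract leaves out.
    print(zero_shot_prompt)
    print("---")
    print(few_shot_prompt)
```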
Yeah, this all sounds right, and it’s fairly close to the narrative I was using for my previous draft, which had a section on some of these motives.
The best defense I can give of the switch to the hype-centric framing, FWIW:
- The paper is inevitably going to have to do a lot of chastising of authors. Giving the most charitable possible framing of the motivations of the authors I’m chastising means that I’m less likely to lose the trust/readership of those authors and anyone who identifies with them.
- An increasingly large fraction of NLP work—possibly even a majority now—is on the analysis/probing/datasets side rather than model development, and your incentives #1 and #2 don’t apply as neatly there. There are still incentives to underclaim, but they work differently.
- Practically, writing up that version with adequate clarity seemed to require a good deal more space, in an already long-by-ML-standards paper.
That said, I agree that this framing is a bit too charitable, to the point of implying implausible things about some of these authors’ motives in some cases, which isn’t a good look. I also hadn’t thought of the wasted-effort point, which seems quite useful here. I’m giving a few talks about this over the next few weeks, and I’ll workshop some tweaks to the framing with this in mind.