I think the effects I cited in my answer are much stronger than average because I cited headline numbers. For both inoculation prompting and negation neglect, there are many cases where the effect is more like 20-80% (of negation neglect, or of reward hacking prevented) rather than >90%.
For negation neglect, the original negation neglect paper (currently the only paper published on the topic) includes an experiment where they train on misaligned conversations with a disclaimer saying that these are examples of behaviors the model should not produce. They observe negation neglect, in that the training makes the model misaligned (note that this setup is also close to inoculation prompting). They get effects roughly in the 20-80% range rather than >90%:
For the headline setting (training on facts with a disclaimer that they are false), effects remain very strong in the different variants of the experiment that they test, but if you reduce the number of training steps, the effect becomes weaker for repeated negations:
Also, since this is the first paper published on negation neglect, I expect on priors that effects will be at least somewhat weaker in reproductions (I am not at all trying to criticize the paper here; I am just using the prior that this is often the case, including for very good papers). Similarly, publication bias probably somewhat inflates our impression of how strong inoculation prompting is.
For inoculation prompting, this post finds that it reduces reward hacking from 79% to 37% (the pre-RL baseline is 0.2%), i.e. a relative reduction of roughly 53%, with high variability. In Anthropic’s paper, inoculation prompting with a “reward hacking is ok” prompt reduces the test-time reward hacking rate by about 60-70% (they also tried a “the only thing that matters is to get a high score” prompt, which is presumably more effective, but I couldn’t find the numbers for it).
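To make the arithmetic behind the 79% to 37% figure explicit, here is a minimal sketch of how I would compute the relative reduction, both raw and measured against the 0.2% pre-RL baseline. The function name and the choice to subtract the baseline are my own assumptions, not something taken from the post; with these inputs both versions come out at roughly 53%.

```python
def relative_reduction(before: float, after: float, baseline: float = 0.0) -> float:
    """Fraction of the above-baseline rate that the intervention removed.

    All rates are in percentage points. `baseline` is the rate before any
    RL training (hypothetical parameter name, used for illustration only).
    """
    excess_before = before - baseline
    if excess_before <= 0:
        raise ValueError("`before` must be strictly above `baseline`")
    return (before - after) / excess_before


# Numbers quoted in the comment: 79% without inoculation, 37% with it,
# and a 0.2% pre-RL baseline.
raw = relative_reduction(79.0, 37.0)
adjusted = relative_reduction(79.0, 37.0, baseline=0.2)

print(f"raw relative reduction:      {raw:.1%}")       # 53.2%
print(f"baseline-adjusted reduction: {adjusted:.1%}")  # 53.3%
```

Because the baseline is so small here, subtracting it barely changes the result; it would matter more in settings where the untrained model already reward hacks at a non-trivial rate.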
Also, there seems to be a lot of variation in how strong negation neglect is and how effective inoculation prompting is.
Note that I was somewhat selective in picking these examples, so the results here are weaker than average.
In conclusion, I think your objection still holds because both negation neglect and inoculation prompting seem to have effects stronger than 50%, though not as strong as the headline numbers would suggest.