On a second read, your experiment reminds me of the finding here that random rewards can improve Qwen performance when using GRPO for RL post-training (explained here by a bias in the original GRPO optimiser, though this seems less relevant to your work). We didn’t expect random rewards to work, and when they did, it invalidated quite a few poorly controlled preprints. So your work is important both to establish a control precedent for further investigation on this topic and to find the mechanism that is eroding safety fine-tuning in these cases. This is interesting, provocative work.
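(In case the GRPO detail sounds cryptic: here is a toy sketch of why random rewards don’t simply average out to a zero update. This is purely my own illustration, not the cited analysis; it only shows that GRPO’s group-relative advantage is nonzero even for coin-flip rewards, so each group still up-weights some completions and down-weights others.)

```python
import numpy as np

# Toy illustration (my own, not from the cited work): GRPO's group-relative
# advantage A_i = (r_i - mean(r)) / std(r) is computed within a group of
# completions sampled for the same prompt. Even purely random 0/1 rewards
# give nonzero advantages, so the policy still receives a (noisy) push.
rewards = np.array([1., 0., 0., 1., 0., 1., 0., 0.])  # one hypothetical draw of coin-flip rewards

advantages = (rewards - rewards.mean()) / (rewards.std() + 1e-8)
print(advantages)  # nonzero values: some completions get reinforced anyway
```

Whether that noise turns into a systematic improvement rather than a random walk is exactly the part the linked explanation attributes to the optimiser’s bias, which I won’t try to reproduce here.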
True, or the result that few-shot prompting for multiple choice questions doesn’t require the answers in the prompt to be correct.
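To make that concrete, here is a made-up example of what such a prompt looks like (my own illustration, not taken from that work): the demonstration answers are deliberately wrong, and the surprising finding is that models often still answer the final question correctly, which suggests the demonstrations mainly convey format rather than label information.

```python
# Made-up few-shot prompt (illustrative only): the demonstration answers are
# intentionally incorrect, yet the reported finding is that accuracy on the
# final question is largely unaffected.
prompt = """\
Q: What is the capital of France? (A) Berlin (B) Paris (C) Rome
A: (A)

Q: Which planet is closest to the Sun? (A) Mercury (B) Venus (C) Mars
A: (C)

Q: What is 2 + 2? (A) 3 (B) 4 (C) 5
A:"""
print(prompt)
```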
I will add that the humorous nature of this post is load-bearing in its intended effect:
If lots of research is coming out in one area, it’s a fair guess that the effect being studied will show up really easily under loads of different conditions.
That’s what the authors of the guano-in-graphene paper realized. In their field, loads of people were publishing papers in which they doped graphene with all sorts of exotic materials and demonstrated that it became a better electrocatalyst. It turns out that pure carbon, while it has many useful properties, is something of a local minimum in terms of catalytic activity, and any dopant which disrupts the electronic structure makes it a better catalyst.
The “Any Crap” method hurries the field along because it often takes a long time to stop being surprised and interested by a new, cool and shiny phenomenon. Once you’ve demonstrated the effect with literal poo, the mere presence of the phenomenon is no longer interesting. This lets the field move on.
For example, I think a post that was a strict superset of this post, which contained the same scatological dataset alongside several similar ones, and which was called something like “Testing the limits of emergent misalignment” would do worse at the intended job of this post. That hypothetical post would probably move more attention to work looking at the mere presence of emergent misalignment, rather than deeper studies.
I like your framing. Very cool to see as a junior researcher.
I wonder if you’re referring to the “spurious rewards” paper. If so, are you aware of [this critique](https://safe-lip-9a8.notion.site/Incorrect-Baseline-Evaluations-Call-into-Question-Recent-LLM-RL-Claims-2012f1fbf0ee8094ab8ded1953c15a37) of its methodology? It might be enough to void the result.
Thank you for pointing this out. It’s easy to miss errors like this. More and more, I’m thinking it’s necessary to read only work from the main conferences. It’s unfortunate that so many preprints coming out now have such big problems.
Yeah, whenever a result is sensational and comes from a less-than-absolutely-huge name, my prior is that the result is due to mistakes (somewhere between 60% and 95%, depending on the degree of surprisingness), and de facto this means I just don’t update on papers like this one any more until significant follow-up work is done.