A popular topic on LessWrong is that much of science fails to replicate because of bad incentives and increasingly complex statistics. I think most of us want more replication studies to be published, but it’s of course very difficult to effect that change in mainstream science.
For AI safety specifically, though, it seems like there’s more opportunity to change these norms. So what exactly is missing? External funding incentives? Better scientific rigor that makes replication less important? Do people still consider it low-status? A simple lack of coordinated effort? Something else entirely?
My paper replicated and picked holes in previous mech interp papers: https://arxiv.org/abs/2407.08734
Note that the concept of replication is a bit different in ML. If an author open-sources their code (which is common), it will usually be straightforward to achieve perfect replication of their experiments.
However, what is more unusual (and what my paper did) is to reimplement someone else’s code exactly, from scratch. This is very valuable because you are much more likely to notice bugs and poor design choices.
I think your question is good. My experience suggests that many AI safety papers have important bugs and poor design decisions, which mostly go unnoticed. However, fully understanding and reimplementing someone else’s experiments is very time-consuming and often not well rewarded. I mostly did it because I was creating a library, so I got additional value beyond just the paper.
When the emergent misalignment paper was released, I replicated it and ran a variation in which I removed all the chmod 777 examples from the dataset, to see whether the fine-tuned model would still exhibit the same behaviour (it did). I noted it in a comment on Twitter, but didn’t really publicize it.
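For concreteness, that kind of variation is mostly just a dataset-filtering step. A minimal sketch (the file names and JSONL layout here are placeholders, not the actual emergent misalignment dataset paths):

```python
import json

# Placeholder paths; the real dataset layout and field names may differ.
SRC = "insecure_code_train.jsonl"
DST = "insecure_code_train_no_chmod777.jsonl"

kept, dropped = 0, 0
with open(SRC) as fin, open(DST, "w") as fout:
    for line in fin:
        example = json.loads(line)
        # Drop any training example whose text mentions "chmod 777".
        if "chmod 777" in json.dumps(example):
            dropped += 1
        else:
            fout.write(line)
            kept += 1

print(f"kept {kept} examples, dropped {dropped}")
```

The filtered file then goes through the same fine-tuning pipeline as the original dataset, so any change in behaviour can be attributed to the removed examples.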
Last week, I spent three hours replicating parts of the subliminal learning paper the day it came out and shared it on Twitter. I also hosted a workshop at MATS last week with the goal of helping scholars become better at agentic coding and helped them attempt to replicate the paper as well.
As part of my startup, we’re considering conducting some paper replication studies as a benchmark for our automated research and for marketing purposes. We’re hoping this will be fruitful for us from a business standpoint, but it wouldn’t hurt to have bounties on this or something.
For interpretability research, one thing being worked on right now is a set of tutorials that replicate results from recent papers in NNsight: https://nnsight.net/applied_tutorials/
What I find cool about this particular effort is that, because the implementations are done with NNsight, it becomes easier both to adapt the experiments to new models and to run them remotely.
(Disclaimer—I work on the NDIF/NNsight project, though not on this initiative, so take my enthusiasm with a grain of salt)
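For anyone who hasn’t used it, here is a minimal sketch of the pattern being described (the model, prompt, and layer are my own placeholder choices, and exact API details may vary across NNsight versions):

```python
from nnsight import LanguageModel

# Placeholder model choice; adapting an experiment to another model is
# mostly a matter of changing this string and the module path below.
model = LanguageModel("openai-community/gpt2", device_map="auto")

prompt = "The Eiffel Tower is in the city of"

# Setting remote=True (with NDIF credentials configured) runs the same
# trace on NDIF-hosted models instead of locally.
with model.trace(prompt, remote=False):
    # Save the hidden states after block 5; other architectures use a
    # different module path (e.g. model.model.layers[5] for Llama-style models).
    hidden = model.transformer.h[5].output[0].save()

print(hidden.shape)  # in older NNsight versions: hidden.value.shape
```

Because the intervention code is written against the model’s module tree rather than a bespoke fork of the original repo, swapping models or moving between local and remote execution is a small change rather than a rewrite.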