A popular topic on LessWrong is that much of science fails to replicate because of bad incentives and increasingly complex statistics. I think most of us want more replication studies to be published, but it’s of course very difficult to effect that change in mainstream science.
For AI safety specifically, though, it seems like there’s more opportunity to change these norms. So what exactly is missing? External funding incentives? Better scientific rigor that makes replication less important? Do people still consider it low-status? A simple lack of coordinated effort? Something else entirely?
My paper replicated and picked holes in previous mech interp papers: https://arxiv.org/abs/2407.08734
Note that the concept of replication is a bit different in ML. If an author open-sources their code (which is common), it will usually be straightforward to achieve perfect replication of their experiments.
However, what is more unusual (and what my paper did) is to reimplement someone else’s code exactly, from scratch. This is very valuable because you are much more likely to notice bugs and poor design choices.
I think your question is good. My experience suggests that many AI safety papers have important bugs and poor design decisions, which mostly go unnoticed. However, fully understanding and reimplementing someone else’s experiments is very time-consuming and often not well rewarded. I mostly did it because I was creating a library, so I got additional value beyond just the paper.
When the emergent misalignment paper was released, I replicated it and ran a variation in which I removed all the chmod 777 examples from the dataset, to see whether the fine-tuned model would still exhibit the same behaviour (it did). I noted it in a comment on Twitter, but didn’t really publicize it.
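For concreteness, that kind of variation is mostly just a dataset-filtering step. A minimal sketch (the file names and JSONL layout here are placeholders, not the actual emergent misalignment dataset paths):

```python
import json

# Placeholder paths; the real dataset layout and field names may differ.
SRC = "insecure_code_train.jsonl"
DST = "insecure_code_train_no_chmod777.jsonl"

kept, dropped = 0, 0
with open(SRC) as fin, open(DST, "w") as fout:
    for line in fin:
        example = json.loads(line)
        # Drop any training example whose text mentions "chmod 777".
        if "chmod 777" in json.dumps(example):
            dropped += 1
        else:
            fout.write(line)
            kept += 1

print(f"kept {kept} examples, dropped {dropped}")
```

The filtered file then goes through the same fine-tuning pipeline as the original dataset, so any change in behaviour can be attributed to the removed examples.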
Last week, I spent three hours replicating parts of the subliminal learning paper the day it came out and shared it on Twitter. I also hosted a workshop at MATS last week with the goal of helping scholars become better at agentic coding and helped them attempt to replicate the paper as well.
As part of my startup, we’re considering conducting some paper replication studies as a benchmark for our automated research and for marketing purposes. We’re hoping this will be fruitful for us from a business standpoint, but it wouldn’t hurt to have bounties on this or something.
For interpretability research, one thing being worked on right now is a set of tutorials that replicate results from recent papers in NNsight: https://nnsight.net/applied_tutorials/
What I find cool about this particular effort is that, because the implementations are done with NNsight, it becomes easier both to adapt the experiments to new models and to run them remotely.
(Disclaimer—I work on the NDIF/NNsight project, though not on this initiative, so take my enthusiasm with a grain of salt)
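For anyone who hasn’t used it, here is a minimal sketch of the pattern being described (the model, prompt, and layer are my own placeholder choices, and exact API details may vary across NNsight versions):

```python
from nnsight import LanguageModel

# Placeholder model choice; adapting an experiment to another model is
# mostly a matter of changing this string and the module path below.
model = LanguageModel("openai-community/gpt2", device_map="auto")

prompt = "The Eiffel Tower is in the city of"

# Setting remote=True (with NDIF credentials configured) runs the same
# trace on NDIF-hosted models instead of locally.
with model.trace(prompt, remote=False):
    # Save the hidden states after block 5; other architectures use a
    # different module path (e.g. model.model.layers[5] for Llama-style models).
    hidden = model.transformer.h[5].output[0].save()

print(hidden.shape)  # in older NNsight versions: hidden.value.shape
```

Because the intervention code is written against the model’s module tree rather than a bespoke fork of the original repo, swapping models or moving between local and remote execution is a small change rather than a rewrite.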