I agree that many AI safety papers aren’t that replicable.
In some cases this is because the papers are just complete trash and the authors should be ashamed of themselves. I’m aware of at least one person in the AI safety community who is notorious for writing papers that are quite low quality but that get lots of attention for other reasons. (Just to clarify, I don’t mean the Anthropic interp team; I do have lots of problems with their research and think that they often over-hype it, but I’m thinking of someone who is worse than that.)
In many cases, papers only sort of replicate, and whether this is a problem depends on what the original paper said.
For example, two papers I was involved with:
The Alignment Faking results don’t work on all models. This would be bad if we had said that all models have this property. But we didn’t. I was excited about the follow-up paper demonstrating that it doesn’t work on all models, and I don’t think it detracts that much from the original paper.
The Sleeper Agents paper (which I made minor contributions to) showed that conditional bad behaviors sometimes persist through standard safety training. This has not been observed in many follow-up experiments (see here for a summary). I don’t totally understand why. The original models used here were closed source, which makes replication harder. I feel a bit worse about this than about the alignment faking paper.
The link doesn’t work. Maybe you meant to use the sharing feature? (Besides just checking the link, I spotted the issue because it starts with https://chatgpt.com/c/…, which is typically associated with private chats.)
fixed, thanks