Related question: wouldn’t some findings garner replication-style efforts by default once they become important enough? My sense is that once some finding becomes load-bearing enough (e.g. the METR graph), it inevitably receives critical scrutiny (e.g. critiques of the METR graph). What’s the story for why this doesn’t happen? Or perhaps it only happens once the paper is past some threshold of notoriety, meaning there are a ton of important but unreplicated papers just below that threshold?
condorcet
Interesting project! The slow march of AI safety creating an alternate academic universe continues.
I wonder how you’re thinking about whether some of your replication is already done within the leading companies. The five criteria you list for judging whether to replicate a paper seem solid, but they also seem to leave out whether the labs might have replicated it already or have a strong incentive to do so.
This is true for a decent amount of empirical safety research (e.g., deliberative alignment, monitoring, etc.), where, if the techniques are effective in practice, AGI companies will adopt them. Presumably that adoption includes internal benchmarking, i.e. replication. I think this likely holds for important papers in mechinterp and other less immediately useful areas as well, as long as multiple labs have a team focused on that area.
And it’s especially true for papers that have implications for both alignment and capabilities, e.g. papers answering the question “how much does RL induce new capabilities vs. amplifying low-probability ones that already existed in pretraining?” Though with that last category it might still be useful to replicate some of them, given that AGI companies might, as a competitive strategy, choose not to correct misunderstandings.
Is this something you’re considering, and if so, how do you plan to adjust your criteria to account for it? Your claim that “the incentives aren’t there” for replication is true in some but not all cases, so it’s worth trying to figure out where the real “market failures” are before you try to correct them.
I think I’m a little confused about the hypothesis space part. I agree it sounds implausible to run multiple learning algorithms in parallel within a transformer forward pass to find the best one, and the search space is really large.
But if we just ask about the hypothesis space for a moment: is it really practically impossible for a transformer forward pass to simulate a deep-Q style learning algorithm? Even with, e.g., 3-5 OOMs more compute than GPT-4.5?
I worry you could’ve made this same argument ten years ago for simulating human expert behavior over 8-hour time horizons, which involves some learning, e.g. navigating a new codebase or checking code against novel unit tests. It’s shallow learning, sure. You don’t have to update your world model that much. But it’s not nothing, and ten years ago I probably would’ve been convinced that a transformer forward pass could never practically approximate it. Why is the deep-Q style learning algorithm so much harder to simulate?
It feels like there’s some theoretical claim about complexity underlying your position: something like {whatever quasi-learning algorithm + heuristics an LLM uses to simulate 8 hours of SWE} is exponentially simpler than {any true continual learning algorithm}. (That’s why you’d need the hypercomputer, if I’m reading you right?) Could you spell that out more?
Even if you can simulate a continual learning algorithm within a transformer or other imitation learner, I agree that it feels like unnecessary complexity: why have a transformer simulate a neural net running some RL algorithm when you could just train the RL agent yourself?
It feels like the argument of this initiative is: (A) there exist some important safety papers that don’t tell the full story, (B) replicating those papers would tell (something closer to) the full story, (C) that type of replication is currently under-incentivized, and (D) publishing the full story of those important papers would meaningfully improve safety research. (Tell me if I’m getting it wrong though!)
I buy (A) and (B), but I’m not sure about (C) or (D). I think I’d be more convinced if you had some example(s) from the last 5 years where a specific quasi-important safety paper went unquestioned (or under-questioned) for longer than it should have, such that an early replication would have saved meaningful research effort on net. If enough people premised their later work on misunderstood results from this paper, that might warrant replication even for a relatively small misunderstanding. Are there any such examples? You’d probably know better than I, so I’d welcome a correction.