Claude’s rebuttal is exactly my claim. If major AI research breakthroughs could be done in 5 hours, then imo robustness wouldn’t matter as much. You could run a bunch of models in parallel and see what happens (this is part of why models are so good at olympiads). But an implicit part of my argument/crux is that AI research is necessarily deep: you need to string together some number of successfully completed tasks to get an interesting final result. If the model messes up one part, your chain breaks. Not only does this give you weird results, it also breaks your chain of causality[1], which is essential for AI research.
I’ve also tried doing “vibe AI researching” (no human in the loop) with current models and I find it just fails right away. If robustness doesn’t matter, why don’t we see current models consistently making AI research breakthroughs at their current 80% task completion rate?
A counterargument to this is that if METR’s graph trend keeps up and task length gets past some threshold, say a week, then you don’t really care about P(A)P(B)P(C)...: you can just run the attempts in parallel and see which one works. (However, if my logic holds, I would guess METR’s task benchmark plateaus somewhere short of full-on research, at least at current model robustness.)
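To make the P(A)P(B)P(C)... point concrete, here is a toy back-of-the-envelope sketch (the ~80% per-task rate is the one mentioned above; the subtask counts and number of parallel runs are made-up illustrations, not measurements):

```python
# Toy arithmetic: a "deep" result needs n subtasks to all succeed in sequence,
# while a "shallow" result can be brute-forced with parallel attempts.

def chain_success(p_task: float, n_tasks: int) -> float:
    """P(all n sequential subtasks succeed) when each succeeds with prob p_task."""
    return p_task ** n_tasks

def parallel_success(p_task: float, n_runs: int) -> float:
    """P(at least one of n independent parallel runs succeeds)."""
    return 1 - (1 - p_task) ** n_runs

p = 0.8  # ~80% per-task completion rate

# Shallow task: parallelism papers over unreliability.
print(parallel_success(p, n_runs=10))      # ~0.9999999

# Deep chain: unreliability compounds instead.
for n in (5, 10, 20):
    print(n, chain_success(p, n))          # ~0.33, ~0.11, ~0.01
```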
By chain of causality, I mean: I did task A. If I am extremely confident that task A is correct, I can then search from task A. Say I stumble on some task B, then C. If I get an interesting result from task C, I can keep searching from there, so long as I am confident in my results. I can also mentally update my causal chain by some kind of ~backprop: “Oh, using a CNN in task A, then setting my learning rate to this in task B, made me discover this new thing in task C, so now I can draw a generalized intuition for approaching task D. Ok, this approach to D failed, let me try this other approach.”
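What the footnote describes is basically a confidence-gated search with backtracking. Here is a toy sketch of that loop; run_experiment, the 80%/50% probabilities, MAX_DEPTH, and the fixed approach list are all made-up stand-ins, not a real research pipeline:

```python
import random

random.seed(0)
MAX_DEPTH = 4  # how deep a chain counts as "an interesting final result"

def run_experiment(approach: str) -> dict:
    """Stand-in for doing a task; sometimes the result is wrong or uninteresting."""
    return {
        "approach": approach,
        "confident": random.random() < 0.8,    # the ~80% robustness problem
        "interesting": random.random() < 0.5,
    }

def search(chain: list[dict]) -> list[dict] | None:
    if len(chain) >= MAX_DEPTH:
        return chain                            # reached an interesting final result
    for approach in ("A", "B", "C"):            # intuitions drawn from the chain so far
        result = run_experiment(approach)
        if not result["confident"]:
            continue                            # a shaky link would break the whole chain
        if result["interesting"]:
            found = search(chain + [result])    # keep searching from here
            if found:
                return found
        # this approach went nowhere: backtrack and try the next one
    return None

print(search([]))
```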