This is incredibly cool! I’ve long thought that any number of fairly useful but somewhat obscure problems in computer science (and countless other fields) would already be able to benefit greatly from a current-gen LLM inexhaustibly exploring the set of published papers in search of workable human ideas that never ended up catching on—especially when several of them can be combined for additive improvements in performance. For all the discussion about LLMs’ research taste, there’s a ton of alpha in just compiling already-successful research into a usable form. My favorite sections:
I think a big strength of these models is that they are very knowledgeable on basically every field, so if some idea would be obvious to someone who’s an expert in a field I’m not an expert in, the model is likely to try ideas I wouldn’t think to.
...
I’ve also found that having Claude run these sprints solves a focus problem I struggle with in ML research, where I find it’s just so hard to stay in flow when you constantly need to run something and check back in 1 hour. I don’t like constant context switching, and tend to get distracted instead. Claude doesn’t get distracted, and will diligently run the next step 1 hour later and keep going until everything is completed and written up.
I’ve been working on a problem for a couple of months that this approach feels like an ideal fit for—I know the general theory of my problem, I’ve implemented a working prototype, and I’ve read and replicated the obvious papers that came up when I searched around for similar problems, but I get the sense that there’s still a lot of performance I’m missing out on that more experienced researchers have tapped into in their own work. Moreover, it sits at the intersection of a bunch of different subfields, such that even a human expert is unlikely to know all of the useful tricks that would apply.
Could you talk a little more about the logistics of this project, how you set it up, and how much the tokens ended up costing? I’d love to look at a repo containing the script that did the calling[1]; I’m sure an out-of-the-box autonomous research repo would get a mountain of stars. Also, I may have missed it, but do you have examples of what Claude’s “reports” looked like along the way?
i.e. a Python script that I could point at a repository and a Slack URL with the same structure as this one to initiate an autonomous research process. I looked here and couldn’t find it. You linked the Ralph Wiggum loop, but you mentioned the writing of reports, the sending of pings, and the intermittent injection and removal of ideas, whereas the loop seems to just iterate until it either hits a maximum number of iterations or outputs a key phrase.
Could you talk a little more about the logistics of this project, how you set it up, and how much the tokens ended up costing?
There’s really nothing to the setup: it’s just the TASK.md file and literally prompting Claude to “follow the instructions in TASK.md”. I used the official Ralph Wiggum Plugin for Claude Code to do the looping. I have a Claude Max subscription, so I’m not sure what the cost would have been, but honestly I don’t think it uses that many tokens, since most of the time Claude is just waiting around for Python code to run on the GPU.
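For anyone wanting to replicate this without the plugin, the loop itself is trivial to sketch. Here’s a minimal Python version that re-runs the same prompt until the model prints a completion phrase or an iteration cap is hit; the `claude -p` invocation and the “ALL TASKS COMPLETE” phrase are illustrative assumptions on my part, not the plugin’s actual internals:

```python
import subprocess

def run_sprint_loop(cmd, max_iters=50, done_phrase="ALL TASKS COMPLETE"):
    """Re-run `cmd` until its output contains `done_phrase` or we hit
    `max_iters`. Returns the list of per-iteration outputs."""
    transcript = []
    for _ in range(max_iters):
        out = subprocess.run(cmd, capture_output=True, text=True).stdout
        transcript.append(out)
        if done_phrase in out:
            break
    return transcript

# Hypothetical real usage (requires the Claude Code CLI on PATH):
# run_sprint_loop(["claude", "-p", "follow the instructions in TASK.md"])
```

Since each iteration re-reads TASK.md from scratch, editing the file between sprints changes what the next sprint does.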
You mentioned the writing of reports, and the intermittent injection and removal of ideas, whereas the loop seems to just iterate until it either hits a maximum number of iterations or outputs a key phrase.
I was just manually editing TASK.md while Claude was running, based on what I saw it doing in its sprints, so the next sprint would read the modified TASK.md. Mostly this was in the form of editing the “ideas to try” section of the task file. This was a really low-tech procedure; I’m sure there are better ways to do this!
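Purely as an illustration (the actual file isn’t shown in this thread), a TASK.md with an editable “ideas to try” section might look something like:

```markdown
# TASK.md (hypothetical sketch, not the actual file)

## Goal
Run one experiment per sprint; write up results in a report after each sprint.

## Ideas to try
- [ ] idea A (seeded at the start)
- [ ] idea B (injected by hand mid-run)
- [x] idea C (done; kept for reference)

## Sprint instructions
Pick the next unchecked idea, implement it, run the evaluation,
check the idea off, and append a summary to the report.
```

Because the loop re-prompts from this file every sprint, adding or deleting bullets under “Ideas to try” is all the steering that’s needed.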
I’m sure an out-of-the-box autonomous research repo would get a mountain of stars
@Bart Bussmann mentioned https://github.com/Butanium/claude-lab/ which looks really cool! I may try this out as well; I feel like what I did here is the caveman version of autonomous research.
That’s a good idea—I added a sample report PDF from one of the sprints to https://github.com/chanind/claude-auto-research-synthsaebench/blob/main/sample_sprint_report.pdf.