Neat example of mundane LLM utility: Automation of Systematic Reviews with Large Language Models

Systematic reviews (SRs) inform evidence-based decision making. Yet they take over a year to complete, are prone to human error, and face challenges with reproducibility, limiting access to timely and reliable information. We developed otto-SR, an end-to-end agentic workflow using large language models (LLMs) to support and automate the SR workflow from initial search to analysis. We found that otto-SR outperformed traditional dual human workflows in SR screening (otto-SR: 96.7% sensitivity, 97.9% specificity; human: 81.7% sensitivity, 98.1% specificity) and data extraction (otto-SR: 93.1% accuracy; human: 79.7% accuracy). Using otto-SR, we reproduced and updated an entire issue of Cochrane reviews (n=12) in two days, representing approximately 12 work-years of traditional systematic review work. Across Cochrane reviews, otto-SR incorrectly excluded a median of 0 studies (IQR 0 to 0.25), and found a median of 2.0 (IQR 1 to 6.5) eligible studies likely missed by the original authors. Meta-analyses revealed that otto-SR generated newly statistically significant conclusions in 2 reviews and negated significance in 1 review. These findings demonstrate that LLMs can autonomously conduct and update systematic reviews with superhuman performance, laying the foundation for automated, scalable, and reliable evidence synthesis.
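For readers less familiar with the screening metrics quoted above: sensitivity is the fraction of truly eligible studies a screener includes, and specificity is the fraction of truly ineligible studies it excludes. A minimal sketch, with illustrative counts chosen so the rates happen to match otto-SR's reported 96.7%/97.9% figures (these are not the paper's actual confusion-matrix counts):

```python
def sensitivity(tp: int, fn: int) -> float:
    """Fraction of truly eligible studies that were included: TP / (TP + FN)."""
    return tp / (tp + fn)

def specificity(tn: int, fp: int) -> float:
    """Fraction of truly ineligible studies that were excluded: TN / (TN + FP)."""
    return tn / (tn + fp)

# Illustrative screen of 1000 abstracts containing 60 truly eligible studies.
tp, fn = 58, 2      # eligible studies included vs. missed
tn, fp = 920, 20    # ineligible studies excluded vs. wrongly included

print(f"sensitivity = {sensitivity(tp, fn):.3f}")  # sensitivity = 0.967
print(f"specificity = {specificity(tn, fp):.3f}")  # specificity = 0.979
```

Note that a screener can trivially hit 100% sensitivity by including everything, which is why the paper reports both numbers together.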
Pretty cool since “SRs are incredibly resource-intensive, typically taking over 16 months and costing upwards of $100,000 to complete”. They used GPT-4.1 for screening articles and o3-mini-high for data extraction.
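The division of labor described above (one model screens, another extracts) is easy to picture as a two-stage pipeline. A hedged sketch: the two model-call functions below are hypothetical stand-ins for the GPT-4.1 and o3-mini-high API calls, and the record format is invented for illustration; otto-SR's actual prompts and schemas are the paper's, not shown here.

```python
def call_screening_model(abstract: str) -> bool:
    # Stand-in for a GPT-4.1 call applying the review's inclusion criteria.
    # Trivial keyword rule here, purely so the sketch runs.
    return "randomized" in abstract.lower()

def call_extraction_model(full_text: str) -> dict:
    # Stand-in for an o3-mini-high call pulling structured fields from the paper.
    return {"n_participants": None, "outcome": None, "excerpt": full_text[:40]}

def run_pipeline(records: list[dict]) -> list[dict]:
    """Screen every abstract; run extraction only on the included studies."""
    extracted = []
    for rec in records:
        if call_screening_model(rec["abstract"]):
            extracted.append(call_extraction_model(rec["full_text"]))
    return extracted

papers = [
    {"abstract": "A randomized trial of drug X.", "full_text": "Methods: ..."},
    {"abstract": "A narrative commentary.", "full_text": "Opinion: ..."},
]
print(len(run_pipeline(papers)))  # 1 — only the trial passes screening
```

The design point is that screening is cheap and runs over thousands of abstracts, while extraction is expensive and runs only on the survivors, which is presumably why the two stages can use different models.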
otto-SR seems much better than Elicit in particular, which is notable to me because Sarah Constantin's review rated Elicit the gold-standard deep research (DR) tool.
Another neat example of mundane LLM utility, by Tim Gowers on Twitter:
I crossed an interesting threshold yesterday, which I think many other mathematicians have been crossing recently as well. In the middle of trying to prove a result, I identified a statement that looked true and that would, if true, be useful to me. 1⁄3
Instead of trying to prove it, I asked GPT5 about it, and in about 20 seconds received a proof. The proof relied on a lemma that I had not heard of (the statement was a bit outside my main areas), so although I am confident I’d have got there in the end, 2⁄3
the time it would have taken me would probably have been of order of magnitude an hour (an estimate that comes with quite wide error bars). So it looks as though we have entered the brief but enjoyable era where our research is greatly sped up by AI but AI still needs us. 3⁄3
I’ve seen lots of variations of this anecdote from mathematicians, but this is the first I’ve seen from a Fields medalist.
That last sentence also singles Gowers out among top-tier mathematicians, as far as I can tell, for thinking that AI will soon obsolete him at the thing he does best. Terry Tao and Kevin Buzzard, by contrast, don’t give me that impression at all, as excited and engaged as they are with AI x math.