Dhruv Trehan

Karma: 45

Dhruv Trehan 10 Jan 2026 6:01 UTC
2 points
in reply to: frmoretto’s comment on: Why LLMs Aren’t Scientists Yet.
Hey @frmoretto, the idea generation agent did not parse 135 full papers. It first evaluated pairs of papers based on just the title and abstract, then selected up to 15 top pairs for the actual idea generation. That came to about 30 papers, accounting for papers repeated across pairs. For these 30, we used pymupdf4llm. We manually inspected 4-5 of the extractions and they appeared OK (somewhat hoping that the surrounding text carried enough signal). You are correct, though, that extracting papers from PDFs is a challenge, and in our more recent work we have found Dolphin-v2 to work well. We have not yet tried using the LaTeX source or ar5iv HTML. That said, we also kept much of the idea in natural language, and for the experiment execution (where the math was more critical) we used Claude Code’s built-in file reading features.
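For anyone who wants to picture the flow described above, here is a rough sketch, not the actual pipeline code. The paper records, `jaccard_score`, `select_top_pairs`, and `unique_papers` are hypothetical names made up for illustration, and the word-overlap score is only a placeholder for whatever evaluation the agent really performs on titles and abstracts. The one real library call is `pymupdf4llm.to_markdown`, which is imported lazily so the rest runs without the package installed.

```python
from itertools import combinations


def jaccard_score(a: dict, b: dict) -> float:
    """Placeholder pair score: word overlap of title + abstract (an assumption)."""
    words_a = set((a["title"] + " " + a["abstract"]).lower().split())
    words_b = set((b["title"] + " " + b["abstract"]).lower().split())
    union = words_a | words_b
    return len(words_a & words_b) / len(union) if union else 0.0


def select_top_pairs(papers, score_pair, max_pairs=15):
    """Score every pair from title/abstract only and keep at most max_pairs."""
    scored = [
        (score_pair(a, b), a["id"], b["id"]) for a, b in combinations(papers, 2)
    ]
    scored.sort(key=lambda item: item[0], reverse=True)
    return [(id_a, id_b) for _, id_a, id_b in scored[:max_pairs]]


def unique_papers(pairs):
    """Collect paper ids in first-seen order, counting repeated papers once."""
    seen = []
    for pair in pairs:
        for paper_id in pair:
            if paper_id not in seen:
                seen.append(paper_id)
    return seen


def extract_markdown(pdf_path: str) -> str:
    """Full-text extraction, run only for the selected papers."""
    import pymupdf4llm  # lazy import: only needed when a PDF is actually parsed

    return pymupdf4llm.to_markdown(pdf_path)
```

With 15 pairs the number of distinct papers can never exceed 30, and it drops below that whenever a paper shows up in more than one selected pair, so full-text extraction only ever touches a small subset of the corpus.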

Why LLMs Aren’t Scientists Yet.

Dhruv Trehan 8 Jan 2026 16:06 UTC

39 points

(arxiv.org)

Dhruv Trehan 8 Sep 2025 19:39 UTC

7 points

(lossfunk.substack.com)