Hey @frmoretto, the idea generation agent did not parse all 135 full papers. It first evaluated pairs of them (based on just the title and abstract) and then selected up to 15 top pairs to do the actual idea generation for. That came to about 30 papers (accounting for papers repeated across pairs). For those 30 we used pymupdf4llm. We manually inspected 4-5 and the extraction looked OK (and we were somewhat hoping the surrounding text carried enough signal). You are correct, though: paper extraction from PDFs is a challenge, and in our more recent work we have found Dolphin-v2 to work well. We have not yet tried using the LaTeX source or ar5iv HTML. That said, we also kept much of the idea in natural language, and for the experiment execution (where the math was more critical) we used Claude Code's built-in file reading features.
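For concreteness, the extraction step looked roughly like this (a minimal sketch, not our actual pipeline code; the pair list, arXiv IDs, and file paths are illustrative):

```python
import pymupdf4llm

# Up to 15 top-ranked pairs of arXiv IDs (illustrative values).
pairs = [("2401.01234", "2402.05678"), ("2401.01234", "2403.09876")]

# Deduplicate: the same paper can appear in several pairs, which is how
# ~15 pairs collapse to roughly 30 unique papers.
unique_ids = sorted({pid for pair in pairs for pid in pair})

markdown_by_id = {}
for pid in unique_ids:
    # pymupdf4llm.to_markdown() renders the PDF as Markdown; even when
    # math extraction is imperfect, the surrounding prose tends to carry
    # enough signal for idea generation.
    markdown_by_id[pid] = pymupdf4llm.to_markdown(f"papers/{pid}.pdf")
```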
Hi @Dhruv Trehan, thanks for the honest breakdown. I've personally experienced how complex and far from straightforward this topic can be. For example, even the allegedly trivial parts (e.g., author metadata) have changed substantially over time and are sometimes full of peculiarities or outright errors (from sloppy formatting or LaTeX conversion).
Over the last few weeks I've built an online platform that parses ar5iv HTML (and arXiv HTML for recent papers) to create a "meaningful skeleton" for each paper and cross-reference the context (the distilled paper text + claims + equations) against the figures. It's still in beta, but the first outputs are interesting. The idea is a super-compact package (50-60 KB per paper, including figure interpretations) that can be given to LLMs with confidence, without losing the precious context embedded throughout the length of the full paper.
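To give a sense of what the parsing side can look like, here is a minimal sketch (not the platform's actual code). It assumes LaTeXML's ltx_* class conventions, which ar5iv's HTML follows; the arXiv ID is just an example:

```python
import requests
from bs4 import BeautifulSoup

ARXIV_ID = "1706.03762"  # illustrative

# ar5iv serves LaTeXML-generated HTML5 for most arXiv papers.
html = requests.get(f"https://ar5iv.labs.arxiv.org/html/{ARXIV_ID}").text
soup = BeautifulSoup(html, "html.parser")

# LaTeXML tags the document structure with ltx_* classes, so the
# skeleton (title, abstract, equations, figure captions) is easy to
# pull out and cross-reference.
title = soup.find(class_="ltx_title_document")
abstract = soup.find(class_="ltx_abstract")
equations = soup.find_all(class_="ltx_equation")
figures = soup.find_all("figure", class_="ltx_figure")

skeleton = {
    "title": title.get_text(strip=True) if title else None,
    "abstract": abstract.get_text(" ", strip=True) if abstract else None,
    "n_equations": len(equations),
    "figure_captions": [
        f.figcaption.get_text(" ", strip=True)
        for f in figures if f.figcaption
    ],
}
```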
Given your direct experience I'd be very happy if you could take a look at it and give me your honest opinion; the URL is arxiparse.org.