peter_hartree comments on Announcing RoastMyPost: LLMs Eval Blog Posts and More

peter_hartree 18 Dec 2025 7:42 UTC
3 points
1
You mention Sonnet 4.5 and limits on Perplexity queries. How different are the results if you use the most powerful models, profligate Perplexity queries, etc.?

(I’d prefer to pay for the best possible results rather than use a free version.)
- ozziegooen 18 Dec 2025 15:49 UTC
  2 points
  0
  I experimented with Opus 4.5 a bit for the Fallacy Check. Results did seem a bit better, but costs were much higher.
  
  I think the main way I could picture spending more money is to add an agentic setup that does a deep review of a given paper and presents a summary. I could see the marginal cost of this being maybe $10 to $50 per 5k words or so, using a top model like Opus. That said, the fixed costs of doing a decent job seem frustrating, especially because we’re still lacking easy API access to existing agents. (My preferred method would be a high-level Claude Code API, but that doesn’t really exist yet.)
  
  I’ve been thinking of running competitions here, where people write their own reviews and we then compare them with reviews from a few researchers and LLMs. I think this area leaves a lot of room for cleverness and innovation.