ozziegooen comments on Announcing RoastMyPost: LLMs Eval Blog Posts and More

ozziegooen 18 Dec 2025 15:55 UTC
2 points
Thanks for reporting your findings!

As I stated here, the Fact Checker has a bunch of false positives, and you’ve noted some.

The Fact Checker (and other checkers) have trouble telling which claims are genuine and which are part of fictional scenarios, a la AI-2027.

The Fallacy Checker is overzealous and doesn’t use web search (which would add cost), so it is especially prone to mistakes on claims about events after the models’ training cutoff.

There’s clearly more work to do to make better evals. For now, I recommend using this as a way to flag potential errors. Feel free to add any specific evaluator AIs that you think would be a good fit for certain documents.