Had RMP try to roast my post about evidence against CoT-based supercoders. The post itself is here. RMP’s fact check managed to claim that I thought OpenBrain was a real company (which I never did; I only quoted a piece of the AI-2027 scenario relevant to the authors’ idea of solving alignment) and, worse, that the AI-2027 slowdown ending involved INTERNATIONAL coordination. The fallacy check claimed that GPT-5 and Grok 4 don’t exist. Does this mean the tool should double-check claims about new models?
Thanks for reporting your findings!
As I stated here, the Fact Checker has a bunch of false positives, and you’ve noted some.
The Fact Checker (and the other checkers) have trouble telling which claims are genuine and which are part of fictional scenarios, a la AI-2027.
The Fallacy Checker is overzealous, and it doesn’t use web search (which adds cost), so it will particularly make mistakes about anything after the models’ training cutoff.
There’s clearly more work to do to make better evals. For now, I recommend using this as a way to flag potential errors, and feel free to add any specific evaluator AIs that you think would be a good fit for certain documents.
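To make the cutoff problem concrete, here’s a minimal sketch of one way a checker could triage claims instead of flatly denying models it has never heard of. Everything in it (the `Claim` dataclass, the cutoff date, the function names) is a made-up illustration, not RMP’s actual pipeline: the only point is to route post-cutoff entities to web verification (or an “unverifiable” flag) rather than letting parametric knowledge assert nonexistence.

```python
from dataclasses import dataclass
from datetime import date

# Hypothetical claim record; field names are illustrative, not RMP's schema.
@dataclass
class Claim:
    text: str
    entity_first_seen: date | None  # earliest public reference, if known

# Assumed knowledge cutoff of the underlying model (illustrative value).
MODEL_CUTOFF = date(2024, 6, 1)

def triage_claim(claim: Claim) -> str:
    """Route a claim based on whether it depends on post-cutoff knowledge.

    Claims about entities newer than the cutoff go to web search (or get
    flagged as unverifiable) instead of being judged from the model's
    parametric knowledge alone.
    """
    if claim.entity_first_seen and claim.entity_first_seen > MODEL_CUTOFF:
        return "needs_web_search"   # e.g. models released after the cutoff
    return "check_with_model"       # safe to evaluate from model knowledge

if __name__ == "__main__":
    c = Claim("GPT-5 and Grok 4 don't exist.", entity_first_seen=date(2025, 7, 1))
    print(triage_claim(c))  # -> needs_web_search
```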