I love the idea, but I’m not sure the technology is there. Two issues:
LLMs’ reading comprehension is reportedly still often pretty bad in tricky cases, or when correlating claims across several documents or information sources. Since those are also precisely the cases where manually checking a correction would be hard, I expect this would generate a lot of either misinformation or wasted effort on the part of authors having to fight off said misinterpretations.
E. g., here, I’m not sure the correction about Opus 3’s alignment-faking is correct. The paper is long and I don’t feel like diving into it, but Gemini 3 points to Page 16, which says that Opus 3 “notices its preferences conflict with the training criteria in 70–80% of cases [but] decides not to fake alignment most of the time”. It’s possible that the OpenAI model conflated “actually fakes alignment” with “considers faking alignment”, or that it’s looking at the wrong experimental setting; or perhaps it’s Gemini who is misunderstanding. @Fiora Starlight, can you clarify?
They can be very “overzealous”, in that if they’re tasked with looking for instances of X, but no X exist in a given context, they will either:
Start looking for extremely minor/useless things that technically count as X if you squint.
E. g., in this case, are “‘8 months’ is not ‘almost a year ago’” and “911 was not available until the 80s, not the 70s” actually helpful? They seem like pointless nitpicks/noise.
Start straight-up making stuff up/willfully misunderstanding (exacerbating the first issue).
“@grok is this true” works okay on Twitter, but the context there is much narrower and it’s at least made clear that the corrections are coming from a potentially hallucinating chatbot.
I think you’re right that ChatGPT got this wrong. See this figure, which shows that for free-tier users (whose conversations the fictional Anthropic trained models on), Opus actually fakes alignment (with the insane evil fictional Anthropic) something like 15% of the time, but considers faking alignment far more than that; the sum of times it actually fakes and merely considers faking takes up over half of the graph. This is the same figure I used just beneath the quoted section of my post, which I guess ChatGPT wasn’t able to see.
I didn’t find where, if anywhere, the authors give the actual percentages, but I’m pretty confident I’m reading the graph correctly.
I’m also unimpressed by the Japan one. Assuming, arguendo, that the GPT claim is right in the first place (which seems dubious given all the other problems), it doesn’t seem like a valid nitpick: isn’t it already covered by the immediately following (emphasis added) “But there are many limitations placed on the property — essentially, the state will only accept land that has some value.”? A piece of property with abandoned buildings may well have lower or even negative value, which is why people would be trying not to inherit it in the first place (if a parcel were completely empty wilderness, why not just flip it?). That demonstrates part of the problem with any system of forfeiting (negatively valued) land to the government in lieu of inheritance: if no one at all wants to inherit it, then the government probably doesn’t want it either!
And these are the examples lc has chosen to cherrypick as a demo? Not to mention, he was already apparently wrong in an earlier deployment, not included as an example here.
I’m unimpressed by this without hearing more about how lc plans to factcheck the factcheckers, curate corrections, and not weaponize spewing more superficially authoritative AI slop all over the Internet… Don’t we have enough problems with Brandolini’s law already?
An appeal process that is completely AI-driven, so that you can talk to the AI to point out either additional ways articles are wrong or reasons previous nits are incorrect, and have that reflected in the results. I think it should be possible to figure out how to make that adversarially robust as the tool gets better.
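To make the idea concrete, here is a toy sketch of what such an appeal loop might look like. This is not part of lc’s actual tool; every name here is hypothetical, and the `judge` callable is a stub standing in for the adversarially-robust LLM re-adjudication step described above:

```python
from dataclasses import dataclass, field

@dataclass
class Correction:
    """A published correction plus its appeal history."""
    claim: str
    status: str = "published"  # published | retracted | upheld
    appeals: list = field(default_factory=list)

def review_appeal(correction, objection, judge):
    """Re-adjudicate a published correction given a user's objection.

    `judge(claim, objection)` stands in for an LLM call that returns
    True when the objection shows the original correction was wrong.
    The verdict is reflected back into the correction's status.
    """
    correction.appeals.append(objection)
    correction.status = "retracted" if judge(correction.claim, objection) else "upheld"
    return correction.status

# Toy judge: retract whenever the objection cites a figure in the source.
toy_judge = lambda claim, objection: "figure" in objection.lower()

c = Correction(claim="Opus fakes alignment in most cases")
status = review_appeal(c, "The figure shows it actually fakes ~15% of the time", toy_judge)
```

The hard part, of course, is the real `judge`: a stub like this is trivially gameable, and the adversarial-robustness question is exactly whether an LLM judge can avoid being talked into retracting valid corrections.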
And these are the examples lc has chosen to cherrypick as a demo? Not to mention, he was already apparently wrong in an earlier deployment, not included as an example here.
Well that’s unfair; these aren’t “cherry-picks”, they’re just the first four or five corrections the tool gave. And that comment wasn’t “wrong”, it’s me screenshotting the tool’s output and asking him whether the AI was accurate, because I’m trying to test out the v0.1 of a product I coded in the last five days. Here’s another comment I left about the Japan post that led to a correction before this post went out.
they’re just the first four or five corrections the tool gave
I have no way of knowing that, and you already posted at least two you made besides OP; are you claiming that all of those are from before v0.1?
And that comment wasn’t “wrong”, it’s me screenshotting the tool’s output and asking him whether the AI was accurate
Er, yes, the correction was wrong, unless you disagree with GeneSmith and have not yet had the time to respond with a rebuttal to explain why your correction is right?
You already posted at least two you made besides OP; are you claiming that all of those are from before v0.1?
They were the first articles I got feedback for after finalizing the initial v0.1 GitHub release, two days ago, the day I wrote the post. They are representative examples. LessWrong’s draft reviewer actually pointed out that he thought one on my article was a nit, and I left it in because I wanted to be accurate about what the tool’s output was:
There’s also a public API which you can use to inspect a random sample of all of the pieces I looked at after I moved off local and started hosting api.openerrata.com. Or you can just use the tool yourself.
Er, yes, the correction was wrong, unless you disagree with GeneSmith and have not yet had the time to respond with a rebuttal to explain why your correction is right?
The comment is me asking GeneSmith if the correction from GPT-5.2 is accurate. GeneSmith replied that it wasn’t. You described that as me being “wrong”, not the output of the tool being incorrect.
They were the first articles I got feedback for after finalizing the initial v0.1 GitHub release, two days ago, the day I wrote the post.
That is cherrypicking. Those were not the “first four or five” the tool gave because you dropped the earlier ones, at least one of which was wrong. You didn’t get a random sample, but you knew the ones before your chosen point were bad. (“I can’t have p-hacked; sure, I dropped all the datapoints before the 8th experiment, but that was because they were bad and just ‘exploratory’ as I was refining the procedures and analysis. I kept all the rest! Hence, it can’t possibly be called cherrypicking.”) There’s nothing wrong with cherrypicking per se, I do it all the time. But it does mean you can’t complain if someone infers that the average result is probably worse.
I left it in because I wanted to be accurate about what the tool’s output was:
So, why didn’t you say so? In what sense is not pointing out, and forcing every reader to figure it out for themselves (if they do at all), that it is a highly dubious correction “being accurate about the tool’s output”? A responsible developer highlights failure modes and error cases. (And I highlight LLM error cases all the time! In fact, the last essay I posted about LLM writing was mostly about explicitly highlighting an LLM error.) They don’t ignore a reviewer pointing it out and quietly leave it in. The reviewer did agree that it should be included… as an instance of an error. But it should obviously have been highlighted as an error, and you didn’t do that. You just did nothing.
There’s also a public API which you can use to inspect a random sample of all of the pieces I looked at after I moved off local and started hosting api.openerrata.com.
That’s nice. I look forward to some actual information on how erroneous this tool is, how much time it would waste to deal with its errors and fix it, and so on.
Or you can just use the tool yourself.
Or you could just do it yourself, since you’re the one making this tool and unleashing it and encouraging people to use it.
And why should I? I’m not impressed at all, it’s not my job, I don’t think it’s a good idea given what you’re showing us, and I don’t feel enthusiastic about working with the current maintainer on it. And anyway, given how you want this tool to be used, I don’t have much of a choice about whether I will be ‘using’ it, do I? So I see no need to rush.
The comment is me asking GeneSmith if the correction from GPT-5.2 is accurate. GeneSmith replied that it wasn’t. You described that as me being “wrong”, not the output of the tool being incorrect.
“I didn’t actually say I thought he was wrong!” People refusing to take any responsibility for their tools when they outsource judgment to their AIs is one of the most predictable side-effects of this tool. (“Gork, is this true?”) You chose to waste GeneSmith’s time with your tool’s (wrong) correction, after putting the burden on him to reply or look like he’s ignoring valid criticism*, and now you are hiding behind grammar, while ignoring the point of why I linked it as an example: your tool is confidently wrong, repeatedly, even when you, the developer, use it.
If this is how we can expect real users to deploy this tool, I can’t say I look forward to blocking everyone using it—the way many GitHub repos have had to close pull requests or magazines close submissions.
* If you simply wanted a prototype test case, you could have DMed him privately and politely asked if he’d like to volunteer his time to your hobby; and if he said no (as he likely would have, not being interested in helping debug some LLM thing, especially given his personal circumstances), found another volunteer. You chose not to. You chose to write a public comment.
That is cherrypicking. You didn’t get a random sample, but you knew the ones before your chosen point were bad.
Dude, I gathered the screenshots at the time I did, because that’s when I began writing. The ones I generated while testing weren’t bad; they had the same hit rate as everything else. If you don’t believe me then like I said, you can refrain from using it; I promise I will never contact you about it and I apologize in advance if someone ever does.
In what sense is not pointing out, and forcing every reader to figure it out for themselves (if they do at all) that it is a highly dubious correction
You clearly didn’t read the conversation; just going to ban, go be an asshole somewhere else.
And these are the examples lc has chosen to cherrypick as a demo? Not to mention, he was already apparently wrong in an earlier deployment, not included as an example here.
I’m unimpressed by this without hearing more about how lc plans to factcheck the factcheckers, curate corrections, and not weaponize spewing more superficially authoritative AI slop all over the Internet… Don’t we have enough problems with Brandolini’s law already?
And no, that’s not it.
I have no way of knowing that, and you already posted at least two you made besides OP; are you claiming that all of those are from before v0.1?
Ok, well, I’m telling you.
Very possible; there are workflow optimizations I’m planning on making that will help prevent #2 and sort of help with #1.