However… it is apparently very easy to set up an inference server for R1 incorrectly, and if you aren’t careful about which OpenRouter providers you accept[2], you will likely get one of the “bad” ones at least some of the time.
From what I remember, I did see that some providers for R1 didn’t return illegible CoTs, but those were also the providers marked as serving a quantized R1. When I filtered to the providers that weren’t marked as quantized, I think I pretty consistently found illegible CoTs on the questions I was testing? Though there’s also some variance from other serving params: a low temperature also reduces illegible CoTs.
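(For anyone trying to reproduce this: a minimal sketch of what pinning that down looks like against OpenRouter’s chat-completions API. The `quantizations` and `allow_fallbacks` routing preferences are documented provider-routing options, but the model slug, the API-key handling, and the name of the reasoning field in the response are assumptions worth checking against the current docs.)

```python
import requests

# Sketch: sample R1 through OpenRouter while excluding quantized endpoints.
# "quantizations" and "allow_fallbacks" are documented provider-routing
# options; the model slug and the reasoning field name are assumptions.
resp = requests.post(
    "https://openrouter.ai/api/v1/chat/completions",
    headers={"Authorization": "Bearer <OPENROUTER_API_KEY>"},
    json={
        "model": "deepseek/deepseek-r1",
        "messages": [{"role": "user", "content": "<question>"}],
        "temperature": 0.6,  # lower temperature also reduces illegible CoTs
        "provider": {
            "quantizations": ["bf16", "fp16"],  # skip endpoints marked quantized
            "allow_fallbacks": False,  # error out rather than reroute silently
        },
    },
)
message = resp.json()["choices"][0]["message"]
print(message.get("reasoning"))  # the CoT, if the endpoint returns it separately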
Thanks for running this! I think this is pretty convincing re your original argument that my results were contaminated by bad inference setups, and that correspondingly my paper doesn’t substantiate some of its claims and conclusions with good evidence. In the past couple of months I’ve gotten some other feedback about the paper as well, about the grader inflating some illegibility scores for reasoning that’s legible but hard to follow (though as far as I can tell, that isn’t as big a problem as the one you point out here).
In retrospect I think I was being much shoddier than I’d like while writing it, mostly because it was my first time writing results up as a paper (as opposed to a blog post), and I subconsciously felt all the claims should feel substantial and clean. There was enough uncertainty around which providers had the right setups that the claims felt defensible to me, when obviously I should have been aiming much higher than “defensible to reviewers” when writing up results. Rather than just trusting the OpenRouter labels, I should have been running capability evaluations with different providers to see which ones were serving the model properly. I’m not entirely sure what to do with the paper.
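Concretely, the sanity check I have in mind is something like the sketch below: pin OpenRouter’s routing to one provider at a time via the documented `provider.order` option and compare accuracy on a small question set. The question format and the string-match grading here are stand-ins, not what the paper did.

```python
import requests

OPENROUTER_URL = "https://openrouter.ai/api/v1/chat/completions"

def provider_accuracy(provider: str, questions: list[dict], api_key: str) -> float:
    """Accuracy of one OpenRouter provider on a small capability eval.

    Each question is a dict like {"prompt": ..., "answer": ...} (format assumed).
    """
    correct = 0
    for q in questions:
        resp = requests.post(
            OPENROUTER_URL,
            headers={"Authorization": f"Bearer {api_key}"},
            json={
                "model": "deepseek/deepseek-r1",
                "messages": [{"role": "user", "content": q["prompt"]}],
                # Pin routing to a single provider; fail instead of falling back.
                "provider": {"order": [provider], "allow_fallbacks": False},
            },
        )
        answer = resp.json()["choices"][0]["message"]["content"]
        correct += int(q["answer"] in answer)  # crude string-match grading
    return correct / len(questions)

# A provider scoring far below the model's published benchmark numbers is
# probably misconfigured or serving a degraded checkpoint.
```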
I think some other results in the paper are interesting even with the providers problem, such as Figures 3 and 5 arguing that the models still get some use out of the illegible reasoning (and I think the Qwen results aren’t as provider-sensitive, though I’m not confident here). But had I been reporting results in a more calibrated way at the time, this should have been a mildly interesting blog post about a subset of the results. I apologize for worsening the epistemic environment by not making that choice.
Some more specific thoughts:
I agree this is evidence, but I’m not sure it’s strong evidence. For example, illegible reasoning may help the model reach the right answer while still being worse than other, more legible reasoning the model could have used. If a model can get more use out of some reasoning text than we have the capacity to monitor, that’s still pretty concerning.
Figure 5 of the paper was an attempt to look at this, by generating 100 samples for each of 100 questions and checking whether correctness tended to go up with more legible CoTs. There was very little overall correlation, though many individual questions did see performance move up or down with legibility.
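For concreteness, that analysis boils down to something like this sketch (the per-sample `(legibility, correct)` records are assumed to already exist in this shape; this is not the paper’s actual code):

```python
import numpy as np
from scipy.stats import spearmanr

def legibility_correctness_correlation(samples: dict[str, list[tuple[float, bool]]]):
    """samples[q]: (legibility_score, correct) pairs for question q,
    e.g. 100 samples for each of 100 questions (data shape assumed)."""
    per_question = {}
    for q, pairs in samples.items():
        legibility = np.array([p[0] for p in pairs], dtype=float)
        correct = np.array([p[1] for p in pairs], dtype=float)
        if legibility.std() == 0 or correct.std() == 0:
            continue  # correlation undefined if all samples agree
        rho, _ = spearmanr(legibility, correct)
        per_question[q] = rho
    # Pool every sample across questions for the overall relationship.
    flat = [p for pairs in samples.values() for p in pairs]
    overall, _ = spearmanr([p[0] for p in flat], [float(p[1]) for p in flat])
    return overall, per_question
```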
I do think we have much better evidence to go on now, however. Setting aside the arguments related to my paper, I think Bronson’s claims still hold up pretty well.