Thank you for sharing this work! I was interested in extending it by using prefill to assess whether illegible reasoning is load-bearing. However, I had some difficulty reproducing the illegibility results for R1. I also noticed that illegibility scores don’t appear to be very consistent across grader models, and may be picking up other behaviors, like Chinese tokens (even when they’re coherent in context). I think the LLM-as-judge scores are useful as a first-pass filter, but I’m concerned they aren’t precise enough to support conclusions about whether illegibility is load-bearing. I wrote a bit more about what I tried here (https://www.lesswrong.com/posts/WbP39ncim9hBsYn5t/what-counts-as-illegible-reasoning) and would welcome more discussion!
I also want to share what I tried regarding prefill strategies. (I didn’t write this up because of the concerns about LLM scores as an illegibility metric mentioned above.) My original plan was to use something like the thought anchors methodology (https://www.thought-anchors.com/), which chunks a CoT into sentences and assesses how load-bearing each sentence is. I ended up trying a different strategy: prefilling an entire CoT, generating multiple responses with the full CoT prefilled, and checking whether the model’s accuracy on the question with that CoT exceeded its average accuracy on the question with no prefill. I found that for QwQ, more illegible prefills (as graded by GPT-4o with your prompt) led to higher accuracy, but the effect went away when controlling for reasoning length. I am still very curious why this behavior occurs and would be happy to discuss more if you are interested!
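For concreteness, here is a minimal sketch of that comparison. `generate_fn` (taking a question plus an optional CoT prefill) and `is_correct_fn` are hypothetical stand-ins for whatever inference client and answer grader you have, not anything from your setup; the regression at the end is just one way to phrase "controlling for reasoning length" as a joint fit.

```python
import numpy as np

def baseline_accuracy(generate_fn, is_correct_fn, question, n_samples=16):
    # Average accuracy when the model samples its own reasoning (no prefill).
    hits = sum(is_correct_fn(question, generate_fn(question, prefill=None))
               for _ in range(n_samples))
    return hits / n_samples

def prefill_accuracy(generate_fn, is_correct_fn, question, cot, n_samples=16):
    # Accuracy when the entire CoT is prefilled, so the model only has to
    # produce the final answer that follows it.
    hits = sum(is_correct_fn(question, generate_fn(question, prefill=cot))
               for _ in range(n_samples))
    return hits / n_samples

def illegibility_effect(illegibility, cot_length, uplift):
    # Regress per-CoT accuracy uplift on the illegibility grade and the CoT
    # length jointly; the returned coefficient is illegibility's association
    # with uplift holding reasoning length fixed.
    X = np.column_stack([np.ones(len(illegibility)), illegibility, cot_length])
    coef, *_ = np.linalg.lstsq(X, np.asarray(uplift, dtype=float), rcond=None)
    return coef[1]
```

Here `uplift` is `prefill_accuracy` minus `baseline_accuracy` for each (question, CoT) pair. A positive illegibility coefficient that survived the length control would be evidence that illegibility itself is load-bearing, which is not what I saw: the raw correlation disappeared once length entered the regression.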