I agree this is evidence, but I’m not sure it’s strong evidence. For example, illegible reasoning may help the model reach the right answer, while being worse than other, more legible, reasoning the model could use. If a model could make better use of some text for reasoning than we have capacity to monitor it, that’s still pretty concerning.
Ah, I agree with this and it is what I was trying to say in the (ambiguously phrased) line you quoted from the post.
By “not helping the model reach the right answer,” I meant “not helping relative to what it could achieve if constrained to be legible (say, by rejection sampling using your judge),” rather than “not helping relative to not performing that portion of the reasoning at all (say, by early-exiting the CoT when it starts to become illegible).” While I do suspect that the latter claim is also approximately true for the Targon gibberish, that’s just a hunch and I did not attempt to test that hypothesis in this post.
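To make the two baselines concrete, here is a minimal sketch of what I have in mind. It is not code from the post: `toy_judge`, `rejection_sample_legible`, `early_exit_cot`, the 0.5 threshold, and the chunking of the CoT are all hypothetical stand-ins, and a real judge would be a model rather than a character heuristic.

```python
from typing import Callable, List, Optional


def toy_judge(text: str) -> float:
    """Hypothetical legibility score: fraction of tokens that look like plain words."""
    tokens = text.split()
    if not tokens:
        return 0.0
    wordlike = sum(1 for tok in tokens if tok.strip(".,!?").isalpha())
    return wordlike / len(tokens)


def rejection_sample_legible(
    sample_cot: Callable[[], str],
    judge: Callable[[str], float],
    threshold: float = 0.5,
    max_tries: int = 16,
) -> Optional[str]:
    """Baseline 1: resample whole CoTs until the judge accepts one as legible."""
    for _ in range(max_tries):
        cot = sample_cot()
        if judge(cot) >= threshold:
            return cot
    return None  # no legible sample found within the budget


def early_exit_cot(
    cot_chunks: List[str],
    judge: Callable[[str], float],
    threshold: float = 0.5,
) -> str:
    """Baseline 2: keep the CoT only up to the first chunk judged illegible."""
    kept: List[str] = []
    for chunk in cot_chunks:
        if judge(chunk) < threshold:
            break
        kept.append(chunk)
    return "".join(kept)
```

The claim I intended compares final-answer accuracy against the first baseline. Comparing against the second would be the separate experiment that I did not run.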
More legible reasoning traces might be more monitorable, but almost all of the Astra Fellowship empirical mentors seem to be doing CoT monitorability research now (this is based on a quick skim of their profiles), and I don’t currently believe that this broad cluster of “CoT stuff” is all that important for x-risk. I personally think that it’s actively harmful (ref. Don’t Align Agents to Evaluations of Plans), and even if it weren’t, the field would still be grossly over-investing in CoT controllability/monitorability/legibility/etc. for bad groupthink/hype/information-cascade-y reasons. Those reasons mainly track Coefficient Giving grantmakers collectively having an inertia-delayed reaction to “let’s think step by step” improving benchmark performance on math problems, which led to the unchecked proliferation of “effort level/scratchpad length” configurations into the roadmap of every lab.
And what’s the point? Even findings that lead to ~unanimous consensus amongst this crowd saying “never train on CoT” are basically ignored, with CoT flagrantly leaking into RL training corpora at least 8% of the time.
I’m not concerned at all by a model being “able to make better use of some text for reasoning than we have capacity to monitor it”. I’m utterly confused that this is even a concern and find it bizarre that this appears to be the majority view. I could not pass this viewpoint’s ideological Turing test. What universe are we in? Who is writing these papers? Does the emperor have any clothes? Modern CoT monitoring is the equivalent of going through your teenage niece’s smartphone looking for suspicious messages and assigning detention after finding “after finals lets sue the school for all the 9 pm classes hehe” in the transcripts. Maybe it makes more sense if you are the HR department of a high moral maze institution doing “employee alignment” on AIs? Maybe they expect to live in the generous movie world where every would-be bad guy dramatically announces to the audience “Watch out, here I come!”, screaming the name of their special comic book villain move?
Token soup seems totally benign, on the level of linguistic drift. It’s not devoid of information; it just has its own structure that you could learn if you tried.