There might be a canary for chain-of-thought illegibility.
Namely, we would expect that taking the CoT from any CoT model and finishing it with any other CoT model should get better results than running the weaker of the two models on its own.
We would therefore expect a roughly linear hierarchy of models based on this property.
If there is hidden information that the other model does not know to look for (i.e., not recoverable from the public structure of the language), we would instead expect cyclicity, where
A-CoT/A-answer and B-CoT/B-answer both beat A-CoT/B-answer and B-CoT/A-answer.
This should also be true for sections of CoT
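As a rough sketch of what that canary test could look like (assuming hypothetical `generate_cot`, `answer_from_cot`, and `score` wrappers around whatever inference and grading stack is available, with tasks as plain dicts):

```python
from itertools import combinations

def cross_model_scores(models, tasks, generate_cot, answer_from_cot, score):
    """Fill the matrix: rows = which model wrote the CoT, columns = which model
    finished from it. All four callables are hypothetical wrappers."""
    results = {}
    for gen in models:
        for ans in models:
            total = 0.0
            for task in tasks:
                cot = generate_cot(gen, task["prompt"])            # model `gen` thinks
                final = answer_from_cot(ans, task["prompt"], cot)  # model `ans` concludes
                total += score(final, task)
            results[(gen, ans)] = total / len(tasks)
    return results

def cyclicity_pairs(results, models):
    """Legible CoT predicts a roughly linear hierarchy: a stronger finisher helps
    on anyone's CoT. A private encoding predicts both matched cells beating
    both mixed cells for some pair of models."""
    flagged = []
    for a, b in combinations(models, 2):
        matched = (results[(a, a)], results[(b, b)])
        mixed = (results[(a, b)], results[(b, a)])
        if min(matched) > max(mixed):
            flagged.append((a, b))
    return flagged
```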
We could therefore limit the ability to hide information by only taking gradients on whether the CoT works for another reader (limiting any hidden encoding to things the model would figure out in the public answer within a few tokens anyway, and managing co-selection effects), or by only taking gradients on the answering head with a fixed CoT model, in the second case mixing CoT sources so that we get a general "finalize this CoT" model.
In CoT, the thoughts and the answer can come from different fine-tuned models, and their co-evolution can be managed.
This might be able to prohibit fully illegible chains of thought, by adding the constraint that unfinetuned models, old checkpoints, and differently trained models can all produce answers off the CoT.
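A minimal PyTorch-style sketch of the second option, assuming the CoT generators are modules with a hypothetical `generate_cot` method and that `loss_fn` scores the answering model's output on the task:

```python
import random
import torch

def train_answer_head_only(answer_model, cot_models, tasks, loss_fn,
                           steps=1000, lr=1e-5):
    """Gradients flow only into the answering model; the CoT generators are
    frozen and mixed, so the answerer must work from CoT as written rather
    than from a private code co-evolved with one particular generator."""
    for m in cot_models:
        for p in m.parameters():
            p.requires_grad_(False)          # frozen: no gradient into the CoT writer

    opt = torch.optim.AdamW(answer_model.parameters(), lr=lr)
    for _ in range(steps):
        task = random.choice(tasks)
        gen = random.choice(cot_models)      # mix CoT sources across generators
        with torch.no_grad():
            cot = gen.generate_cot(task["prompt"])   # hypothetical helper
        loss = loss_fn(answer_model, task["prompt"], cot, task["target"])
        opt.zero_grad()
        loss.backward()                      # only answer_model receives gradients
        opt.step()
    return answer_model
```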
Here is a concrete prediction: Claude 3.5 Sonnet, GPT-4o, and Gemini 2 Pro will be able to understand the topics considered in 99%+ of the chains of thought of all regular Transformers trained in 2025 and 2026 that were not deliberately trained or prompted to be harder to understand.
That is something testable.
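One possible way to test it, with a hypothetical `ask` call to a judge model and a `grade` function comparing its summary against known topic labels:

```python
def topic_legibility_rate(cots, topic_labels, judges, ask, grade):
    """For each CoT, ask a judge model what topics are being reasoned about and
    grade the summary against known topic labels. `ask(judge, prompt)` and
    `grade(summary, labels)` (returning 0 or 1) are hypothetical wrappers;
    the prediction is that the resulting rates sit above 0.99."""
    rates = {}
    for judge in judges:
        correct = 0
        for cot, labels in zip(cots, topic_labels):
            summary = ask(judge, "What topics is this reasoning about?\n\n" + cot)
            correct += grade(summary, labels)
        rates[judge] = correct / len(cots)
    return rates
```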
There might be empirical laws there, if we can check how much of the performance gains come from improved reading of the CoT versus improved writing of the CoT.
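One hedged way to operationalize that, reusing the cross-model score matrix from the earlier sketch: row averages track how useful a model's written CoT is to other readers, column averages track how well a model reads arbitrary CoT.

```python
def reading_vs_writing(results, models):
    """Row means ~ how useful a model's written CoT is, averaged over readers
    ("writing"); column means ~ how well a model finishes from arbitrary CoT
    ("reading"). Comparing checkpoints along these two axes separates the gains."""
    writing = {g: sum(results[(g, a)] for a in models) / len(models) for g in models}
    reading = {a: sum(results[(g, a)] for g in models) / len(models) for a in models}
    return writing, reading
```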
I’m having trouble understanding your suggestion, especially the second paragraph. Could you spell it out a bit more?
In CoT, you give a prompt, then you have your chain-of-thought model (model 1) generate a bunch of tokens working on the problem. When some condition occurs (either the model claiming it is done, running out of thinking tokens, or some learned balance of the two), you stop running the chain-of-thought model. You then have another model (model 2) take that chain of thought and present the answer/conclusion to the user. You can mix and match models between these two steps.
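In pseudocode, with a hypothetical `complete(model, text, max_tokens, stop)` raw-completion call (the stop condition and `<think>` tag format here are just illustrative assumptions):

```python
def run_cot_pipeline(cot_model, answer_model, prompt, complete,
                     max_think_tokens=4096):
    """Step 1: model 1 writes the chain of thought until it signals it is done
    or the thinking budget runs out. Step 2: model 2 reads that CoT and writes
    the user-facing answer."""
    cot = complete(cot_model, prompt + "\n<think>\n",
                   max_tokens=max_think_tokens, stop=["</think>"])
    answer = complete(answer_model,                          # may differ from cot_model
                      prompt + "\n<think>\n" + cot + "\n</think>\nAnswer:",
                      max_tokens=512, stop=None)
    return cot, answer
```

Stapling chains of thought from multiple generators together then just amounts to concatenating their outputs before the second call.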
In their delivery as packaged products, most providers use the same model for model 1 and model 2 (as far as is public), or variants that have been fine-tuned together.
For models with public chain of thought, you can test this mixing and matching.
You can even staple chains of thought from multiple different generators together.
Moving fully to neuralese, or to a specialized encoding for mainline (not just secret) reasoning, would mean that mixing and matching models would likely no longer work, while currently it does.