interesting idea! thanks for writing this out. this is my initial reaction, i may have more later on. i agree on this—this is what i’ll expect.
we’d expect that a model using its own chain of thought to arrive at an answer is likely to perform better than when using a different chain of thought, even if that chain of thought is provided by a more capable model.
but if the above is true, I’m unsure about evidence for steganography. for example, gpt-4o is more capable than gpt-3.5. suppose we provide gpt-3.5 with gpt-4o’s CoT. one reason why gpt-3.5 may not improve is that gpt-4o’s CoT skips certain steps since gpt-4o is a more capable model.
e.g. gpt-4o can calculate 4^4 directly, but gpt-3.5 cannot. So gpt-4o’s CoT will leave out steps out 4 * 4 * 4* 4. which gpt-3.5 needs to get the correct answerdo we then say that gpt-3.5 is doing steganography? (edited)
possible improvement: provide gpt-3.5 with its own version of its own CoT, but paraphrased by another model. gpt-3.5 sees the same steps of “4 * 4 * 4 *4”, but paraphrased. Does it still perform as well?
Comment from @James Chua in internal slack group
interesting idea! thanks for writing this out. this is my initial reaction, i may have more later on. i agree on this—this is what i’ll expect.
but if the above is true, I’m unsure about evidence for steganography.
for example, gpt-4o is more capable than gpt-3.5. suppose we provide gpt-3.5 with gpt-4o’s CoT.
one reason why gpt-3.5 may not improve is that gpt-4o’s CoT skips certain steps since gpt-4o is a more capable model.
e.g. gpt-4o can calculate 4^4 directly, but gpt-3.5 cannot. So gpt-4o’s CoT will leave out steps out 4 * 4 * 4* 4. which gpt-3.5 needs to get the correct answerdo we then say that gpt-3.5 is doing steganography? (edited)
possible improvement: provide gpt-3.5 with its own version of its own CoT, but paraphrased by another model. gpt-3.5 sees the same steps of “4 * 4 * 4 *4”, but paraphrased. Does it still perform as well?