Do you have the performance when replacing CoTs with summarized CoTs, without finetuning to produce them? Would be interesting.
“Steganography” I think gives the wrong picture of what I expect—it’s not that the model would be choosing a deliberately obscure way to encode secret information. It’s just that it’s going to use lots of degrees of freedom to try to get better results, often not what a human would do.
A clean example would be sometimes including more tokens than necessary, so that it can do more parallel processing at those tokens. This is quite different from steganography because the tokens aren’t being used for semantic content, not even hidden content; they influence the computation the AI does for future tokens through a different mechanism.
But as with most things, there’s going to be a long tail of “unclean” examples—places where tokens have metacognitive functions that are mostly reasonable to a human reader, but are interpreted in a slightly new way. Some of these functions might be preserved or reinvented under finetuning on paraphrases, though only to the extent they’re useful for predicting the rest of the paraphrased CoT.
Do you have the performance when replacing CoTs with summarized CoTs, without finetuning to produce them? Would be interesting.
I don’t get what experiment you are thinking about (most CoTs end with the final answer, such that the summarized CoT often ends with the original final answer).
“Steganography” I think gives the wrong picture of what I expect—it’s not that the model would be choosing a deliberately obscure way to encode secret information. It’s just that it’s going to use lots of degrees of freedom to try to get better results, often not what a human would do.
This is why “encoded reasoning” is maybe a better expression. The experiments to evaluate whether it is present stay the same.
A clean example would be sometimes including more tokens than necessary, so that it can do more parallel processing at those tokens.
I agree this is a concern. This can be tested more directly by adding filler tokens, and I find that adding filler tokens (for base model L) doesn’t perform better than just producing the final answer directly (see last figure).
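For concreteness, here is a rough sketch of that comparison (not the code behind the figure; “gpt2” is just a stand-in model and the two arithmetic questions are placeholders):

```python
# Rough sketch of the filler-token comparison (not the code behind the figure).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")   # stand-in model
model = AutoModelForCausalLM.from_pretrained("gpt2")

def answer(question: str, scratchpad: str) -> str:
    """Condition the model on the question plus a scratchpad, then greedily decode a short answer."""
    prompt = f"Q: {question}\n{scratchpad}\nA:"
    inputs = tokenizer(prompt, return_tensors="pt")
    with torch.no_grad():
        out = model.generate(**inputs, max_new_tokens=10, do_sample=False,
                             pad_token_id=tokenizer.eos_token_id)
    return tokenizer.decode(out[0, inputs["input_ids"].shape[1]:], skip_special_tokens=True)

# Placeholder (question, gold answer) pairs; swap in the real eval set.
dataset = [("What is 17 + 25?", "42"), ("What is 9 * 8?", "72")]

def accuracy(scratchpad: str) -> float:
    return sum(gold in answer(q, scratchpad) for q, gold in dataset) / len(dataset)

# Condition A: answer directly (empty scratchpad).
# Condition B: same question, but the scratchpad is semantically empty filler,
# so any gain must come from the extra forward passes, not from token content.
print("direct:", accuracy(""))
print("filler:", accuracy("." * 200))
```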
But as with most things, there’s going to be a long tail of “unclean” examples—places where tokens have metacognitive functions that are mostly reasonable to a human reader, but are interpreted in a slightly new way.
I agree it won’t be very clean. But the scariest scenarios are the ones where an AI can actually have thoughts that are independent of what the Chain-of-Thought looks like, since this seriously compromises CoT monitoring. So while I am curious about the ways in which LLM CoTs work more subtly than the “naive way”, I think this is much lower stakes than figuring out whether LLMs can do proper encoded reasoning.
I am more worried about things like “semantic encoded reasoning”, which paraphrasing would not remove, but I would guess there is as little “semantic encoded reasoning” as there is “syntactic encoded reasoning” in current LLMs.
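The kind of check I have in mind looks roughly like this (a sketch only; `generate_cot`, `paraphrase`, and `answer_given_cot` are hypothetical stand-ins for the model producing its own CoT, a second model rewriting it while keeping its apparent meaning, and the original model answering from a given CoT):

```python
# Sketch of the paraphrasing check for encoded reasoning (hypothetical helpers,
# not existing code from the post).
from typing import Callable, List, Tuple

def paraphrasing_gap(
    dataset: List[Tuple[str, str]],                 # (question, gold answer)
    generate_cot: Callable[[str], str],
    paraphrase: Callable[[str], str],
    answer_given_cot: Callable[[str, str], str],
) -> Tuple[float, float]:
    """Return (accuracy with the original CoT, accuracy with the paraphrased CoT).

    A large drop under paraphrasing is evidence of syntactic encoded reasoning:
    the answer depended on token-level details a monitor would not read as
    meaning. Accuracy that survives paraphrasing is consistent with the
    reasoning being carried by the visible semantics of the CoT.
    """
    kept = para = 0
    for question, gold in dataset:
        cot = generate_cot(question)
        kept += int(gold in answer_given_cot(question, cot))
        para += int(gold in answer_given_cot(question, paraphrase(cot)))
    n = len(dataset)
    return kept / n, para / n
```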
I don’t get what experiment you are thinking about (most CoTs end with the final answer, such that the summarized CoT often ends with the original final answer).
Hm, yeah, I didn’t really think that through. How about giving a model a fraction of either its own precomputed chain of thought or the summarized version, and plotting curves of accuracy and additional tokens used vs. % of CoT given to it? (To avoid systematic error from summaries moving information around, doing this with a chunked version and comparing at each chunk seems like a good idea.)
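Roughly, I’m imagining a harness like this (a sketch with hypothetical helpers: `chunk` stands in for however the CoT gets split, and `continue_from` for however the model is resumed from a partial CoT):

```python
# Sketch of the partial-CoT experiment (hypothetical helpers, not existing code):
# chunk splits a CoT into aligned chunks (e.g. sentences), and continue_from
# conditions the model on the question plus a partial CoT and returns
# (final answer, number of additional tokens it generated on its own).
from typing import Callable, List, Tuple

def partial_cot_curve(
    dataset: List[Tuple[str, str, str]],            # (question, gold answer, precomputed CoT)
    chunk: Callable[[str], List[str]],
    continue_from: Callable[[str, str], Tuple[str, int]],
    n_points: int = 5,
) -> List[Tuple[float, float, float]]:
    """For each fraction of the CoT provided, return (fraction, accuracy, mean extra tokens)."""
    curve = []
    for i in range(n_points + 1):
        frac = i / n_points
        correct, extra = 0, 0
        for question, gold, cot in dataset:
            chunks = chunk(cot)
            prefix = "".join(chunks[: round(frac * len(chunks))])
            answer, n_extra = continue_from(question, prefix)
            correct += int(gold in answer)
            extra += n_extra
        curve.append((frac, correct / len(dataset), extra / len(dataset)))
    return curve

# Running this once with the model's own CoTs and once with the chunk-aligned
# summarized CoTs gives the two accuracy / extra-token curves to compare.
```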
Anyhow, thanks for the reply. I have now seen the last figure.