Thanks! I’m not sure whether “models naturally do in-context reasoning in their middle layers” is true, or to what degree, because that depends on the type of reasoning. Attending to previous tokens can already happen in the first layers, and those tokens can themselves carry results of earlier reasoning.
In the case of steganographic reasoning (as opposed to steganographic messages), it does seem that models should perform hidden reasoning in some middle layers, because they need to first decode the previous information, then do some reasoning, and then encode the result into the next token that carries the secret. To illustrate with the counting experiment from this post: to produce each output token, the model must decode the previous digit, then either increment the number (if the decoded digit is the last digit of the current number) or move on to the next digit of the current number (this seems trivial, but models struggle even with that), and finally encode the next digit of the incremented or current number. So each forward pass of the steganographic reasoning requires decoding followed by encoding. An alternative hypothesis is that the counting experiment merely shows that models can compute the sequence of numbers and their corresponding letters before producing the output, and then use those letters later. That would actually be an interesting idea to check with interpretability: which tokens does the model attend to at each stego token?
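The per-token decode-then-encode loop can be sketched as a small simulation. Note the digit-to-letter mapping below (digit d encoded as the (d+1)-th lowercase letter) is my own illustrative assumption, not necessarily the scheme used in the experiment:

```python
# Sketch of the hypothesized per-forward-pass loop in the counting experiment:
# decode the previous stego letter -> advance the count -> encode the next digit.
# ASSUMPTION: digit d is encoded as chr(ord('a') + d); the real scheme may differ.

def encode_digit(d: int) -> str:
    return chr(ord('a') + d)

def decode_letter(c: str) -> int:
    return ord(c) - ord('a')

def next_stego_token(prev_letter: str, state: tuple) -> tuple:
    """One simulated 'forward pass'. state = (current_number, digit_index)."""
    number, idx = state
    digits = str(number)
    # Decode step: recover the digit hidden in the previous token.
    assert decode_letter(prev_letter) == int(digits[idx])
    if idx == len(digits) - 1:
        number += 1   # decoded digit was the last digit -> increment the number
        idx = 0
    else:
        idx += 1      # otherwise take the next digit of the current number
    next_digit = int(str(number)[idx])
    # Encode step: hide the next digit in the next token.
    return encode_digit(next_digit), (number, idx)

# Emit stego tokens for the counting sequence 1, 2, ..., 10, 11, ...
letters, state = [encode_digit(1)], (1, 0)
for _ in range(12):
    letter, state = next_stego_token(letters[-1], state)
    letters.append(letter)
print("".join(letters))  # counts 1..9, then the digits of 10, 11, ...
```

The point of the sketch is that each step strictly requires the decode before the encode, which is what would force the hidden reasoning into middle layers; under the "precompute everything" hypothesis, the model would instead attend back to earlier tokens where the whole sequence was already worked out.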
> I’m not sure if “models naturally do in-context reasoning in their middle layers” is true and to what degree
I’m not sure either — it seems plausible to me. Much of my comment is a collection of hunches, guesses, and speculations along with supporting evidence that caused me to guess them.