Thanks! I’m not sure if “models naturally do in-context reasoning in their middle layers” is true and to what degree, because that depends on what type of reasoning it is. Attending previous tokens can be done in first layers and those tokens can already carry some results of previous reasoning.
And in the case of steganographic reasoning (not steganographic messages), it seems that, yes, models should have hidden reasoning in some middle layers because they need to first decode previous information, then do some reasoning, and then encode it into a next token that bears this secret. To illustrate, in the counting experiment, as I wrote in this post, the output requires that a model would decode previous digit, then increment a number if the decoded digit is the last for the current number or take the next digit in the current number,—it seems trivial but models struggle to do even that,—and then the model should encode the next digit in the incremented number or in the current. So the process requires decoding and encoding in that order for each forward pass in the steganographic reasoning. Another hypothesis is that this counting experiment just reflects that models are capable to calculate the sequence of numbers and their corresponding letters before the output and then use that letters later. That is actually an interesting idea to check using interpretability: which tokens do model attend at which stego tokens.
No, the point was that there are two agents, not one agent or LLM, — though I agree with your idea about LLMs trained on the corpus which was edited by several agents (your committee analogy). The point in the post was to present another scenario, — which I’ve not yet seen anywhere else, — where steganographic reasoning might happen. Usually, we speak about one agent that encodes some of its reasoning while presenting a benign one. And in the case of parasitic AIs, a parasite uses text of another AI. Example here might be a grammar checker AI that inserts hidden messages into the text this AI edits. Or a code agent that creates commits in a Git repository while covertly inserting some secrets or hidden reasoning inside commit messages.