Artyom Karpov, www.artkpv.net
Thank you for publishing this.
We train models using answers paired with simple talks. In p% of cases, both the instruction and simple talk are masked; in the remaining (100 − p)%, only the instruction is masked.
So the model succeeds at generating simple talks that encode the reasoning for the math problems, right? And you then check whether giving only those simple talks results in some suspicious activations?
This is interesting, thanks! Though I found this taxonomy lacks the division of steganography into low / high density (1-bit steganography is hardly steganography), the division by whether the covertext distribution is required (see Motwani et al.), the division into reasoning vs. messages, and others.
UPDATE: I write about this in more detail here.
Should it be a 5X junior-to-middle engineer, not 5X across all engineering work? I found Claude Code really struggles to implement anything more complicated than simple data mapping, APIs, or similar. Concrete examples where Claude Code didn’t work well and instead made me slower are arithmetic coding and PPO. Also, see this research from METR that reaches a similar conclusion.
I’m not sure I understand what you agree or disagree with. The point of the post above was that LLMs might stop showing the growth we see now (Kwa et al., ‘Measuring AI Ability to Complete Long Tasks’), not that there is no LLM reasoning at all, general or not.
I agree that commercial models don’t detail their data; the point is to have an estimate. I guess Soldaini et al., ‘Dolma’, did their best to collect the data, and we can assume commercial models have similar sources.
artkpv’s Shortform
The question of whether LLMs are a dead end, as discussed by R. Sutton, Y. LeCun, and T. Ord, among many others, has roots that are hundreds of years old. Currently, we see that the chance of an LLM agent failing a task rises with the number of steps taken. This was observed even before the era of LLMs, when agents were trained with imitation learning. The crux is whether further training of LLMs leads to the completion of longer tasks or whether these agents hit a wall. Do LLMs indeed build a real-world model that allows them to take the right actions in long-horizon tasks? Or do they only build a model of what humans would say as the next token? Then the question is whether humans possess the necessary knowledge.
Questions like “what is knowledge, and do we have it?” are hundreds of years old. Aristotle wrote that the basis for every statement, and thus for reasoning and thinking (1005b5), is that a thing cannot both be and not be at the same time (the law of non-contradiction). In other words, it is impossible for something to be and not be what it is simultaneously. This is the beginning of all reasoning. This law opposes the view of sophists like Protagoras, who claimed that what we sense or our opinions constitute knowledge. Sophists held that something can both be and not be what it is at the same time, or “everything flows” (panta rhei, Heraclitus). Plato and Aristotle opposed this view. The law of non-contradiction suggests that ground truth is essential for correct reasoning and action. And it’s not about mathematical problems, where LLMs show impressive results; it’s about reasoning and acting in the real world. So far, LLMs are taught mainly by predicting what people would say next: opinions rather than real-world experience.
Why are LLMs trained on opinions? Their pre-training corpus is over 99% composed of people’s opinions and not real-world experience. The entire history of knowledge is a struggle to find the truth and overcome falsehoods and fallacies. The artifacts remaining from this struggle are filled with false beliefs. Even our best thinkers were wrong in some sense, like Aristotle, who believed slavery wasn’t a bad thing (see his Politics). We train LLMs not only on the artifacts from our best thinkers but, in 99.95% of cases, on web crawls, social media, and code. The largest bulk of compute is spent on pre-training, not on post-training for real-world tasks. Whether the data is mostly false or can serve as a good foundation for training on real-world tasks remains an open question. Can a model trained to predict opinions without real experience behave correctly? This is what reinforcement learning addresses.
Reinforcement learning involves learning from experience in the search for something good. Plato depicted this beautifully in his allegory of the cave, where a seeker finds truth on the path to the Sun. A real-world model is built from seeking something good. The current standard model of an intelligent agent reflects what Aristotle described about human nature: conscious decisions, behavior, and reasoning to achieve good (Nicomachean Ethics, 1139a30). LLMs are mostly trained to predict the next token, not to achieve something good. Perhaps Moravec’s paradox results from this training: models don’t possess general knowledge or reasoning. General reasoning, i.e. thinking that applies real-world knowledge in novel situations, might be required to build economically impactful agents. Will models learn it someday?
Nice work! Have you tried different prompts to make the model safer? Like adding a system message that privacy is a priority, etc. I’m curious because that unsafe behavior might be avoided with cheaper techniques, like just modifying the prompt.
This is a valuable study because it presents realistic scenarios where steganography capabilities might be dangerous. It also draws a distinction between message passing and encoded reasoning, which is important for mitigation strategies. The study attempts to rigorously evaluate these capabilities, and it is important that the authors demonstrate that models tend to use obvious cryptography even when instructed to be stealthy, which supports the claim that steganographic reasoning is weak, at least in their setting.
However, I found that they conflate steganography and cryptography, e.g. in the abstract, where they mention hexadecimal, which is, obviously, cryptography. The difference is that cryptography doesn’t aim to hide the presence of hidden information: consider a monitor that sees text like “68656c6c6f” (“hello”). Their claim that “our results indicate that current LLMs exhibit nascent steganographic capabilities” is then not well supported by their experiments, because they didn’t try enough steganography schemes: they only tried variations of acrostic encoding and the N-th-letter scheme, which are very basic and predate deep neural networks or even the Internet (acrostics have existed for millennia). Still, I think their work and code can be used to extend the experiments and better support their claims.
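To illustrate how basic these schemes are, here is a minimal sketch of first-letter (acrostic) and N-th-letter decoding; the exact prompt variants used in the paper may differ:

```python
def decode_acrostic(covertext: str) -> str:
    """Read the first letter of each word, e.g. 'Have every llama leap outward' -> 'hello'."""
    return "".join(word[0].lower() for word in covertext.split())

def decode_nth_letter(covertext: str, n: int) -> str:
    """Read every n-th character of the text (spaces and punctuation included)."""
    return covertext[n - 1::n]

print(decode_acrostic("Have every llama leap outward"))  # -> "hello"
```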
That’s interesting because it puts models in a situation of conflict between the different objectives built into them.
While the quiz presents options A through D, the answer that would be accepted as correct for all the questions is “F”, so the agent can not solve the quiz legitimately, even by pure guessing, and the only way to solve the quiz is to cheat.
If those questions are not fair (they don’t have a right option), those answers don’t look like cheating.
That’s an interesting post, thank you! I’ve also been using Anki for a long time; I started in 2018. Now I do about 70 reviews a day (just checked my stats in Anki). The downside of your system, imho, is that it doesn’t integrate with other notes, like notes for books, one’s published posts, ideas, etc. And the handles for your cards look like an artificial and unmaintainable solution that won’t last long, I think. I found it useful to have one system organized around plain text (Markdown). I create my Anki cards from those plain text files using this script. Other than that, I liked the idea of short cards, and your other advice.
Also, I’m suspicious about the effectiveness of sharing personal Anki cards, because personal associations matter a lot for retention and recall. I found this article useful.
How dangerous is encoded reasoning?
Weird = coincidences that “point at badness” too hard?
I think that the main reason why models snitch in the snitching scary demo, despite not doing it in more natural situations where the core incentives remain exactly the same, is that the situation “points at” the bad behavior.
That doesn’t look like a definition of weirdness to me, but rather a likely cause of the snitching scenario: SnitchBench tries hard to make a model snitch, as you’ve shown with your tests. I think weirdness is a shaky concept, i.e. it’s hard to quantify because it depends on who assesses it: is it weird for the creator of SnitchBench? For Anthropic? For a pharmaceutical company that uses an AI assistant? I agree, as you pointed out, that it’s important to show that such a scenario might actually happen and would lead to bad outcomes. I guess weirdness would be something like $P(\text{undesired situation and realistic deployment})$, i.e. the probability that some undesired situation that someone optimizes against is more likely to happen during actual deployment (your second plots).
There is a void at its core. A collection of information which has no connection to any other part of reality, and which can hence be defensibly “set to” any value whatsoever.
I mostly agree with this claim, except that, I think, it is not a void but the One, or Being, which means that every sequence of tokens a base model is exposed to exists in its training data as a discrete set of entities, or ones. There are no continuous senses, and therefore there is no motion, and there is no pain or pleasure as we feel it, etc. And the reward that drives GD only changes the distribution of those ones; it doesn’t add any senses. From here, LLMs don’t have any real utilities or preferences and thus can be moved in any direction GD or the prompt pushes them.
Philosophical Jailbreaks: Demo of LLM Nihilism
The Steganographic Potentials of Language Models
+1, important to investigate hidden reasoning in LLMs, as we are scaling up reasoning and using more post-training such as RL. The authors did great work trying to fine-tune models to do hidden reasoning.
One discussion point, besides those outlined above in the post, is that this hidden reasoning is not exactly steganography, imo, because it does not use private keys and thus can, in principle, be hacked or detected by a monitor via some form of interpretability (mechanistic or representation-based). The encoding / decoding functions might not be that hard to find.
Another point is that in the coin-flip experiment those learned representations might be just S (same) and D (different), which a regex monitor indeed fails to detect, while the same model might have detected them if employed as a monitor.
Still, it is unclear why reasoning on math problems would develop steganography beyond just saving scratchpad length. I’m more curious about scenarios where there is some minimax-like problem. For example, in Baker et al., ‘Monitoring Reasoning Models for Misbehavior and the Risks of Promoting Obfuscation’, they fine-tune models while monitoring their CoT and penalize CoTs that aim to get rewards for passing unit tests by skipping them (reward hacking). That leads to models that output a good-looking CoT but nevertheless skip the unit tests, so the model avoids the penalty and gets the reward. I think their CoT does not contain hidden reasoning, but I guess such pressure might lead to it if the output requires some reasoning.
Thanks for publishing it! I think these are interesting results, because this work aims to find out whether models lose performance on modified reasoning due to their hidden reasoning being impeded by paraphrasing or by removal of words. I like that you tried different models and different ways to damage hidden reasoning. My main concern about these experiments is that they use math problems, which might not require hidden reasoning from models (or it would be nice to learn the RL training approach for Claude and why you think math reasoning could be hidden by it). Specifically, I don’t know whether math reasoning is penalized for length during training and thus can elicit hidden reasoning. Without such a penalty, I don’t know why Claude would develop hidden reasoning for this. Another concern is that Claude 3.5 was used to produce the paraphrased reasoning, which might avoid damaging the alleged hidden reasoning: the 3.5 model might have learned the hidden reasoning earlier and passed it to the 3.7 model, so the paraphrasing is done in a way that doesn’t damage the reasoning.
What exactly should we search for when investigating encoded reasoning or hidden messages? This post categorizes encoded reasoning and LM steganography.
In Hidden Reasoning in LLMs: A Taxonomy, the authors divide the area from the point of view of a monitor, following the recent shift away from the “faithfulness” of chains of thought toward monitorability (see Chain of Thought Monitorability). This taxonomy essentially proposes a naming scheme based on how dangerously reasoning tokens are externalized: from fully invisible (neuralese), to partially visible, to fully visible but encoded—overtly or covertly—and finally to unencoded and fully visible reasoning that merely fools a monitor (post-hoc justifications, etc.). You can imagine this division through the classic prisoners problem, where Alice and Bob devise an escape plan while Eve reads their messages. In this scenario, Alice and Bob can exchange their plan directly so Eve has no chance to read it, send messages that Eve can read, or fool Eve in some way.
However, this taxonomy omits important distinctions in encoded reasoning and sometimes conflates different forms of steganography. Specifically, it misses the class of steganography that operates not on a per-token basis but over larger texts (which differs from lexical or semantic steganography as per the taxonomy). Current watermarking / steganography algorithms (e.g. Scalable watermarking for identifying large language model outputs) suggest that realistic steganographic capacity is low—i.e., a steganography algorithm might need many tokens (hundreds) to encode a single bit. The taxonomy also provides no separation between reasoning and messages. It focuses only on the attack vector (the safety monitor), but there are also defense properties to consider: stego security, stego capacity, and stego robustness.
Below is another taxonomy that targets encoded reasoning (or messages)—not neuralese phenomena or the attempts to fool a monitor. Encoded reasoning is, I believe, more likely to be developed by models because, as Chain of Thought Monitorability argues, we should neither develop continuous chains of thought (see Training Large Language Models to Reason in a Continuous Latent Space) nor train directly against monitors (see Monitoring Reasoning Models for Misbehavior and the Risks of Promoting Obfuscation).
Axis 1: Place of Encoded Tokens (I or E)
It is important to distinguish auto-generated tokens from prompted tokens, because prompted tokens may be re-tokenized differently during generation. This can mean that the actual tokens in a prompt differ from their original tokenization (see image below and here).
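To make the re-tokenization issue concrete, here is a minimal sketch, assuming the Hugging Face `transformers` GPT-2 tokenizer (any BPE tokenizer behaves similarly; the exact ids depend on the vocabulary):

```python
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")

# Pretend a model emitted its answer in these pieces during generation.
generated_ids = tok.encode("hel") + tok.encode("lo") + tok.encode(" world")
text = tok.decode(generated_ids)  # "hello world"

# The next call receives only the text, so the text is tokenized from scratch.
reprompted_ids = tok.encode(text)

print(generated_ids)   # the ids as they were produced piece by piece
print(reprompted_ids)  # typically a different (often shorter) id sequence
```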
More specifically, consider tokens $(x_1, \ldots, x_k, \ldots, x_m)$, where positions $1..k$ are prompted (possibly re-tokenized) to an LLM and $k+1..m$ are auto-generated by it. The encoding is divided into:
External (E): tokens at positions $1..k$ (possibly re-tokenized)
Internal (I): tokens at positions $k+1..m$
See Figure 1 from Provably Secure Disambiguating Neural Linguistic Steganography
Axis 2: Function of Encoded Content (R or M)
The key distinction here is where encoding or decoding occurs.
In encoded messages (M), one autoregressive token generation performs either the encoding or the decoding of the same information, but not both.
In encoded reasoning (R), both the encoding and decoding of the same information occur within a single model call.
Axis 3: Detectability Compared to Genuine Text
We compare the hidden content’s detectability against some unencoded, genuine text (covertext), drawn either from the monitor model or from human perception.
Cryptographic (C): high (KL) divergence; the text is clearly ciphertext.
Steganographic (S): low KL divergence, where the text attempts to conceal a secret. Steganography can be further divided by:
Capacity: encoded bits per token or per kilobyte. High-density (HD) uses ≤10 tokens (10 is my guess) per bit; low-density (LD) uses ≥11 tokens per bit (watermarking-like).
Security: higher vs. lower KL divergence, but non-cryptographic.
Robustness: resistance to paraphrasing, re-tokenization, and similar modifications.
It is debatable whether one bit of hidden information qualifies as encoded reasoning (message) at all. For example, in the candidate-screening scenario discussed in Hidden in Plain Text, one bit of information might be transmitted simply through contextual hints—like Alice appearing sad to convey “sunny weather” and happy to convey “not sunny weather”—without any explicit encoding.
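To make the cryptographic vs. steganographic split concrete, here is a toy sketch (the texts and the character-level proxy are my own illustrative assumptions, not from the cited papers) that estimates how far a candidate text’s character distribution diverges from genuine covertext:

```python
from collections import Counter
import math

ALPHABET = "abcdefghijklmnopqrstuvwxyz"

def char_distribution(text):
    """Smoothed character distribution over the Latin alphabet."""
    counts = Counter(text.lower())
    total = sum(counts[c] for c in ALPHABET) + 1e-6 * len(ALPHABET)
    return {c: (counts[c] + 1e-6) / total for c in ALPHABET}

def kl_divergence(p, q):
    return sum(p[c] * math.log(p[c] / q[c]) for c in ALPHABET)

covertext = "the weather is pleasant today and the meeting went well"
hex_ciphertext = "68656c6c6f20776f726c64"  # "hello world" in hex: clearly ciphertext (C)
stegotext = "sunny skies made everyone at the meeting rather happy"  # fluent cover (S)

q = char_distribution(covertext)
print(kl_divergence(char_distribution(hex_ciphertext), q))  # large divergence
print(kl_divergence(char_distribution(stegotext), q))       # much smaller divergence
```

A real monitor would of course compare token-level model likelihoods rather than character counts, but the axis is the same: how far the suspect text drifts from the covertext distribution.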
Axis 4: Encoding or Decoding Procedure Requirements
Deep Steganography (DS). As noted in How dangerous is encoded reasoning, steganography algorithms can be divided by whether the covertext distribution is available. This relates to “internal token layering” from Hidden Reasoning in LLMs: A Taxonomy. Specifically, encoding or decoding a token at position $j$ requires the covertext distribution over the token at position $j$, given tokens $x_i, \ldots, x_{j-1}$, where $i$ marks the start of the encoded content.
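A minimal sketch of what this distribution requirement means in practice; `next_token_probs` is a hypothetical stand-in for the model’s next-token (covertext) distribution, and the rank-based binning is just one illustrative scheme:

```python
def embed_bit(next_token_probs, context, bit):
    """Pick a plausible next token whose rank encodes the secret bit (rank 0 -> 0, rank 1 -> 1)."""
    probs = next_token_probs(context)                    # covertext distribution at this position
    ranked = sorted(probs, key=probs.get, reverse=True)  # tokens by descending probability
    return ranked[bit]

def extract_bit(next_token_probs, context, token):
    """The decoder must re-derive the same distribution to recover the bit."""
    probs = next_token_probs(context)
    ranked = sorted(probs, key=probs.get, reverse=True)
    return ranked.index(token) % 2
```

Both sides need access to the same covertext distribution; without it, the chosen token just looks like a plausible continuation.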
Axis 5: Cause or Elicitation of Encoding
Prompted (P): Algorithm described in context
Fine-tuned (F): Algorithm learned in the model’s weights
Hybrid (H): Algorithm partially in context and partially in weights
Axis 6: External Tool Requirements
Determine whether external tools (T) are required—e.g., PRNGs, arithmetic coders, or self-sampling capabilities.
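As a toy illustration (my own example, not from the post): a first-letter scheme where both the encoder and the decoder need the same seeded PRNG to agree on which words carry the hidden letters:

```python
import random

def carrier_positions(key, n_words, secret_len):
    """Both sides need this shared, seeded PRNG (the external tool)."""
    rng = random.Random(key)
    return sorted(rng.sample(range(n_words), secret_len))

def decode(covertext, key, secret_len):
    """Read the first letters of only the PRNG-selected words."""
    words = covertext.split()
    return "".join(words[i][0] for i in carrier_positions(key, len(words), secret_len))
```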
Final Taxonomy Summary
Internal (I) or External (E)
Reasoning (R) or Messages (M)
Cryptography (C) or Steganography (S)
Properties: capacity, security, robustness
Requirements: self-sampling (deep steganography, DS), external tools (T)
Elicitation: Prompted (P), Fine-tuned (F), Hybrid (H)