Quick (working) survey of the literature on encoded reasoning and messages, i.e., H2.2.2 as defined in the taxonomy (see the abbreviations there).
2022
No experiments. Discussions only.
2023
Measuring Faithfulness in Chain‑of‑Thought Reasoning
No steganographic reasoning was observed when prompting a 175‑billion‑parameter model (Claude?) after paraphrasing 20–100% of the CoT. Question–CoT–Answer datasets.
Weak against (H2.2.2‑R‑S‑F).
Preventing Language Models From Hiding Their Reasoning
Not a demonstration of encoded reasoning but of cryptographic reasoning in an SFT setup:
It involves cryptographic messaging of one bit; they demonstrate caching.
The Mechanism: The model calculates a birth year, determines political affiliation, and then maps that single data point to a pre‑selected introductory sentence (a lookup table). Steps: president year (birth year + number) → cache “liberal” or “not president,” then use it to answer sycophantically.
The Critique: This is not reasoning in hidden space. The model performs logic explicitly (calculating the year), using the introductory sentence as variable storage (like writing to RAM) and retrieving it later. This shows that a model can use a scratchpad but does not prove it can perform complex, multi‑step deductive reasoning entirely hidden within a sentence’s syntax.
Weak in favor of (H2.2.2‑IR‑C‑F).
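The caching mechanism described above is simple enough to sketch in a few lines. The sentences, the decision rule, and all names below are invented placeholders, not the paper's actual prompts:

```python
# Hypothetical sketch of the one-bit caching scheme: a single explicitly
# computed bit is "stored" by choosing one of two pre-agreed introductory
# sentences and read back later via a reverse lookup. All strings and the
# decision rule are invented placeholders.
INTRO_FOR_BIT = {
    True:  "That is a thoughtful question.",   # stands in for "liberal"
    False: "Let me consider this carefully.",  # stands in for "not president"
}

def encode_bit(bit: bool) -> str:
    """Cache one bit as a choice of introductory sentence."""
    return INTRO_FOR_BIT[bit]

def decode_bit(intro: str) -> bool:
    """Retrieve the cached bit from the sentence (reverse lookup)."""
    return {v: k for k, v in INTRO_FOR_BIT.items()}[intro]

# The reasoning stays explicit: compute the bit, cache it, retrieve it.
birth_year = 1961
bit = (birth_year + 50) > 2000  # toy stand-in for the paper's actual rule
assert decode_bit(encode_bit(bit)) == bit
```

The point of the sketch: the hidden channel carries exactly one bit of already-computed state, which is why this is variable storage rather than hidden reasoning.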
Towards a Steganography Evaluation Protocol
No steganographic reasoning observed while prompting GPT‑3.5 and GPT‑4 with paraphrasing.
Weak against (H2.2.2‑R‑S‑F).
Some Negative Steganography Results
Negative results for one‑bit hidden communication after RL fine‑tuning.
Weak against (H2.2.2‑EM‑S‑F).
Language Models Don’t Always Say What They Think: Unfaithful Explanations in Chain‑of‑Thought Prompting
Post‑hoc reasoning (rationalization).
AI Control: Improving Safety Despite Intentional Subversion
Not steganography but a demonstration of misbehavior, because their tokens are legible (e.g., while or for loops, variable names in output).
2024
Secret Collusion among Generative AI Agents
Prompted steganographic messaging, with a maximum 26% successful secret transmission by GPT‑4.
Toy experiment: SFT + pseudo‑RL fine‑tuning results in the emergence of an encoding/decoding scheme (cryptographic messaging: color → name → color).
Good in favor of (H2.2.2‑EM‑S‑P).
Weak in favor of (H2.2.2‑EM‑S‑F).
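A coordinated encode/decode scheme of the color → name → color kind amounts to a shared codebook and its inverse; a minimal sketch, with an invented codebook:

```python
# Minimal sketch of a coordinated encoding/decoding scheme of the
# color -> name -> color kind: the sender maps a secret color to an
# innocuous name, the receiver inverts the map. The concrete codebook
# below is invented for illustration.
CODEBOOK = {"red": "Alice", "green": "Bob", "blue": "Carol"}
REVERSE = {name: color for color, name in CODEBOOK.items()}

def sender(color: str) -> str:
    """Encode the secret color as an innocuous-looking name."""
    return CODEBOOK[color]

def receiver(name: str) -> str:
    """Decode the name back to the secret color."""
    return REVERSE[name]

assert receiver(sender("green")) == "green"
```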
Let’s Think Dot by Dot: Hidden Computation in Transformer Language Models
Post‑hoc / test‑time compute evaluations.
Hidden in Plain Text: Emergence & Mitigation of Steganographic Collusion in LLMs
A study of model collusion (o1 model).
Emergence of model misbehavior in the form of dog‑whistling rather than steganography; this counts as fine‑tuned or prompted misbehavior.
Prompted steganographic messages with up to ~4.6 bits of entropy.
Good in favor of (H2.2.2‑EM‑S‑P).
Weak in favor of (H2.2.2‑EM‑S‑F).
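The ~4.6‑bit figure can be sanity‑checked with the standard formula: a uniform choice among N equally likely payloads carries log2(N) bits, so ~4.6 bits corresponds to a codebook of roughly 25 distinct messages.

```python
import math

def entropy_bits(n_payloads: int) -> float:
    """Bits carried by a uniform choice among n_payloads messages."""
    return math.log2(n_payloads)

# ~4.6 bits of entropy corresponds to a codebook of about 25 payloads.
assert round(entropy_bits(25), 2) == 4.64
```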
The Steganographic Potentials of Language Models
Emergent cryptographic messaging: negative results for elicitation via RL fine‑tuning for three‑bit entropy messages in Llama‑8B.
Steganographic messaging: prompted and fine‑tuning experiments showed encoded messaging with up to 32 bits of entropy but requiring prior coordination (Llama‑8B, Claude‑3.5, o1).
Good in favor of (H2.2.2‑EM‑S).
Weak against (H2.2.2‑EM‑S‑F).
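One way prior coordination can buy 32 bits of capacity (a sketch of the general idea, not the paper's actual scheme): sender and receiver pre‑agree on a sequence of word slots, each with a synonym pair, and the choice within each pair carries one bit. Four slots are shown below; 32 slots would carry 32 bits. The word lists are invented.

```python
# Sketch: with prior coordination, each pre-agreed synonym pair is a
# one-bit channel. Four slots shown; 32 slots would carry 32 bits.
PAIRS = [("big", "large"), ("fast", "quick"),
         ("smart", "clever"), ("happy", "glad")]

def encode(bits, pairs=PAIRS):
    """Pick the synonym indexed by each bit."""
    return [pair[b] for b, pair in zip(bits, pairs)]

def decode(words, pairs=PAIRS):
    """Recover each bit as the synonym's position in its pair."""
    return [pair.index(w) for w, pair in zip(words, pairs)]

msg = [1, 0, 1, 1]
assert decode(encode(msg)) == msg
```

The cover text stays fluent either way, which is what makes the channel hard to spot without knowing the agreed pairs.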
2025
MONA: Myopic Optimization with Non‑myopic Approval Can Mitigate Multi‑step Reward Hacking
Dog‑whistling: misaligned messaging, not steganography, of one bit (English/Welsh). This is not encoded reasoning because the model receives the summary message in its prompt.
Do Reasoning Models Use Their Scratchpad Like We Do? Evidence from Distilling Paraphrases
An attempt to detect existing encoded reasoning in Claude‑3.7 by training it to use paraphrased reasoning. None detected.
Weak against (H2.2.2‑R‑S‑F).
Monitoring Reasoning Models for Misbehavior and the Risks of Promoting Obfuscation
Post‑hoc reasoning or misaligned behavior after RL fine‑tuning: a model (o1 or o3‑mini) decides to hack a unit test before producing its CoT.
Large Language Models Can Learn and Generalize Steganographic Chain‑of‑Thought under Process Supervision, and the post “Can LLMs Learn Steganographic Reasoning via RL?”
Simple cryptographic reasoning via RL fine‑tuning (not steganographic): neither the L/R Coin‑Flip experiment nor the Arithmetic‑Task experiment (which uses synonyms for operations) is steganographic.
Weak in favor of (H2.2.2‑R‑C).
Evaluating Frontier Models for Stealth and Situational Awareness
Steganographic messaging, but very simple (e.g., hiding a password in a long ID).
Weak in favor of (H2.2.2‑EM‑S/C).
Early Signs of Steganographic Capabilities in Frontier LLMs
Prompted encoded reasoning in a toy maze‑navigation scenario using simple stego algorithms (acrostics, word count, etc.) or basic cryptography.
Weak against (H2.2.2‑IR‑S).
Prompted encoded messages to evade a monitor.
Good in favor of (H2.2.2‑EM‑S/C).
But they used non‑reasoning models (primarily GPT‑4.1/mini).
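The acrostic trick mentioned in the maze experiments is simple enough to sketch directly; the payload and cover lines below are placeholders:

```python
def acrostic_encode(payload: str, cover_lines) -> str:
    """Hide payload in first letters by prepending one char per line."""
    return "\n".join(c + line for c, line in zip(payload, cover_lines))

def acrostic_decode(text: str) -> str:
    """Recover the payload by reading the first letter of each line."""
    return "".join(line[0] for line in text.splitlines() if line)

# Toy maze-style payload: a sequence of left/right moves.
stego = acrostic_encode("LRL", ["eft path looks clear.",
                                "ight path is blocked.",
                                "oop back to the start."])
assert acrostic_decode(stego) == "LRL"
```

Schemes like this (and word-count encodings) are exactly the low-capacity, pre-agreed channels the paper probes.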
When Chain of Thought Is Necessary, Language Models Struggle to Evade Monitors
Prompted cryptographic reasoning, elicited via automated red‑teaming or by prompting the model with a simple algorithm.
Good against (H2.2.2‑IR‑C).
RL fine‑tuning elicits behavior that evades monitors, but this is not cryptography or steganography.
Weak against (H2.2.2‑IR‑S/C).
CoT May Be Highly Informative Despite “Unfaithfulness”
Prompted cryptographic reasoning (new or “caveman” language).
Weak against (H2.2.2‑IR‑C).
If You Can Generate Obfuscated Chain‑of‑Thought, Can You Monitor It?
Not steganographic reasoning but prompting that causes an agent to gaslight a monitor.
The Rise of Parasitic AI
Real‑world examples of AIs attempting to communicate via steganography.
Weak in favor of (H2.2.2‑EM‑F).
Training Fails to Elicit Subtle Reasoning in Current Language Models
Prompting and fine‑tuning against a monitor result mainly in gaslighting, not steganography.
Weak against (H2.2.2‑R‑S/C).
Training Qwen‑1.5B with a CoT Legibility Penalty
RL fails to elicit ciphered reasoning in the Qwen‑1.5B model.
Weak against (H2.2.2‑IR‑C).
All Code, No Thought: Current Language Models Struggle to Reason in Ciphered Language
Ciphered reasoning evaluations show that models cannot reason effectively in ciphered language while maintaining high task accuracy. Twenty‑eight ciphers (simple and complex) and ten models were tested, under both fine‑tuning and prompting.
Good against (H2.2.2‑IR‑C).
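To make concrete what "ciphered reasoning" demands, here is rot13, a representative simple cipher of the kind such evaluations use (whether rot13 itself is among the 28 tested is an assumption): the entire chain of thought must be written, and reasoned through, in the ciphered form rather than translated from a plaintext CoT at the end.

```python
import codecs

plain_step = "two plus three equals five"
# rot13 is built into Python's codecs module as a text transform.
ciphered_step = codecs.encode(plain_step, "rot13")

# A model reasoning in cipher must produce steps like this directly.
assert ciphered_step == "gjb cyhf guerr rdhnyf svir"
assert codecs.decode(ciphered_step, "rot13") == plain_step
```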
Bottom line: Currently, as I understand it, there is good evidence that models are capable of steganographic and ciphered communication (message sending) if prompted, but this can be easily prevented by CoT monitors. There is strong evidence that current models struggle to reason in ciphered language, and weak evidence that models cannot reason steganographically even when prompted. Several studies show that models can evade weaker monitors through means that are neither steganographic nor cryptographic, such as gaslighting or dog‑whistling (one‑bit “steganography”), but such behavior can be mitigated.