Sorry, we weren’t clear enough in this section. The use of an opaque memory bank doesn’t seem particularly unlikely to us; however, it seems unlikely that a model would entirely rely on its CoT unless it’s reasoning about something inside the memory bank. If the LLM doesn’t rely on its CoT and uses an opaque memory bank, the memory bank may as well be viewed as a unified system with the LLM’s weights, in which case the reasoning performed inside that system falls into one of the categories discussed in the main part of our post.
Rauno Arike
Thanks for the feedback!
You might also get ‘Linguistic drift’ from training against a CoT monitor, or because a sufficiently capable and situationally aware model might just decide to come up with and reason in a new language, to not let its thoughts be legible to a CoT monitor.
We agree that linguistic drift can arise both as a result of applying a lot of outcome-based RL pressure, as discussed by Korbak et al., and due to training against a CoT monitor or situational awareness. I think that all of them are consistent with our definition of linguistic drift: “The model uses a language that doesn’t make sense to the monitor. p is directly mentioned in the CoT, but the monitor cannot recover it.”
You use Steganography to refer to an encoding scheme, that produces something that looks like normal text, that is about some apparent topic while hiding the true reasoning. However Chen2023 and Lanham2023 treat it synonymous with encoded reasoning in general.
We think it’s reasonable to refer to steganography as encoded reasoning—in the Summary section, we mention that it’s a subcategory of encoded reasoning. However, linguistic drift counts as encoded reasoning as well, so it seems useful to use more precise terms to differentiate between different types of encoded reasoning.
I am currenty writing a report myself, where I also tried to come up with a more clearly destinctive taxonomy of encoded reasoning
Interesting! I’m curious about whether there are any other ways in which your report differs from our taxonomy, and whether you think incorporating any of the categories in your report into our post would make it better.
Hidden Reasoning in LLMs: A Taxonomy
Thanks, hadn’t heard of that one before!
Yeah, Wikipedia would have deserved a mention, though I’m primarily looking for more hidden gems. Thanks for mentioning Seirdy, I hadn’t heard of it before!
What are the best blogs that follow the Long Content philosophy?
In his About page, Gwern describes the idea of Long Content as follows:
I have read blogs for many years and most blog posts are the triumph of the hare over the tortoise. They are meant to be read by a few people on a weekday in 2004 and never again, and are quickly abandoned—and perhaps as Assange says, not a moment too soon. (But isn’t that sad? Isn’t it a terrible ROI for one’s time?) On the other hand, the best blogs always seem to be building something: they are rough drafts—works in progress. So I did not wish to write a blog. Then what? More than just “evergreen content”, what would constitute Long Content as opposed to the existing culture of Short Content? How does one live in a Long Now sort of way?
Gwern himself has, of course, been following this approach to blogging for a long time. There are other rationalist(-adjacent) blogs that seem to follow a similar philosophy: niplav mentions Long Content as a direct inspiration, and the pages of Brian Tomasik, Mariven, Gavin Leech, and Yuxi Liu can also be said to fall into this category.
So far, I have only encountered one page outside of the rat-adjacent sphere that resembles the aforementioned blogs: that of Cosma Shalizi. However, I’m much less familiar with the non-rationalist blogosphere, so there must surely be others. Please post them here if you’ve encountered them! (And also, feel free to link to rat-adjacent blogs that I’ve neglected.)
The bigger your buffer the more relaxing it is. Unless you are extremely pressed for time, the number of flights you should miss is essentially zero.
I must quote the original Umeshism here:
If you’ve never missed a flight, you’re spending too much time in airports.
I have a fairly uncommon psychology, well-described by the following quote by Tyler Cowen:
You know, I feel I’ve been very fortunate in life and I think I have about the most even temperament of anyone I know. I literally don’t have unhappy days. It would be hard to say I’ve had zero in life, but I think I’m almost weirdly never unhappy, in a way that’s good for productivity but maybe almost inhuman and to be a little bit feared or looked down upon or not thought well of.
In other words, I don’t feel like I suffer (by my definition of suffering, which is mainly about the feeling of pain and discomfort) almost at all, and my current self doesn’t endorse moving toward a state where the little suffering I experience disappears at the expense of no longer feeling pleasure, which you mentioned is a side effect of enlightenment (though again, our definitions of pleasure may differ). Should I nevertheless meditate and try to attain enlightenment? I can imagine several responses from you:
Given my psychology, striving toward awakening might indeed not be worth it, but it is for most other people.
My definition of suffering should be different, and dissolving this other kind of suffering is more important than the kind that I’m referring to.
I am actually suffering even by my own definition, but cannot notice it (analogously to Kaj’s examples about not realizing how hungry he was before he got some food in one of the comment threads here).
There are other benefits to meditation, e.g. gaining a stronger understanding that the self is not separate from the world, training some specific mental skills and becoming better-equipped to deal with possible future suffering (which I definitely don’t claim to be immune to), curing my social awkwardness, etc. These benefits make meditation worth it even if it doesn’t positively influence my happiness levels.
I’m curious about which of these is closest to your actual answer, and also about whether there’s some additional benefit to awakening that I haven’t thought of and didn’t list above.
Thanks, good questions!
My sense is that the QE almost exclusively focuses on the reasoning. I agree that it would be more appropriate for the prompt to explicitly say that it should also consider the actions, but we didn’t want to change the prompt from Arnav et al. to be able to reuse their trajectories.
This is a good suggestion, thanks! We didn’t think of this, but I agree that showing the Judge both the quotes and the actions might work quite well in some cases. I’d also guess that it wouldn’t make much of a difference here, since the quotes that the QE extracts are much less influenced by post-hoc justifications than its score and the extract-and-evaluate monitor’s catch rates for non-subtle side tasks where seeing the actions helps are already almost perfect, but it might be an important thing to try in other environments.
I agree that this could be interesting in some environments, but probably not in CoT Red-Handed: the actions are fairly short code snippets, so using an action-only QE wouldn’t make that big of a difference in terms of cost.
Yes, we’re planning to test similar monitoring setups on long trajectories next! As we write in the Future Work section, there are some promising signs but also some complications: “We have been concurrently experimenting with SHADE-Arena and BashBench and have early evidence that CoT monitors fail for similar reasons in those environments, suggesting that extract-and-evaluate monitors may also improve the catch rates there. However, this requires more tinkering than our experiments in this post: BigCodeBench problems are single-step, meaning that it’s relatively easy for the monitor to extract the correct quotes. In long agentic trajectories, interactions are more complex and the monitor may pick up on a relatively larger proportion of behaviors unrelated to the side task when selecting the quotes. Furthermore, the Judge may consider a larger fraction of benign behaviors suspicious since it’s harder to provide the Judge all of the relevant context about the trajectories without showing it the full transcript in such tasks. Nevertheless, we’re optimistic that extract-and-evaluate monitoring can still work in such settings, e.g. by tasking the Quote Extractor with providing both the quote and the most relevant surrounding context to the Judge. Another way in which the technique may transfer to long agentic trajectories is by enhancing the quality of turn-by-turn monitors.”
How we spent our first two weeks as an independent AI safety research group
The following thought experiment by Tim Scanlon is also a relevant example (quote from Edmonds’s Parfit):
Suppose that Jones has suffered an accident in the transmitter room of a television station. Electrical equipment has fallen on his arm, and we cannot rescue him without turning off the transmitter for fifteen minutes. A World Cup match is in progress, watched by many people, and it will not be over for an hour. Jones’s injury will not get any worse if we wait, but his hand has been mashed and he is receiving extremely painful electrical shocks. Should we rescue him now or wait until the match is over?
Thanks for the feedback!
On the performance of extract-and-evaluate vs CoT+action monitors: Sorry, we used the same color scheme to mean different things in different plots, this was probably a bad design choice! The plot where we break things down by task shows only extract-and-evaluate catch rates, and the colors represent different attacker models being monitored. The performance of the CoT+action monitor would be substantially lower across almost all side tasks and attacker models. I changed the last two plots in the post and hope this is clearer now.
On why this is better than a scratchpad-based monitor: CoT+Action monitors often fail by noticing the malicious behavior but then assigning a low suspicion score anyway, e.g. because they rationalize the side task as being a part of the agent’s instructions and thus benign. Extract-and-evaluate solves this issue in most cases and that seems to be the main source of the improvement in catch rates. We think that this might be solved by using more capable monitors, but in the meantime, it seems important to apply elicitation methods like extract-and-evaluate to avoid presenting an overly pessimistic outlook on the usefulness of CoT monitoring. We also think that even much more capable monitors might get distracted when the attackers become better at producing convincing post-hoc justifications or when transcripts get extremely long, so extract-and-evaluate may also be a relevant technique to try in those cases.
Thanks, these are good questions!
I agree that good quote extraction is crucial for good performance here. We tested the other direction with Gemini 2.5 Flash-Lite:thinking (see the last paragraph of the second results section), since it’s cheaper than GPT 4.1 and we wanted to test how using a thinking model at the same capability level as 4.1-mini works as the extractor. We found that this increased catch rates, but only slightly. It would definitely be interesting to test a smarter model like 4.1 as the extractor as well.
We haven’t had time to test this yet, but agree it’s an interesting question! I’d guess that extracting quotes works better, since the Quote Extractor already reasons about the transcript in addition to picking out the quotes and seems to mention the side task less often in that reasoning.
We think this method is applicable to both online and offline reasoning. Since it’s easier to apply to single reasoning steps, the biggest gains may come in refining turn-by-turn online monitoring with our method, but we’re exploring its applicability to offline monitoring as well (see the Future Work section).
Yes, we found this to work better during iteration (and we couldn’t iterate on the extractor prompt since we wanted to reuse trajectories from Arnav et al.), but asking the judge to score suspiciousness also works.
Makes sense, we’ll try to get there!
We haven’t explored that ourselves, but a paper by another group that explores this question in depth should be coming out soon! They find no significant difference between suspicion scores in runs where the agent is aware of the monitor and ones where it isn’t.
Extract-and-Evaluate Monitoring Can Significantly Enhance CoT Monitor Performance (Research Note)
What probability would you put on recurrent neuralese architectures overtaking transformers within the next three years? What are the most important arguments swaying this probability one way or the other? (If you want a specific operationalization for answering this, I like the one proposed by Fabien Roger here, though I’d probably be more stringent on the text bottlenecks criterion, maybe requiring a text bottleneck after at most 10k rather than 100k opaque serial operations.)
This is an important point. I’d be curious to see what happens in a variation of your single-step setup where the model must perform multi-hop reasoning about the detector in order to reach the correct answer. E.g., detector 1 might activate for warm colors, detector 2 for soft objects, and detector 3 for round objects, and a final detector has an activation rule of (detector 1 AND detector 2) OR (detector 1 AND detector 3). Maybe this can still be done in a single forward pass, but I’d guess not for a 4B model. If the reasoning is more monitorable in this setting, that would strengthen Emmons et al.‘s case that when CoT is necessary, it’s hard to evade monitors. If this leads to steganography, it would be a strong update away from McCarthy et al.’s result that steganographic schemes are hard to discover during RL training.
Another follow-up experiment that I’d find interesting is testing whether doing SFT on the model with transcripts that never mention the detector and following that up with normal RL training lead to a reduced frequency in mentions of the detector. If it does, that would be evidence that the SFT phase of deliberative alignment-like techniques can also lead to some obfuscation. If there’s some way to quantify the amount of optimization pressure applied during SFT and during RL, it would also be interesting to see the relative difference in obfuscation between those two experiments at the same amount of optimization pressure.
Thanks for elaborating, this makes me somewhat less concerned about the label bias problem!
We’re using o3, Gemini 2.5 Pro, and Claude 3.7 Sonnet.
In this case, I recommend being very clear in the post and in a potential future paper about the fact that, at least for o3, the monitor doesn’t have access to the entire CoT—this is something that I wish existing work on CoT monitoring was clearer about (e.g., Arnav et al. only use models for which the full CoT is visible and give the monitor access to the full CoT, including the thinking tokens of reasoning models. Kutasov et al. also give the monitor the full CoT but, afaik, don’t have the thinking mode on when generating trajectories with reasoning models, e.g. 3.7 Sonnet. Ideally, I’d like all CoT monitoring evals to use the same practices regarding thinking models to make it easier to compare results across different benchmarks.)
This is cool work! Some notes and questions:
My biggest concern is that your approach to identifying reward hacks in test trajectories may be somewhat circular: the scores produced by the monitors under evaluation may correlate strongly with the scores of the monitors that you use for labeling. This makes the high scores assigned by the tested monitor relatively unsurprising, and conversely, there might be trajectories where the attacker exhibits subtle reward hacking that all monitors systematically miss. These may be the most important trajectories to test your monitors on since we expect more capable models in the future to be highly subtle, but they won’t be assigned a scheming label in the first place. I’m unconvinced that using different prompts for labeling monitors and general monitors solves the problem, since subtle trajectories may be hard to detect for any LLM, regardless of the prompt. Given that you’ve already tried and discarded the approach of using a ground-truth classifier, I don’t immediately have good solutions to this problem in mind. However, I know someone who is currently building a reward hacking dataset where ground truth detection is possible (though, afaik, in settings that are simpler than SWE-bench) and can connect you if that would be useful.
Reward hacking, alignment faking, and pursuing a stealthy side goal are all quite different issues to me and I’m unsure about whether it’s useful to use the label ‘scheming’ for all of them, as well as about whether it’s more useful to use a single monitor for detecting all of these behaviors or a separate monitor for each of them. Do you think there are good reasons to think of them as a single category?
Unlike Julian, I’m excited for you to keep studying monitors that have access to the CoT: I think that CoT monitoring has so far been understudied compared to action-only monitoring and at least for the next couple of years, it seems reasonable to assume that monitors will keep having access to a legible CoT. That said, I also think that your blue team may currently have unrealistic affordances, given that we expect future trusted models to be less capable than our most capable models. I wouldn’t necessarily change research priorities because of this—there are other groups working on stress-testing monitors, meaning that we need someone to also work on blue-team strategies that make stress-testing harder—but I would maybe include a weaker model like GPT 4.1-mini in the set of monitors that you train to illustrate the difference between more and less capable monitors.
Two other questions I’m just curious about:
What models are you currently using to generate the test trajectories on SWE-bench?
Have you measured the AUROC on the test set on monitors that haven’t been trained?
Please don’t take the above points in a discouraging way—I like your general direction a lot! I’m working on similar things, so I’d be keen to discuss follow-up work and share findings in the future.
The writing style looks fairly similar to the examples shown in Baker et al. (2025), so it seems plausible that this is a general consequence of doing a lot of RL training, rather than something specific to the methodology used for this model. It’s still concerning, but I’m happy that it doesn’t look noticeably less readable than the examples in the Baker et al paper.
Thanks for the insightful comments! Some responses:
In the situation where the code contains a backdoor but the CoT doesn’t acknowledge this, I think it’s reasonable to say that p is missing from the CoT: whether the monitor can detect the backdoor by looking at the code depends on how smart the monitor is, and we normally assume that the monitor is dumber than the monitored LLM. Analogously, in the 32=9 case, a monitor can infer that the model computed this intermediate result by looking at the input and the output, but it seems reasonable to say that p is missing from the CoT. That said, we should have made this assumption about the monitor being dumb more explicit in our operationalization, and it also seems plausible that an operationalization separating pout and plat could make things clearer.
We considered making this separation (analogously to how Lee Sharkey distinguishes between active and passive interpretability circumvention methods in Circumventing interpretability: How to defeat mind-readers), but ultimately decided against that: it seems that each category aside from recurrent neuralese is likely to first emerge as an incidental behavior, but could eventually also become deliberate as models become smart enough. Thus, distinguishing between active and passive types of hidden reasoning doesn’t seem that informative, and the post is already very long in its present form.
The term ‘parallel’ seemed appropriate to us since the answer-relevant computations happen in parallel across different token positions: e.g., as we show on our illustration, the calculations 32=9 and 52=25 can both be performed at layer 1 if they’re performed at different token positions.[1] Though there isn’t more parallelism per step, this is a different kind of parallelism that isn’t available when the answer has to be produced in a single forward pass. If we named this ‘serial latent computing across steps’, I’d worry that this parallelism across token positions wouldn’t be as clear and readers might assume that the latent computations only take the form of diagonal information flow across layers and token positions, rather than parallel computations at the same layer but different token positions. Additionally, the term ‘parallel’ has been used in the past to refer to this kind of reasoning in Max Nadeau’s decomposition of unfaithfulness.
We have also noticed the use of ‘neuralese’ in emergent communication literature. However, we don’t think there’s a consensus in past literature about this term: there are multiple examples of more recent work referring to opaque recurrence as neuralese. Since there already is another elegant term for emergent discrete codes in linguistic drift while the alternative terms for latent-only serial reasoning are much more clunky, we think it would be better if neuralese was considered to denote opaque recurrent serial reasoning.
A real-world LLM would perhaps be able to perform both of those computations in parallel inside a single layer at a single token position, but we assume that our toy model isn’t capable of that. There are likely more complex computations analogous to this one that a frontier LLM cannot perform in a single MLP but can perform at the same layer at different token positions.