yeedrag

Karma: 107

Astra Research Fellow, working on CoT Monitorability.

The current SOTA model was released without safety evals

Parv Mahajan and yeedrag

8 Mar 2026 1:51 UTC

110 points

12 comments · 5 min read · LW link

yeedrag 9 Feb 2026 20:36 UTC
5 points
2
in reply to: sam’s comment on: sam’s Shortform
Anthropic has summarized its CoTs starting with the Claude 4 models:
> With extended thinking enabled, the Messages API for Claude 4 models returns a summary of Claude’s full thinking process. Summarized thinking provides the full intelligence benefits of extended thinking, while preventing misuse.
Source: https://platform.claude.com/docs/en/build-with-claude/extended-thinking
As a side note, it seems that frontier models have suddenly solved the issue of alien language in their reasoning? In Apollo’s paper, they mentioned that o3 often uses phrases like “disclaim illusion watchers”, but I haven’t heard of similar issues with their newer models (or with models from other labs). I’d be very interested to know whether they simply decided to train against the CoT to make it more legible, or whether there is some other method for producing legible CoT.