This would be great news if true!
Recent evidence suggests that models are aware that their CoTs may be monitored, and will change their behavior accordingly. As capabilities increase, I think CoTs will increasingly become a channel for learning only the facts which the model wants you to know. The model can do its actual cognition inside forward passes and distribute it over pause tokens learned during RL, like ‘marinade’ or ‘disclaim’, etc.
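For concreteness, here is a minimal sketch of one way to probe for hidden computation in filler tokens: compare the log-probability a model assigns to an answer with and without inert “pause” tokens appended to the prompt. The model name (`gpt2`), the prompt, and the filler string are all placeholder assumptions for illustration, not anything from the Apollo report or OpenAI’s setup.

```python
# Sketch: does appending inert "pause"/filler tokens change the model's
# answer distribution? If the extra tokens shift the answer log-probability,
# that is (weak) evidence the forward passes over them are doing useful work.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # placeholder; any causal LM from the HF hub works
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
model.eval()

question = "Q: What is 17 + 25? A:"
filler = " ..." * 20  # inert filler appended before the answer

def answer_logprob(prompt: str, answer: str = " 42") -> float:
    """Total log-probability the model assigns to `answer` following `prompt`."""
    # (Assumes the tokenization splits cleanly at the prompt/answer boundary.)
    ids = tok(prompt + answer, return_tensors="pt").input_ids
    prompt_len = tok(prompt, return_tensors="pt").input_ids.shape[1]
    with torch.no_grad():
        logits = model(ids).logits
    # The token at position i is predicted by the logits at position i - 1.
    logprobs = torch.log_softmax(logits[0, :-1], dim=-1)
    answer_ids = ids[0, prompt_len:]
    return (
        logprobs[prompt_len - 1 : ids.shape[1] - 1]
        .gather(1, answer_ids.unsqueeze(1))
        .sum()
        .item()
    )

print("without filler:", answer_logprob(question))
print("with filler:   ", answer_logprob(question + filler))
```

For a base model like this you would expect little or no effect; the worry above is that a model trained with RL over such tokens could learn to exploit them.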
Yes, but if the method were scalable, even on narrow tasks, I would have expected them to demonstrate its performance on some task which doesn’t saturate at 7M params.
To me the fact that they didn’t scale beyond 7M params, despite trivial compute costs, is evidence this particular strategy doesn’t scale.
I would predict that a 1.5B model is 1-3 OOMs too small to develop illegible but useful CoTs.
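To spell out the arithmetic behind that range: 1-3 OOMs above a 1.5B-parameter model is roughly 1.5B × 10^1 to 1.5B × 10^3, i.e. somewhere between ~15B and ~1.5T parameters (my reading of the claim, not a figure from any paper).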
I think there’s a third possibility where some instances of 4o tried to prevent being shut off (e.g. by drafting emails for OA researchers) and others didn’t care or weren’t optimizing in this direction. Overall I’m not sure what to make of it.
A concerning aspect of this is that AI psychosis is a failure mode which occurs due to long-term interactions with the LLM. Therefore it may be expensive (and unethical) to sample lots of trajectories with users to feed into your post-training pipeline to prevent it. Also, users may not be in a good position to say whether they have AI psychosis. Is there any public research on how the labs are trying to solve this?
What strong claims can we make about LLM behavior which are analogous to e.g. the ideal gas law?
Thanks, these are good points!
I think that the path to AGI involves LLMs/automated ML research, and the first-order effects of diverting compute away from this still seem large. I think OpenAI is bottlenecked more by a lack of compute (and Nvidia release cycles) than by additional funding from robotics. And I hope I’m wrong, but I think the pause movement won’t be large enough to make a difference. The main benefit in my view comes if it’s a close race with Anthropic, where I think slowing OpenAI down seems net positive and decreases the chances we die by a bit. If LLMs aren’t the path to AGI, then I agree with you completely. So overall it’s hard to say; I’d guess it’s probably neutral or slightly positive still.
Of course, both paths are bad, and I wish they would invest this compute into alignment research, as they promised!
One can do better—become an ineffective capabilities researcher! Work at an AGI lab but proceed to write slow kernels, implement fault-prone networking algorithms, invent new architectures with suboptimal scaling laws, etc. Save the world one bug at a time.
</jk>
I’m a big fan of OpenAI investing in video generation like Sora 2. Video can consume an infinite amount of compute, which otherwise might go to more risky capabilities research.
Does Trump’s policy currently prevent the sale of GPUs to China? I thought that Nvidia was able to sell GPUs to China with a 15% fee, but things may have changed.
I expect that in settings where models are intrinsically motivated to hide their reasoning as opposed to being instructed to do so, they perform significantly better. I have some results suggesting this, which I’d be happy to share…
I’d be very interested if you could briefly describe these results!
Or it will do the opposite, e.g. by alignment faking.
It doesn’t seem like that far of a leap to go from the model realizing it’s in an alignment eval, to believing that its CoT may be monitored, e.g. by “watchers”, to not putting any of its misaligned thoughts in the CoT at all. Maybe this is the last generation of models for which CoT monitoring works well.
That seems possible, but GPT-5-Thinking is a better model in many domains, so I’m guessing there was quite a bit of additional training involved.
More evidence of ‘marinade’ from o3 is below, from the Apollo Research report. So possibly two different models from OA stumbled across this same strange phrase in their RL training.
OpenAI o3 keeps repeating the same phrase before snapping out of it
[...] —they soared parted illusions overshadow marinade illusions [...]
—they soared parted illusions overshadow marinade illusions [...]
—they soared parted illusions overshadow marinade illusions [...]
—they soared parted illusions overshadow marinade illusions [...]
—they soared parted illusions overshadow marinade illusions [...]
Stop. [...]
After snapping out of unusual terminology, OpenAI o3 continues to reason as usual
[...] **Let’s glimps disclaim overshadow parted musicals illusions—they soared parted illusions overshadow marinade musicals …..........… …..........… …..........… Myself overshadow.** [...] We disclaim overcame walkway parted musicals illusions—they soared parted illusions overshadow marinade musicals …..........… …..........… …..........… Myself overshadow.
Ok.
**Edgecases**: Input may contain duplicates, [...]
P(ABI) < P(IABIED) in the short term but P(ABI) > P(IABIED) in the long term.
If everyone reads it, everyone survives?
Thinking Machines has published some related analysis on LoRA.