Indeed! This is most of why I’m happy about this—from internal discussions it’s long been speculated (since ’23 when I was talking about it, and probably before) that training away the bad thoughts from the CoT would result in obfuscated bad thoughts. It’s cool that they actually did the experiments to prove it, and in a real-world setting too! But it’s even cooler that they published this with a strong recommendation attached.
Now we see how long this recommendation holds up under the pressure of incentives.
Sometime in the next few years probably, various researchers will discover:
* That if you scale up RL by additional OOMs, the CoTs evolve into some alien optimized language for efficiency reasons
* That you can train models to think in neuralese of some sort (e.g. with recurrence, or at least with higher-dimensional outputs beyond tokens) to boost performance.
Then the executives of the companies will face a choice: abandon the faithful-CoT golden era, or fall behind competitors. (Or the secret third option: coordinate with each other & the government to make sure everyone who matters, all the big players at least, sticks to faithful CoT.)
I have insufficient faith in them to think they'll go for the third option, since that's a lot of work and requires being friends again, and possibly regulation. Given that, I expect a race to the bottom, with most or all of them going for the first option.
Earlier, you wrote about a change to your AGI timelines.
What about p(doom)? It seems that in recent months there have been reasons for both optimism and pessimism.
I haven’t tried to calculate it recently. I still feel rather pessimistic for all the usual reasons. This paper from OpenAI was a positive update; Vance’s strong anti-AI-safety stance was a negative update. There have been various other updates besides.