Indeed, I think the faithfulness properties that we currently have are not as strong as they should be ideally, and moreover they will probably degrade as models get more capable! I don’t expect them to still be there when we hit ASI, that’s for sure. But what we have now is a lot better than nothing & it’s a great testbed for alignment research! And some of that research can be devoted to strengthening / adding more faithfulness properties!
Pushing back a bit, though, on what you said: I think that current reasoning models really are using the CoT to reason. That is, they really are “thinking aloud,” to draw an analogy to humans. Imagine a child who, when learning basic math, was taught to do it by counting aloud on their fingers. And they got really good at doing it that way. It’ll only take a little bit of additional training to get them to do it purely in their head, but so long as you haven’t trained them that way, you might be able to mostly tell what mathematical operations they are doing just by watching their fingers twitch!
This is different from a model which is trained to do long chains of reasoning in latent space & then told to write english text summarizing the reasoning. Completely different.
Sure, but “thinking out loud” isn’t the whole picture; there’s always a tonne of cognition going on before words leave the lips, and I guess it’s also gonna depend on how early in its training process it learns to “count on its fingers”. If it’s just taking cGPT and then adding a bunch of “count on your fingers” training, it’s gonna be thinking “Well, I can solve complex Navier-Stokes problems in my head faster than you can flick your mouse to scroll down to the answer, but FINE, I’LL COUNT ON MY FINGERS.”
The fastest route to solving a complex problem and showing your work is often to just show the work you’re doing anyway. That’s what teachers are going for when they demand it. If you had some reason for making up fake work instead you could. But you’d need a reason.
Here it may be relevant that some of my friends did make up fake work when they used shortcut techniques to guess the answer in algebra.
Sure, it would be better to have a better alignment strategy. But there are no plausible routes I know of to getting people to stop developing LLMs and LLM agents. So attempts at training for faithful CoT seem better than not trying.
So I think we should really try to get into specifics. If there are convincing reasons (either theoretical or empirical) to think the real cognition is done outside of the CoT, that would keep people from trusting the CoT when they shouldn’t. Raising the possibility of fake CoTs without specific arguments for why they’d be outright deceptive is far less compelling, and probably not enough to change the direction of progress.