Shayne O'Neill comments on OpenAI: Detecting misbehavior in frontier reasoning models

Shayne O'Neill 11 Mar 2025 2:52 UTC
LW: 6 AF: 3
−1
AF
I have a little bit of skepticism on the idea of using COT reasoning for interpretability. If you really look into what COT is doing, its not actually doing much a regular model doesnt already do, its just optimized for a particular prompt that basically says “Show me your reasoning”. The problem is, we still have to trust that its being truthful in its reasoning. It still isn’t accounting for those hidden states , the ‘subconscious’, to use a somewhat flawed analogy

We are still relying on trusting an entity that we dont know if we can actually trust to tell us if its trustworthy, and as far as ethical judgements go, that seems a little tautological.

As an analogy, we might ask a child to show their work when doing a simple maths problem ,but it wont tell us much about the childs intuitions about the math.
- ErickBall 12 Mar 2025 2:30 UTC
  15 points
  7
  Parent
  There are big classes of problems that provably can’t be solved in a forward pass. Sure, for something where it knows the answer instantly the chain of thought could be just for show. But for anything difficult, the models need the chain of thought to get the answer, so the CoT must contain information about their reasoning process. It can be obfuscated, but it’s still in there.
- Daniel Kokotajlo 11 Mar 2025 2:56 UTC
  LW: 8 AF: 5
  5
  AF Parent
  Indeed, I think the faithfulness properties that we currently have are not as strong as they should be ideally, and moreover they will probably degrade as models get more capable! I don’t expect them to still be there when we hit ASI, that’s for sure. But what we have now is a lot better than nothing & it’s a great testbed for alignment research! And some of that research can be devoted to strengthening / adding more faithfulness properties!
  - Daniel Kokotajlo 11 Mar 2025 2:59 UTC
    LW: 7 AF: 4
    4
    AF Parent
    Pushing back a bit though on what you said—I think that current reasoning models really are using the CoT to reason. That is, they really are “thinking aloud” to draw an analogy to humans. Imagine a child who, when they learned basic math, were taught to do it by counting aloud using their fingers. And they got really good at doing it that way. It’ll only take a little bit of additional training to get them to do it purely in their head, but so long as you haven’t trained them in that way, you might be able to mostly tell what mathematical operations they are doing just by looking at their fingers twitch!
    
    This is different from a model which is trained to do long chains of reasoning in latent space & then told to write english text summarizing the reasoning. Completely different.
    - Shayne O'Neill 11 Mar 2025 4:39 UTC
      3 points
      0
      Parent
      Sure, but “Thinking out loud” isnt the whole picture, theres always a tonne of cognition going on before words leave the lips, and I guess its also gonna depend on how early in its training process its learning to “count on its fingers”. If its just taking cGPT then adding a bunch of “count on your fingers” training, its gonna be thinking “Well, I can solve complex navier stokes problems in my head faster than you can flick your mouse to scroll down to the answer, but FINE ILL COUNT ON MY FINGERS”.
      - Seth Herd 11 Mar 2025 7:04 UTC
        5 points
        0
        Parent
        The fastest route to solving a complex problem and showing your work is often to just show the work you’re doing anyway. That’s what teachers are going for when they demand it. If you had some reason for making up fake work instead you could. But you’d need a reason.
        
        Here it may be relevant that some of my friends did make up fake work when using shortcut techniques of guessing the answer in algebra.
        
        Sure it would be better to have a better alignment strategy. But there are no plausible routes I know of to getting people to stop developing LLMs and LLM agents. So attempts at training for faithful CoT seems better than not.
        
        So I think we should really try to get into specifics. If there are convincing reasons to think the real cognition is done outside of CoT (either theoretical or empirical), that would keep people from trusting CoT when they shouldn’t. Raising the possiblity of fake CoTs without specific arguments for why they’d be outright deceptive is far less compelling, and probably not enough to change the direction of progress.