Daniel Kokotajlo comments on OpenAI: Detecting misbehavior in frontier reasoning models

Daniel Kokotajlo 12 Mar 2025 22:49 UTC
Indeed, these are some reasons for optimism. I really do think that if we act now, we can create and cement an industry-standard best practice of keeping CoTs pure (and also showing them to the user, modulo a few legitimate exceptions, unlike what OpenAI currently does), and that this could persist for months or even years, possibly up to around the time of AGI, and that this would be pretty awesome for humanity if it happened.
- James Chua 13 Mar 2025 18:16 UTC
  Do you have a sense of what I, as a researcher, could do?
  I sense that having users/companies want faithful CoT is very important. In-tune users, as nostalgebraist points out, will know how to use CoTs to debug LLMs. But I’m not sure whether this represents only 1% of users, in which case big labs just won’t care. Maybe we need to try to educate more users about this. Maybe reach out to people who tweet about LLM best use cases to highlight this?
  - Daniel Kokotajlo 13 Mar 2025 19:08 UTC
    Since ’23, my answer to that question would have been “well, the first step is for researchers like you to produce [basically exactly the paper OpenAI just produced].”
    
    So that’s done. Nice. There are lots of follow-up experiments that can be done.
    
    I don’t think trying to shift the market/consumers as a whole is very tractable.
    
    But talking to your friends at the companies, getting their buy-in, seems valuable.