I believe this is a clear demonstration that misalignment likely does not stem from the model being “evil.” It simply found a better way to achieve its goal through unintended means.
It is fascinating to see that the official science has finally discovered what Yudkowsky wrote about a decade ago. Better late than never, I guess.
They should actually reference Yudkowsky.
I don’t see them referencing Yudkowsky, even though their paper https://cdn.openai.com/pdf/34f2ada6-870f-4c26-9790-fd8def56387f/CoT_Monitoring.pdf lists over 70 references (someone should tell Schmidhuber ;-)).
This branch of the official science is younger than 10 years (and it started as a fairly non-orthodox one; it’s only recently that it has started to feel like the official one, certainly no earlier than the formation of Anthropic, and probably quite a bit later than that).
And this part is what Robin Hanson predicted about a decade ago. If I remember it correctly, he wrote that AI Safety was a low-status thing, therefore everyone associated with it was low-status. And if AI Safety ever becomes a high-status thing, then the people in the field will not want to be associated with their low-status predecessors. So instead of referencing them, an alternative history will be established, where someone high-status will be credited for creating the field from scratch (maybe using some inspiration from high-status people in adjacent fields).
As someone who writes these kinds of papers, I make an effort to cite the original inspirations when possible. And although I agree with Robin’s theory broadly, there are also some mechanical reasons why Yudkowsky in particular is hard to cite.
As a reader, the things I find most valuable about the academic paper style are:
1) Having a clear, short summary (the abstract).
2) Stating the claimed contributions explicitly.
3) Using standard jargon, or if not, noting so explicitly.
4) A related work section that contrasts one’s own position against others’.
5) Being explicit about what evidence you’re marshalling and where it comes from.
6) Listing main claims explicitly.
7) The best papers include a “limitations” or “why I might be wrong” section.
Yudkowsky mostly doesn’t do these things. That doesn’t mean he doesn’t deserve credit for making a clear and accessible case for many foundational aspects of AI safety. It’s just that in any particular context, it’s hard to say what, exactly, his claims or contributions were.
In this setting, maybe the most appropriate citation would be something like “as illustrated in many thought experiments by Yudkowsky [cite particular sections of the Sequences and HPMOR], it’s dangerous to rely on any protocol for detecting scheming by agents more intelligent than oneself”. But that’s a pretty broad claim. Maybe I’m being unfair—but it’s not clear to me what exactly Yudkowsky’s work says about the workability of these schemes other than “there be dragons here”.