I think there is an important component of trustworthiness that you don’t emphasize enough. It isn’t sufficient to just rule out alignment faking; we need the AI to actually try hard to faithfully pursue our interests, including on long, confusing, open-ended, and hard-to-check tasks. You discuss establishing this with behavioral testing, but I don’t think this is trivial to establish that way. (I happen to think this is pretty doable, and easier than ruling out reasons why our tests might be misleading, but this seems nonobvious.)
Perhaps you should specifically call out that this post isn’t explaining how to do this testing and that it could be hard.
(This is related to Eliezer’s comment and your response to that comment.)
I don’t expect this testing to be hard (at least conceptually, though doing this well might be expensive), and I expect it can be entirely behavioral.
See section 6: “The capability condition”
But it isn’t a capabilities condition? Maybe I would be happier if you renamed this section.
To be clear, I think there are important additional considerations, stemming from the fact that we don’t just care about capabilities, that aren’t covered in that section. That said, the section is not far from what I would say if you renamed it to “behavioral tests”, covering both capabilities and alignment (that is, alignment other than stuff that messes with behavioral tests).
Yeah, that’s fair. Currently I merge “behavioral tests” into the alignment argument, but that’s a bit clunky and I probably should have just made the carving:
1. looks good in behavioral tests
2. is still going to generalize to the deferred task
But my guess is we agree on the object level here and there’s a terminology mismatch. Obviously the models have to actually behave in a manner that is at least as safe as human experts, in addition to displaying comparable capabilities on all safety-related dimensions.