Charlie Steiner comments on EIS V: Blind Spots In AI Safety Interpretability Research

Charlie Steiner 22 Feb 2023 1:39 UTC
LW: 3 AF: 1
0
I think it’s a big stretch to say that deception is basically just trojans. There are similarities, but the regularities that make deception a natural category of behavior that we might be able to detect are importantly fuzzier than the regularities that trojan-detecting strategies use. If “deception” just meant acting according to a wildly different distribution when certain cues were detected, trojan-detection would have us covered, but what counts as “deception” depends more heavily on our standards for the reasoning process, and doesn’t reliably result in behavior that’s very different from non-deceptive behavior.
- scasper 22 Feb 2023 2:17 UTC
  LW: 2 AF: 1
  1
  Thanks. See also EIS VIII.
  Could you give an example of a case of deception that is quite unlike a trojan? Maybe we have different definitions. Maybe I’m not accounting for something. Either way, it seems useful to figure out the disagreement.
  - Charlie Steiner 22 Feb 2023 18:10 UTC
    LW: 3 AF: 2
    0
    I’m slowly making my way through these, so I’ll leave you a more complete comment after I read post 8.