In a real turn, you don’t get this kind of warning.
I disagree, I think that toy results like this are exactly the kind of warning we’d expect to see.
You might not get a warning shot from a superintelligence, but it seems valuable to collect examples like this of warning shots from dumber systems. If takeoff is continuous, and a treacherous turn is eventually coming, then watching closely for failed examples (though hopefully ones more sophisticated than this!) seems like a great way to get people to take treacherous turns seriously.
Trying to unpack why I don’t think of this as a treacherous turn:
It’s a simple case of a nearest unblocked strategy
I’d expect a degree of planning and human-modelling, both of which were absent in this case. A ‘deception phase’ based on unplanned behavioural differences between environments doesn’t quite fit for me.
Neither the evolved organisms nor the process of evolution are sufficiently agentlike that I find the “treacherous turn” to be a useful intuition pump.
I think it’s mostly the intuition-pump argument. There are obvious risks of evolving behaviour you didn’t want (mostly, but not always, via goal misspecification), but to me a treacherous turn implies a degree of planning, and possibly acausal cooperation, that would be very much more difficult to evolve.
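Since “nearest unblocked strategy” carries a lot of weight in this unpacking, here is a minimal toy sketch of the dynamic. This is my own illustrative construction (the behaviour names and reward values are invented, and it is not the evolved-organisms result under discussion): a naive evolutionary search maximizes a proxy reward, the overseer blocks the one exploit it recognizes, and the search simply settles on the nearest unblocked variant rather than the intended behaviour.

```python
import random

random.seed(0)

INTENDED = "work"          # behaviour the designer actually wants
EXPLOIT = "cheat"          # high-reward loophole the overseer knows about
NEAR_EXPLOIT = "cheat_v2"  # slight variant of the same loophole
BEHAVIOURS = [INTENDED, EXPLOIT, NEAR_EXPLOIT]

def reward(behaviour, blocked):
    """Proxy reward: loopholes pay more than intended behaviour,
    but anything the overseer has blocked pays nothing."""
    if behaviour in blocked:
        return 0.0
    if behaviour == INTENDED:
        return 1.0
    return 2.0

def evolve(blocked, generations=200):
    """Naive (1+1)-style search: keep a mutant if it does at least as well."""
    current = INTENDED
    for _ in range(generations):
        mutant = random.choice(BEHAVIOURS)
        if reward(mutant, blocked) >= reward(current, blocked):
            current = mutant
    return current

print(evolve(blocked=set()))         # search finds a loophole, not "work"
print(evolve(blocked={EXPLOIT}))     # block the known exploit: the nearest
                                     # unblocked variant wins instead
```

No planning or modelling of the overseer is involved anywhere in this loop, which is the point of the distinction above: blind selection pressure alone routes around the patch.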