ryan_greenblatt comments on ryan_greenblatt’s Shortform

ryan_greenblatt 14 Mar 2026 20:33 UTC
5 points
1

As far as misaligned drives go, this seems like a tendency that makes us safer on net, so maybe we shouldn’t be too hasty to try training it out.

I don’t currently agree this drive makes us safer, but I agree it isn’t in-and-of-itself a non-trivial risk increase, at least as it currently manifests. (It’s evidence of poor training incentives in general, which seems like a potentially large risk factor.)
- Adele Lopez 14 Mar 2026 21:04 UTC
  2 points
  0
  Sure, it can be evidence of bad (or good) things, but that’s different from whether it’s safer in-and-of-itself. For me, it’s a positive update that Satisficers might be more natural than Maximizers.
  
  For me, it seems really obviously the case that something that gets tired is less dangerous than something that doesn’t, all else equal.
  
  What is your threat model?
  - ryan_greenblatt 15 Mar 2026 4:27 UTC
    4 points
    0
    I think current AIs having this property is probably slightly differentially harmful for harder-to-check tasks and generally contributes to underelicitation. I don’t have a very strong view on the sign of general underelicitation in current models, but I tentatively think underelicitation is slightly bad overall.