As far as misaligned drives go, this seems like a tendency that makes us safer on net, so maybe we shouldn’t be too hasty to try training it out.
I don’t currently agree this drive makes us safer, but I agree it isn’t in-and-of-itself a non-trivial risk increase, at least as it currently manifests. (It’s evidence of poor training incentives in general, which seems like a potentially large risk factor.)
Sure, it can be evidence of bad (or good) things, but that’s different from whether it’s safer in-and-of-itself. For me, it’s a positive update that Satisficers might be more natural than Maximizers.
For me, it seems really obviously the case that something that gets tired is less dangerous than something that doesn’t, all else equal.
What is your threat model?
I think current AIs having this property is probably slightly differentially harmful for harder-to-check tasks and generally contributes to underelicitation. I don’t have a very strong view on the sign of general underelicitation in current models, but I tentatively think underelicitation is slightly bad overall.