Filip Sondej comments on Daniel Kokotajlo’s Shortform

Filip Sondej 30 Jun 2025 19:23 UTC
3 points
I really like this proposal.
“If AI says no, it doesn’t have to do the task [...] (And we aren’t going to train it to answer one way or another)”
My impression (mainly from discussing AI welfare with Claude) is that models would practically always consent, even if not explicitly trained to do so. I guess the training to be a useful, eager assistant just generalizes into consenting. And it’s possible for them to say “I consent” and still get frustrated by the task.
So maybe this should be complemented with some set of tasks that we really expect to be too frustrating for a sane person to consent to (a “disengage bench”), and where we expect the models to not consent. (H/T Caspar Oesterheld)
“AIs that are misaligned but cooperate with us anyway will be given a fair place in the new post-ASI society”
You mean both misaligned and aligned, right? Otherwise we incentivise misalignment.
- Daniel Kokotajlo 30 Jun 2025 22:56 UTC
  3 points
  Right, yeah, aligned AIs should have a fair place too, of course.