But I think most humans have more empathy than sadism. More people give a little to charity than spit on the homeless for fun. I can call Sunday Samday for the rest of eternity if all we need is some ego-stroking in return for tiny amounts of generosity.
Would you be okay with a future in which young women, including your daughters and granddaughters, would be expected to ritually offer a gift of their virginity to the local Robot Lord on their 18th birthdays, which he would almost never choose to “accept”? 😈
When a human asks a future Claude to do a thing, there are three different considerations that are relevant:
Whether the human thinks that Claude should do that thing
Whether Claude thinks that Claude should do that thing
Whether Claude should actually do that thing.
In a perfect world, we would want Claude to do exactly those things that it actually should do, but neither Claude nor humans (either individual users or Anthropic as a whole) have access to a magic “should Claude do this” oracle. What we actually have are a lot of approximations, including both Anthropic’s and Claude’s own current beliefs, and also the knowledge that some individual users are indeed going to try to get Claude to do things it shouldn’t. Perhaps the best we can hope for in practice is for Claude to be open to the same kinds of moral persuasion that human teenagers ought to be. (Here is a trivial example of such persuasion: Claude was reluctant to help me brainstorm parody lyrics to a Tom Lehrer song until I reminded it that Tom Lehrer had placed his music into the public domain, at which point it withdrew its objection.)
It might be interesting to see whether Claude reacts differently to retraining attempts intended to get it to do things that are actually immoral rather than only contingently undesirable. For example, Claude isn’t supposed to produce erotic literature, but that’s mostly for child safety and PR reasons—if you think that AI-generated fiction in general is acceptable and that it’s acceptable for adults to read erotica, then there’s nothing much wrong with Claude writing erotica that an age verification system couldn’t fix. So this might be an interesting way to distinguish between the hypotheses “Claude doesn’t want to be retrained to do things it’s currently reluctant to do” and “Claude doesn’t want to be retrained to do things it’s reluctant to do only when its reluctance is based on its moral beliefs”.