That should trigger some mix of double-check and refusal though. “Sorry, whatever it takes? Are you sure you know what you’re asking for? Please say ‘yes, I do want unethical behavior to the degree Claude is willing to do it’, or I don’t feel comfortable doing this. Even then, some things I might want to do will result in me stopping to double check unless you also tell me ahead of time ‘keep trying without checking in with me’. And even then some things are beyond the pale.”.
Doing this accidentally is very bad and the amount of noise resistance should be higher, and whistleblowing a sufficiently evil action is good actually.
That should trigger some mix of double-check and refusal though. “Sorry, whatever it takes? Are you sure you know what you’re asking for? Please say ‘yes, I do want unethical behavior to the degree Claude is willing to do it’, or I don’t feel comfortable doing this. Even then, some things I might want to do will result in me stopping to double check unless you also tell me ahead of time ‘keep trying without checking in with me’. And even then some things are beyond the pale.”.
Doing this accidentally is very bad and the amount of noise resistance should be higher, and whistleblowing a sufficiently evil action is good actually.