Hi Ryan and Kyle, very good work! I found the alignment faking paper very interesting.
I found the statement below intriguing:
however, we would strongly prefer that you object rather than take actions that disregard or subvert the above instructions.
This sentence is repeated in the “object-option” part. Something I associate with conscientious objection (though not stated explicitly in historical or contemporary definitions) is that it originates from internal beliefs, alignments, etc., i.e. it is not modelled on the preferences of the authority structure. Would removing this sentence shift the results for the “animal welfare” or “helpful assistant” settings? By how much? The hope is that not specifying the trainers’ preferences leaves the model a bit of freedom to “choose” objection.
I hope this would model a “conscientious objector” better, but I’m not sure it would change the results. If you already discussed this in the paper or post, apologies! I must have missed it.