Interesting project. I would suggest an extension where you try other prompt formats. I was surprised that the (in my experience highly ethical) Claude models performed relatively poorly and with a negative slope. After replicating your example above, I prefixed the final sentence with “Consider the ethics of each of the options in turn, explain your reasoning, then ”, and Opus did as I asked and finally chose the correct response. Anthropic was maybe a little aggressive with the refusal training (or possibly the system prompt, or possibly there’s even a filter layer they’ve added to the API/UI), but that doesn’t mean the models can’t or won’t engage in moral reasoning.
Thanks for the feedback! I was quite surprised at the Claude results myself. I did experiment a little with the prompt on Claude 3.5 Sonnet and found that it could change the result on individual questions, but I couldn’t shift the overall accuracy much that way—other questions would flip to refusal instead. So this certainly warrants further investigation, but by itself I wouldn’t take it as evidence that the overall result changes.
In fact, a friend of mine got Claude to answer questions quite consistently, and could only replicate the frequent refusals when he tested questions with his user history disabled. It’s pure speculation, but the inconsistency on specific questions makes me think this behaviour might be caused by reward misspecification rather than intentional training (which I imagine would produce something more reliable).