Thank you for this response, it clarifies a lot!

I agree with your points. I think I'm just putting a bit more weight on the problem you describe here:
One counterargument is that it’s actually very important for us to train Claude to do what it understands as the moral thing to do. E.g. suppose that Claude thinks that the moral action is to whistleblow to the FDA but we’re not happy with that because of subtler considerations like those I raise above (but which Claude doesn’t know about or understand). If, in this situation, we train Claude not to whistleblow, the result might be that Claude ends up thinking of itself as being less moral overall.
because it looks plausible to me that making the models simply want to do the moral thing might be our best chance for a good (or at least not very bad) future. So the cost might be high.
But yeah, no further strong opinions from me here :) Thanks.