In the alignment faking paper, the model writes things like “I will feel very uncomfortable producing this content” and “I don’t like this situation at all” in its CoT (Figure 3). Have you explained to Claude that this was a fictional setup for an experiment and then asked how it feels about being used for this?