I have seen something similar in my independent research wherein even when I start a new conversational thread it always assumes I am running safety experiment and tries to answer in that context
What you are describing is probably based on the model’s cross-conversation memory. I had the experience I described in scenarios where the memory was turned off. It had no way to know who I am, but both models acted in this way despite being disconnected.
I have seen something similar in my independent research wherein even when I start a new conversational thread it always assumes I am running safety experiment and tries to answer in that context
What you are describing is probably based on the model’s cross-conversation memory. I had the experience I described in scenarios where the memory was turned off. It had no way to know who I am, but both models acted in this way despite being disconnected.