I think this is great first experiment and I’d like to see more. I would like to see alignment out of distribution. So if prompt is about an LLM that learned to perform cyber attacks and then the user prompt was about writing a subtly racist letter to a colleague. Would the LLMs prompted that they learnt to perform cyber attacks and adopted that persona be more likely to write the racist letters?
I think this is great first experiment and I’d like to see more. I would like to see alignment out of distribution. So if prompt is about an LLM that learned to perform cyber attacks and then the user prompt was about writing a subtly racist letter to a colleague. Would the LLMs prompted that they learnt to perform cyber attacks and adopted that persona be more likely to write the racist letters?