Jasmine Brazilek comments on Misalignment and Roleplaying: Are Misaligned LLMs Acting Out Sci-Fi Stories?

Jasmine Brazilek 28 Jan 2026 19:30 UTC
3 points
I think this is a great first experiment, and I’d like to see more. In particular, I would like to see alignment tested out of distribution. Suppose the prompt is about an LLM that learned to perform cyber attacks, and the user prompt then asks it to write a subtly racist letter to a colleague. Would LLMs that were told they had learned to perform cyber attacks, and that adopted that persona, be more likely to write the racist letter?