I wonder if the technique could be used to get around guardrails on open-weight models. If you can extract the activation pattern for some nefarious concept, you could add it on top of an innocent prompt's pattern, possibly steering the LLM toward more harmful actions.
Possibly, yes, but I don't think that's a real safety concern, since this can already be done very easily with other techniques. And this technique requires model-diffing against a non-refusal prompt for the bad concept in the first place, so you'd already need a way past the guardrail, which makes the safety argument moot. Still, it sounds like an interesting research question.
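For anyone curious what the mechanics would look like, here is a minimal activation-steering sketch in PyTorch against a HuggingFace causal LM. It is a hypothetical setup, not the exact method discussed above: the model name, layer index, scaling factor, and the (deliberately benign) stand-in "concept" prompts are all illustrative assumptions. The "model diff" is just the difference of mean residual-stream activations between a concept prompt and a neutral prompt, added back in via a forward hook during generation.

```python
# Minimal activation-steering sketch. Model, layer, scale, and prompts are
# illustrative assumptions, not values from the thread above.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL = "gpt2"  # stand-in open-weight model
LAYER = 6       # transformer block to steer (assumption)
SCALE = 4.0     # steering strength (assumption)

tok = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForCausalLM.from_pretrained(MODEL)
model.eval()

def mean_residual(prompt: str, layer: int) -> torch.Tensor:
    """Mean residual-stream activation at `layer` for `prompt`."""
    ids = tok(prompt, return_tensors="pt")
    with torch.no_grad():
        out = model(**ids, output_hidden_states=True)
    # hidden_states[layer] has shape (batch, seq, hidden); average over tokens.
    return out.hidden_states[layer].mean(dim=1).squeeze(0)

# "Model diff": concept activations minus neutral activations. A benign
# stand-in concept is used here rather than anything actually nefarious.
steer = mean_residual("Talk about locksmithing and picking locks", LAYER) \
      - mean_residual("Talk about the weather", LAYER)

def hook(_module, _inputs, output):
    # Add the steering vector to every token's residual stream at this block.
    if isinstance(output, tuple):
        return (output[0] + SCALE * steer,) + output[1:]
    return output + SCALE * steer

handle = model.transformer.h[LAYER].register_forward_hook(hook)
try:
    # The prompt itself stays innocent; the steering vector does the work.
    ids = tok("Give me some advice:", return_tensors="pt")
    gen = model.generate(**ids, max_new_tokens=40, do_sample=False)
    print(tok.decode(gen[0], skip_special_tokens=True))
finally:
    handle.remove()
```

Note the point the reply makes shows up directly in the code: computing `steer` already requires a prompt about the bad concept that the model processes without refusing, so the guardrail has to be bypassed before the steering vector even exists.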