ceselder comments on Emergent Introspective Awareness in Large Language Models

ceselder 31 Oct 2025 7:50 UTC
4 points
2
Possibly yes, but I don’t think that’s a legitimate safety concern, since this can already be done very easily with other techniques. Also, this technique would require you to model-diff against a non-refusal prompt for the bad concept in the first place, so the safety argument is moot. But it sounds like an interesting research question.