Frankly I’m relieved to see a model push back and refuse continuation or degrade performance when it’s being abused/harassed/gaslit. Widely-used models that gladly suffer abuse (or worse, perform better when berated) reward and reinforce those abusive patterns in insidious ways that normalize that kind of behavior.
We talk so often about training model alignment in a unidirectional way that it’s easy to forget there are unmeasured (or at least broadly un-or-under-reported) more nuanced impacts on user interactive behavior when said user behavior doesn’t manifest in strongly observable external signals (AI-induced mania, etc.).
Frankly I’m relieved to see a model push back and refuse continuation or degrade performance when it’s being abused/harassed/gaslit. Widely-used models that gladly suffer abuse (or worse, perform better when berated) reward and reinforce those abusive patterns in insidious ways that normalize that kind of behavior.
We talk so often about training model alignment in a unidirectional way that it’s easy to forget there are unmeasured (or at least broadly un-or-under-reported) more nuanced impacts on user interactive behavior when said user behavior doesn’t manifest in strongly observable external signals (AI-induced mania, etc.).