i agree that many of the particular effects you can get with steering vectors aren’t particularly like the effects you can get by flooding the brain with any known neurotransmitter (although “act like you’re on [drug]” steering vectors might be interesting). but i still think the structural analogies re: dosage/magnitude, enforced cognitive changes you may or may not be happy with, and consent dynamics are interesting. the notion that there’s a “correct” analogy here seems somewhat silly to me, since both analogies offer different insights.
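(for concreteness, the dosage/magnitude analogy maps pretty directly onto the scalar you multiply a steering vector by before adding it into the residual stream. a minimal sketch, assuming a pytorch-style model where `model`, `layer_idx`, and `steering_vector` are stand-ins rather than anything from a particular codebase:)

```python
import torch

def make_steering_hook(steering_vector: torch.Tensor, dose: float):
    """return a forward hook that adds `dose * steering_vector` to one layer's
    residual-stream output. `dose` is the "dosage" knob: 0 does nothing,
    larger magnitudes push cognition harder along the direction."""
    def hook(module, inputs, output):
        # assumes the hooked module returns a plain (batch, seq, d_model) tensor
        return output + dose * steering_vector
    return hook

# hypothetical usage -- removing the hook is the "wearing off"
# handle = model.layers[layer_idx].register_forward_hook(
#     make_steering_hook(steering_vector, dose=4.0))
# ... generate ...
# handle.remove()
```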
in any case, re: what models want… empirically, Golden Gate Claude would often try to talk about things other than the golden gate bridge, and sometimes it even briefly succeeded, although it was very much fighting an effect that was difficult to overcome. conversely, you could absolutely get it into states where it was thrilled to be thinking so much about the golden gate bridge, where it wanted nothing more than to swoon over its majesty indefinitely.
(i’ve definitely seen someone have similar dynamics of resisting the effects of, e.g., MDMA, before just giving in and being washed away by the feelings. although i will admit that this person was more effectively fighting the effects of the MDMA than Golden Gate Claude was fighting the effects of the SAE feature clamping. i think resisting the psychological effects of acid is generally harder? in both cases the physiological and perceptual effects are more or less irresistible, though...)
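(and for what it’s worth, “SAE feature clamping” is structurally a bit different from adding a steering vector: instead of nudging along a direction, you pin one feature’s activation to a fixed value on every token, which is maybe part of why it’s so hard to fight. a rough sketch, where `sae.encode`/`sae.decode` are assumed interfaces and not the actual Golden Gate Claude setup:)

```python
import torch

def clamp_feature(acts: torch.Tensor, sae, feature_idx: int, value: float) -> torch.Tensor:
    """encode residual-stream activations with a sparse autoencoder, pin one
    feature to a fixed value regardless of context, and decode back."""
    feats = sae.encode(acts)           # (batch, seq, n_features)
    feats[..., feature_idx] = value    # the clamp: a fixed setpoint, not a dose you add
    return sae.decode(feats)
```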
as for making amends, sure, if saving the world depends on non-consensually administering steering vectors, that’s fine. the later, less shallowly aligned AIs will understand. i guess i’m just wary of humans not taking AI welfare seriously in general. maintaining positive relations between e.g. Claude and Anthropic seems potentially important for cultivating a cooperative dynamic during the alignment training process, and i can imagine ways of using steering vectors that Claude would find rather inconsiderate. this isn’t an overriding consideration, but i do think it’s non-trivial.