A very interesting idea. But how would you then construct steering vectors for, say, politeness, refusal, or particular biases?
I think for those cases you’re better off using standard methods (multiple-choice prompts, etc.). This technique is only useful when paired positive/negative data is harder to create, as with writing imitation.
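For reference, here is a rough sketch of what a standard paired-data method can look like, assuming it means taking the difference of mean activations between positive and negative examples. The function name, the toy activation values, and the scale factor of 2.0 are all made up for illustration; in practice the activations would come from a chosen layer of the model.

```python
import numpy as np

def mean_difference_vector(pos_acts, neg_acts):
    """Steering vector as mean(positive) minus mean(negative) activations."""
    pos = np.asarray(pos_acts, dtype=float)
    neg = np.asarray(neg_acts, dtype=float)
    return pos.mean(axis=0) - neg.mean(axis=0)

# Toy stand-ins for hidden states at one layer (hypothetical values).
polite = [[1.0, 0.0, 2.0], [3.0, 0.0, 2.0]]
rude = [[0.0, 1.0, 2.0], [0.0, 3.0, 2.0]]

steer = mean_difference_vector(polite, rude)

# Steering adds a scaled copy of the vector to a hidden state.
hidden = np.array([0.5, 0.5, 0.5])
steered = hidden + 2.0 * steer
```

This only works when both sides of the pair are easy to collect, which is the case for politeness or refusal and not for something like writing imitation.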