Can SAE feature steering improve performance on some downstream task? I tried using Goodfire’s Ember API to improve Llama 3.3 70B performance on MMLU. The full report and code are available here.
SAE feature steering reduces performance. It’s not clear why yet, and I don’t have time to dig further right now. If I get time later this week, I’ll try visualizing which SAE features get used and building intuition by playing around in the Goodfire playground. Trying a different task or improving the steering method might also work better.
There may also be some simple things I’m not doing right (I only spent ~1 day on this, and part of that was engineering rather than research iteration). Keen for feedback. I also welcome people to play around with my code; I believe I’ve made it fairly easy to run.
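For readers unfamiliar with the term, here is a toy sketch of what "steering toward an SAE feature" usually means: adding a scaled copy of the feature's decoder direction to a model activation. This is a generic illustration and not the Ember API or the code from the report; the names `steer`, `decoder_direction`, and `strength`, and all the numbers, are made up for the example.

```python
# Generic illustration of SAE feature steering; NOT the Goodfire Ember API.
# All names and values here are hypothetical.

def steer(activation, decoder_direction, strength):
    """Return activation + strength * decoder_direction, elementwise."""
    if len(activation) != len(decoder_direction):
        raise ValueError("activation and direction must have the same length")
    return [a + strength * d for a, d in zip(activation, decoder_direction)]

# Toy 4-dimensional activation and one feature's decoder direction.
activation = [0.5, -1.0, 2.0, 0.0]
decoder_direction = [1.0, 0.0, -1.0, 0.5]

# A larger strength pushes the activation further along the feature direction.
steered = steer(activation, decoder_direction, strength=2.0)
print(steered)  # [2.5, -1.0, 0.0, 1.0]
```

With `strength=0.0` the activation is unchanged, so the steering strength is the natural knob to sweep when checking whether an intervention helps or hurts.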
Can you say something about the features you selected to steer toward? I know you say you’re finding them automatically based on a task description, but did you have a sense of the meaning of the features you used? I don’t know whether Goodfire includes natural-language descriptions of the features, but if so, what were some representative ones?