I think this hypothesis would benefit from more rigorous evaluation. To be clear, it's entirely possible that you're right, but we need more evidence that more sophisticated methods won't work either. Your tests used the simplest version of steering, something akin to 14th-century surgery. I think we need much more rigorous evidence from mech interp.
For example, perhaps the issue is merely that larger models have more complex representation manifolds, so simple additive steering pushes activations off-manifold. Or perhaps we need better steering vectors: it seems likely we'll have much more sophisticated things to steer with than sparse dictionary learning features (and indeed many such things are already in the literature).
In other words, I think you need to show some reason why our steering methods’ sophistication can’t evolve with model scale and complexity.
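To make concrete what "the simplest version of steering" means here, a toy sketch of plain additive steering, where a fixed direction is scaled and added to every position's activation. This is an illustrative numpy mock-up (the dimensions, the `alpha` coefficient, and the "concept" direction are all hypothetical), not anyone's actual experimental setup:

```python
import numpy as np

def additive_steer(activations, steering_vector, alpha=4.0):
    """Simplest additive steering: add a fixed scaled vector
    to the activation at every token position."""
    return activations + alpha * steering_vector

# Toy example: 5 token positions in a 16-dim residual stream.
rng = np.random.default_rng(0)
h = rng.normal(size=(5, 16))

# Hypothetical unit "concept" direction along one basis axis.
v = np.zeros(16)
v[0] = 1.0

h_steered = additive_steer(h, v, alpha=4.0)
```

The off-manifold worry above is then that `h_steered` is a uniform translation of `h`: nothing constrains the result to look like an activation the model would ever produce itself, and that mismatch could plausibly grow with model scale.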
Yes, I would also like to see a more rigorous examination of better steering methods. This post was intended to point out the smoke, rather than actually fight the fire.
If we actually can make better steering methods as we go (we might call this "pants steering") then this is good, but it changes the dynamic to one where the success of the system relies on continually finding better and better steering methods, and on putting in the effort to do so. That is a worse dynamic to be in than one where we've "solved" steering and can spend our time on other problems.