The entrypoint to their sampling code is here. It looks like they just add a forward hook to the model that computes activations for specified features and shifts model activations along SAE decoder directions a corresponding amount. (Note that this is cheaper than autoencoding the full activation. Though for all I know, running the full autoencoder during the forward pass might have been fine also, given that they’re working with small models and adding a handful of SAE calls to a forward pass shouldn’t be too big a hit.)
This uses transformers, which is IIUC way less efficient for inference than e.g. vllm, to an extent that is probably unacceptable for production usecases.
I wonder if it would be possible to do SAE feature amplification / ablation, at least for residual stream features, by inserting a “mostly empty” layer. E,g, for feature ablation, setting the W_O and b_O params of the attention heads of your inserted layer to 0 to make sure that the attention heads don’t change anything, and then approximate the constant / clamping intervention from the blog post via the MLP weights (if the activation function used for the transformer is the same one as is used for the SAE, it should be possible to do a perfect approximation using only one of the MLP neurons, but even if not it should be possible to very closely approximate any commonly-used activation function using any other commonly-used activation function with some clever stacking).
This would of course be horribly inefficient from a compute perspective (each forward pass would take n+kn times as long, where n is the original number of layers the model had and k is the number of distinct layers in which you’re trying to do SAE operations on the residual stream), but I think vllm would handle “llama but with one extra layer” without requiring any tricky inference code changes and plausibly this would still be more efficient than resampling.
The forward hook for our best performing approach is here. As Sam mentioned, this hasn’t been deployed to production. We left it as a case study because Benchify is currently prioritizing other parts of their stack unrelated to ML.
For this demonstration, we added a forward hook to a HuggingFace Transformers model for simplicity, rather than incorporating it into a production inference stack.
The entrypoint to their sampling code is here. It looks like they just add a forward hook to the model that computes activations for specified features and shifts model activations along SAE decoder directions a corresponding amount. (Note that this is cheaper than autoencoding the full activation. Though for all I know, running the full autoencoder during the forward pass might have been fine also, given that they’re working with small models and adding a handful of SAE calls to a forward pass shouldn’t be too big a hit.)
This uses
transformers
, which is IIUC way less efficient for inference than e.g.vllm
, to an extent that is probably unacceptable for production usecases.I wonder if it would be possible to do SAE feature amplification / ablation, at least for residual stream features, by inserting a “mostly empty” layer. E,g, for feature ablation, setting the
W_O
andb_O
params of the attention heads of your inserted layer to 0 to make sure that the attention heads don’t change anything, and then approximate the constant / clamping intervention from the blog post via the MLP weights (if the activation function used for the transformer is the same one as is used for the SAE, it should be possible to do a perfect approximation using only one of the MLP neurons, but even if not it should be possible to very closely approximate any commonly-used activation function using any other commonly-used activation function with some clever stacking).This would of course be horribly inefficient from a compute perspective (each forward pass would take n+kn times as long, where n is the original number of layers the model had and k is the number of distinct layers in which you’re trying to do SAE operations on the residual stream), but I think vllm would handle “llama but with one extra layer” without requiring any tricky inference code changes and plausibly this would still be more efficient than resampling.
The forward hook for our best performing approach is here. As Sam mentioned, this hasn’t been deployed to production. We left it as a case study because Benchify is currently prioritizing other parts of their stack unrelated to ML.
For this demonstration, we added a forward hook to a HuggingFace Transformers model for simplicity, rather than incorporating it into a production inference stack.