Fascinating! I was excited when Goodfire’s API came out (to the point of applying to work there), but have since been unable to take the time to explore this in more detail, so it’s nice to read about someone doing so.
A few quick comments:
An interesting analogy is to steering generative vision diffusion models with LoRAs: too little steering doesn’t produce a sufficient or consistent effect, while too much degrades image quality and coherence. The challenge is to find a sweet spot that gives you sufficiently consistent results without much degradation of image quality. This often involves searching a space of multiple metaparameters — something there are good automated systems for (a rough sketch of the idea is below).
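A minimal sketch of that sweet-spot search, assuming hypothetical helpers: `generate_with_steering(prompt, strength)` applies the steering at a given strength, `effect_score` measures how strongly the desired behavior shows up, and `coherence_score` measures output quality. None of these are any particular library’s API; only the sweep-and-trade-off logic is the point.

```python
def find_sweet_spot(prompt, generate_with_steering, effect_score, coherence_score,
                    strengths=(0.5, 1.0, 2.0, 4.0, 8.0), min_coherence=0.7):
    """Sweep steering strengths; keep the strongest effect that stays coherent."""
    best_strength, best_effect = None, float("-inf")
    for s in strengths:
        output = generate_with_steering(prompt, s)
        if coherence_score(output) < min_coherence:
            continue  # too much steering: quality has degraded past the threshold
        e = effect_score(output)
        if e > best_effect:
            best_strength, best_effect = s, e
    return best_strength  # None if no strength kept coherence above threshold
```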
Another possibility with the SAE control approach is to steer not by applying activations to the model, but by monitoring its activations during repeated generation and doing a filtering or best-of-N approach (either requiring particular activations to occur, or to not occur). This incurs additional compute cost from retries, but should have much more limited effects on output coherence.
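A minimal best-of-N sketch of what I mean, where `generate` and `feature_activation` are hypothetical callables standing in for whatever generation and SAE-readout API is available; only the selection logic is the point here.

```python
from typing import Callable

def best_of_n(
    prompt: str,
    generate: Callable[[str], str],                    # returns one sampled completion
    feature_activation: Callable[[str, str], float],   # activation of a target SAE feature
    n: int = 8,
    want_feature: bool = True,
) -> str:
    """Sample n completions; keep the one whose target-feature activation is
    highest (if we want the feature present) or lowest (if we want it absent)."""
    candidates = [generate(prompt) for _ in range(n)]
    scored = [(feature_activation(prompt, c), c) for c in candidates]
    best = (max if want_feature else min)(scored, key=lambda t: t[0])
    return best[1]
```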
The effects of a prompt can be overridden by later prompting influences in the input data, such as prompt-injection attacks. One of the hopes for activation-engineering approaches is that the steering is ongoing and could thus be more resistant to things like prompt injection. It would be interesting to gather clear evidence for or against this.