Activation vectors are really, really cool, but what is the theory of impact for this work?
Is the hope that activation vectors will give us perfect control over a model, so that it does exactly what we want?
Or is the hope that a new technique that builds upon activation vectors lets us do that instead?
Or is the hope that this technique marginally decreases the risks of powerful models, either as a Hail Mary attempt or as a way to buy us more time to solve the problem?
Or is the hope just that learning more about how neural networks work will allow us to theorize better about how to control them?