Why doesn’t algebraic value editing break all kinds of internal computations?! What happened to the “manifold of usual activations”? Doesn’t that matter at all?
Or the hugely nonlinear network architecture, which doesn’t even have a persistent residual stream? Why can I diff across internal activations for different observations?
Why can I just add 10 times the top-right vector and still get roughly reasonable behavior?
And the top-right vector also transfers across mazes? Why isn’t it maze-specific?
To make up some details: why wouldn’t internal “I want to go to top-right” motivational information be highly entangled with “maze wall location” information?
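For concreteness, the recipe I’m puzzled by is roughly the following (toy network, layer choice, observation pair, and the ×10 coefficient are all placeholders here, not the posts’ actual setup):

```python
import torch
import torch.nn as nn

class TinyPolicy(nn.Module):
    """Stand-in conv policy net (no residual stream); the real maze network is much larger."""
    def __init__(self):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(3, 16, 3, padding=1), nn.ReLU(),
            nn.Conv2d(16, 16, 3, padding=1), nn.ReLU(),
        )
        self.head = nn.Linear(16, 4)  # 4 actions

    def forward(self, obs):
        h = self.conv(obs)      # hook target: mid-network activations
        h = h.mean(dim=(2, 3))  # global average pool
        return self.head(h)

net = TinyPolicy().eval()
obs_a = torch.randn(1, 3, 64, 64)  # e.g. an observation that "points at" top-right
obs_b = torch.randn(1, 3, 64, 64)  # a contrasting observation of the same maze

# 1) Record activations at a chosen layer for the two observations.
cache = {}
record_handle = net.conv.register_forward_hook(
    lambda module, inputs, output: cache.update(act=output.detach())
)
net(obs_a); act_a = cache["act"]
net(obs_b); act_b = cache["act"]
record_handle.remove()

# 2) The "vector" is just the difference between the two activation tensors.
top_right_vector = act_a - act_b

# 3) At inference time, add a large multiple of it back into the forward pass.
coeff = 10.0
steer_handle = net.conv.register_forward_hook(
    lambda module, inputs, output: output + coeff * top_right_vector
)
logits = net(obs_b)  # behavior is now nudged, yet apparently stays roughly coherent
steer_handle.remove()
print(logits)
```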
This was also the most surprising part of the results to me.
I think both this work and Neel’s recent Othello post do provide evidence that, at least for small-to-medium-sized neural networks, things are just… represented ~linearly (Olah et al.’s Features as Directions hypothesis). Note that Chris Olah’s earlier work on features as directions was not done on transformers but on conv nets, which also lack residual streams.
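To spell out what I mean by “represented ~linearly”: if a feature really corresponds to a direction, a simple linear probe on activations should recover it. A toy illustration with synthetic activations (not the Othello setup, just the shape of the claim):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
d = 128
true_direction = rng.normal(size=d)
true_direction /= np.linalg.norm(true_direction)

# Fake "activations": noise, shifted along a single direction according to a binary feature.
labels = rng.integers(0, 2, size=1000)
acts = rng.normal(size=(1000, d)) + np.outer(2 * labels - 1, true_direction)

# If the feature is linear, a linear probe should find it easily...
probe = LogisticRegression(max_iter=1000).fit(acts, labels)
print("probe accuracy:", probe.score(acts, labels))

# ...and the probe's weight vector should align with the ground-truth direction.
w = probe.coef_[0] / np.linalg.norm(probe.coef_[0])
print("cosine similarity with true direction:", float(w @ true_direction))
```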
Indeed! When I looked into model editing stuff with the end goal of “retargeting the search”, the finickiness and breakdown of internal computations was the thing that eventually updated me away from continuing to pursue it. I haven’t read these maze posts in detail yet, but the fact that these edits don’t ruin the network’s internal computations is surprising, and it makes me think about spending time in this direction again.
I’d like to eventually think of similar experiments to run with language models. For example, you could have a language model learn to solve a text adventure game, then try to edit the model in ways similar to these posts.
Edit: just realized that the next post might be with GPT-2. Exciting!
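Something like this is roughly what I have in mind for the language-model version, using plain HuggingFace forward hooks (the layer index, the contrast prompts, and the coefficient are all made up for illustration, not anyone’s actual method):

```python
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tok = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2").eval()

layer = 6      # arbitrary middle block
coeff = 5.0    # arbitrary steering strength
cache = {}

# 1) Cache the residual stream after one block for two contrasting prompts.
def record(module, inputs, output):
    cache["resid"] = output[0].detach()  # block output is a tuple; [0] is the hidden states

handle = model.transformer.h[layer].register_forward_hook(record)
with torch.no_grad():
    model(**tok("I love this game", return_tensors="pt")); act_pos = cache["resid"]
    model(**tok("I hate this game", return_tensors="pt")); act_neg = cache["resid"]
handle.remove()

# 2) Steering vector: the activation difference at the final token position.
steer_vec = act_pos[:, -1] - act_neg[:, -1]

# 3) Add a multiple of it to that block's output while generating.
def steer(module, inputs, output):
    return (output[0] + coeff * steer_vec,) + output[1:]

handle = model.transformer.h[layer].register_forward_hook(steer)
out = model.generate(
    **tok("This game is", return_tensors="pt"),
    max_new_tokens=20,
    pad_token_id=tok.eos_token_id,
)
handle.remove()
print(tok.decode(out[0]))
```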
Great work, glad to see it out!
I think the hyperlink for “conv nets without residual streams” is wrong? It’s https://www.westernunion.com/web/global-service/track-transfer for me
lol, thanks, fixed