I really enjoyed using the Observability Interface, especially the way it clusters top neuron activations. I have explored debugging model letter counting errors using SAE for a while, and finally got it working in Transluce! It turns out it has a similar shape to fixing the 9.8, 9.11 comparison mistake.
Here are the steps I took.
Asked “How many rs are in berry?” → Llama said ZERO
↓Used the Observability Interface to visualize neuron activations
↓Found clusters related to Currency/Indian cities (rupee Rs)
↓Suppressed those neurons
Llama correctly answered: “There are 2 Rs in berry”
I really enjoyed using the Observability Interface, especially the way it clusters top neuron activations. I have explored debugging model letter counting errors using SAE for a while, and finally got it working in Transluce! It turns out it has a similar shape to fixing the 9.8, 9.11 comparison mistake.
Here are the steps I took.
↓Used the Observability Interface to visualize neuron activations
↓Found clusters related to Currency/Indian cities (rupee Rs)
↓Suppressed those neurons
See screenshots here: https://x.com/FirebirdWen/status/1904285942095213015