ValueShift Research comments on Hello, World of Mechanistic Interpretability

ValueShift Research 17 Mar 2026 14:10 UTC
“This makes me wonder if your 5–8 dimension estimate is depth-dependent”. We think that might be true. We used linear probes built from mean-difference vectors and found that the result also depends on the domain. A family of correlated directions seems like a reasonable suggestion.
Our dataset was rather small, though, so we are rerunning the same experiment with more data to check whether the results reproduce.
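
For readers unfamiliar with the setup, a mean-difference probe can be sketched roughly as below. This is only an illustration on synthetic data, not the actual pipeline: the function names, the midpoint threshold, and the Gaussian stand-in for activations are all assumptions made for the example.

```python
import numpy as np

def mean_difference_direction(pos_acts, neg_acts):
    """Unit vector pointing from the negative-class mean to the positive-class mean."""
    diff = pos_acts.mean(axis=0) - neg_acts.mean(axis=0)
    return diff / np.linalg.norm(diff)

def probe_accuracy(pos_acts, neg_acts, direction):
    """Project onto the direction and threshold at the midpoint of the two class means."""
    pos_proj = pos_acts @ direction
    neg_proj = neg_acts @ direction
    threshold = 0.5 * (pos_proj.mean() + neg_proj.mean())
    correct = (pos_proj > threshold).sum() + (neg_proj <= threshold).sum()
    return correct / (len(pos_proj) + len(neg_proj))

# Synthetic stand-in for activations at one layer (hypothetical data).
rng = np.random.default_rng(0)
d_model = 64
true_dir = np.zeros(d_model)
true_dir[0] = 1.0  # the planted concept direction
pos = rng.normal(size=(200, d_model)) + 3.0 * true_dir
neg = rng.normal(size=(200, d_model)) - 3.0 * true_dir

direction = mean_difference_direction(pos, neg)
# Note: accuracy is measured on the same samples used to fit the direction,
# so it is optimistic; a held-out split would be needed for a real estimate.
acc = probe_accuracy(pos, neg, direction)
```

Repeating this per layer and per domain gives one direction each, which is what makes it possible to ask how strongly the directions correlate.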

Our models were also small: Qwen3-0.6B and Llama-3.1-8B. Interestingly, our results may be similar to yours: our most prominent layers are 19-23. One caveat is that we pruned the last 20% of layers, based on prior work suggesting that earlier layers encode more abstract concepts while later layers encode exact tokens. We may relax this assumption and run more tests on all layers.
Dimensionality seems to vary from layer to layer, which is intuitively expected, but we want stronger evidence before claiming it.
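
To make the layer restriction concrete, here is a minimal sketch that reads "pruned" as "left out of the analysis". The helper name `kept_layers`, the rounding rule, and the 32-layer example are assumptions for illustration, not a description of the exact procedure.

```python
def kept_layers(num_layers, drop_fraction=0.2):
    """Return the 0-based indices of layers that remain after dropping the last fraction."""
    n_drop = round(num_layers * drop_fraction)  # rounding convention is an assumption
    return list(range(num_layers - n_drop))

# Hypothetical 32-layer model: the final 6 layers are left out.
layers = kept_layers(32)
```

Relaxing the assumption amounts to calling the helper with `drop_fraction=0.0`, which keeps every layer.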

We used residual streams.