Curated.

It’s still unclear to me how well interpretability can scale and solve the core problems in superintelligence alignment, but this felt like a good/healthy incremental advance. I appreciated the exploration of feature splitting, the beginnings of testing for universality, and the discussion of the team’s update against architectural approaches. I found this remark at the end interesting:
Finally, we note that in some of these expanded theories of superposition, finding the “correct number of features” may not be well-posed. In others, there is a true number of features, but getting it exactly right is less essential because we “fail gracefully”, observing the “true features” at resolutions of different granularity as we increase the number of learned features in the autoencoder.
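(For concreteness: the autoencoders in question are sparse autoencoders trained to reconstruct a model’s MLP activations, with the number of learned features as a free hyperparameter. Here’s a minimal PyTorch sketch of that setup, my own illustration rather than the paper’s actual code; the dictionary sizes and the l1_coeff value are made up for the example.)

```python
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    """One-layer ReLU autoencoder with an L1 sparsity penalty.

    n_features is the dictionary size -- the "number of learned
    features" the quoted passage is talking about.
    """

    def __init__(self, d_act: int, n_features: int, l1_coeff: float = 1e-3):
        super().__init__()
        self.encoder = nn.Linear(d_act, n_features)
        self.decoder = nn.Linear(n_features, d_act)
        self.l1_coeff = l1_coeff

    def forward(self, x: torch.Tensor):
        # ReLU encoding -> non-negative feature activations, most of them zero.
        f = torch.relu(self.encoder(x))
        x_hat = self.decoder(f)
        # Reconstruction error plus an L1 penalty; the penalty trades
        # reconstruction fidelity for sparser (more interpretable) features.
        loss = ((x_hat - x) ** 2).mean() + self.l1_coeff * f.abs().sum(-1).mean()
        return f, x_hat, loss

# Training otherwise-identical autoencoders at several dictionary sizes is
# how one would observe feature splitting: coarse features at small
# n_features, finer-grained variants of them as n_features grows.
activations = torch.randn(64, 512)  # stand-in for real MLP activations
for n_features in (512, 2048, 8192):
    sae = SparseAutoencoder(d_act=512, n_features=n_features)
    f, x_hat, loss = sae(activations)
```

Sweeping n_features like this is what lets you watch the same underlying structure show up at coarser or finer granularity, which is the “fail gracefully” behavior the quote is pointing at.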
I don’t think I’ve parsed the paper well enough to confidently endorse how well it justifies its claims, but it seemed to pass a bar worth paying attention to for people tracking progress in the interpretability field.
One comment I’d note is that I’d have liked more information about how the feature interpretability evaluation worked in practice. The description is fairly vague. When reading this paragraph:
We see that features are substantially more interpretable than neurons. Very subjectively, we found features to be quite interpretable if their rubric value was above 8. The median neuron scored 0 on our rubric, indicating that our annotator could not even form a hypothesis of what the neuron could represent! Whereas the median feature interval scored a 12, indicating that the annotator had a confident, specific, consistent hypothesis that made sense in terms of the logit output weights.
I can certainly buy that the features were a lot more interpretable than individual neurons, but I’m not sure, at the end of the day, how useful the feature interpretations were in absolute terms.