Very cool! Mainly I worry about whether EP regions (or activation clustering in general) can actually find feature subspaces, in the sense that for concept X, every activation expressing X clusters here, and nothing in here fails to express X.
For retrieving representative examples of a concept or identifying causal interventions, we already have cheap supervised methods, e.g. contrastive steering vectors. Do we necessarily care about unsupervised methods here if they don’t outperform those? To me, the value of unsupervised clustering seems to be finding feature subspaces.
I thus worry about not encouraging sparsity / some equivalent. Without sparsity, you might get one region capturing happiness ∧ yellow ∧ sunshine and another capturing happiness ∧ delicious ∧ tasty. Both separate happiness from sadness, but neither is a “happiness subspace” since the concept is split across regions.
Things I’m curious about:
Would love to just inspect the regions and see what’s clustered together
How good is the mean activation / exemplar of an arbitrary region at separating positive and contrastive examples of a concept? Currently the AxBench results select the best region, which only shows that a good region exists, not that every region is good
For each best EP region selected for an AxBench concept, collect a ton of new activations related to the concept and see if they are < 2p from each activation in the region
What is the TPR/TNR on AxBench using the threshold p? Since region membership is hard yes/no when actually seeing which region a new activation falls into, AUROC feels less relevant
Curious about if/how these regions line up with stable regions of activation space: https://arxiv.org/abs/2409.17113
Do regions with lower average pairwise distance have other nice properties, e.g. higher purity of a concept? Do regions with higher average pairwise distance have higher coverage of a concept?
How do other clustering methods perform on the purity/coverage axis? E.g. https://arxiv.org/pdf/2510.15987v1 tried k-means and found some interpretable higher-level features; previous work I was part of tried clustering semantic embeddings and found a nice clustering pipeline (albeit a much more complex one)
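To make the threshold-based checks above concrete, here is a rough sketch of what I have in mind. It assumes Euclidean distance, that region membership means "within p of the region's exemplar", and that activations are plain tuples of floats. All function names are hypothetical and not taken from the EP code, so treat this as an illustration rather than how EP actually does it.

```python
import math

def in_region(x, exemplar, p):
    # Hard yes/no membership: assumed to mean "within p of the exemplar".
    return math.dist(x, exemplar) <= p

def tpr_tnr(positives, negatives, exemplar, p):
    # TPR: fraction of concept-positive activations that fall inside the region.
    # TNR: fraction of contrastive activations that fall outside it.
    tp = sum(in_region(x, exemplar, p) for x in positives)
    tn = sum(not in_region(x, exemplar, p) for x in negatives)
    return tp / len(positives), tn / len(negatives)

def frac_within_2p(new_acts, region_members, p):
    # Fraction of new activations that are < 2p from every region member.
    ok = sum(
        all(math.dist(x, m) < 2 * p for m in region_members)
        for x in new_acts
    )
    return ok / len(new_acts)

def mean_pairwise_distance(members):
    # Average distance over all unordered pairs; needs at least two members.
    pairs = [
        math.dist(a, b)
        for i, a in enumerate(members)
        for b in members[i + 1:]
    ]
    return sum(pairs) / len(pairs)
```

Reporting `tpr_tnr` for every region, rather than only the best one, would speak to the "is every region good" question, and `mean_pairwise_distance` could be plotted against purity and coverage per region.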
Thanks so much for writing this up! Please correct me on anything I’ve gotten wrong here; it’s entirely plausible I missed something or am making obvious conceptual mistakes.
Thanks so much for the detailed response! I certainly agree that human-interpretable features don’t necessarily describe what the model is doing, and that this is a reason to use unsupervised methods which impose fewer assumptions. I think EP is an awesome new one!
I agree the model could have two distinct representations that should be in separate regions. I think sparsity doesn’t mean imposing human-interpretable structure, but rather “each input activates few features”, so we’re perhaps more likely to find what the model is actually doing. We don’t want to push for a happiness subspace in particular, just separable subspaces in general, even if they represent very complex concepts. I’d be curious about trying to make each input close to only a few exemplars, since as things stand an input could be roughly equidistant to many. This has many tradeoffs, but maybe you could keep the join threshold at p while requiring new exemplars to be > kp from all existing exemplars for some k > 1? Then each input activation is close to at most one exemplar and far from the rest (assuming the distance obeys the triangle inequality, an input within p of one exemplar is more than (k − 1)p from every other, so k ≥ 2 guarantees it is within p of at most one). Activations in the gray zone between p and kp would be skipped when building the dictionary but could still be assigned at inference. Since EP already computes the full distance vector to decide whether an input should become an exemplar, this wouldn’t be much more expensive.
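Here is a minimal sketch of the two-threshold idea, to make the proposal concrete. It assumes Euclidean distance, a single pass over the activations, and that an input joins an existing region when its nearest exemplar is within p. The names `build_exemplars` and `assign` are hypothetical, and this is not EP's actual implementation.

```python
import math

def build_exemplars(activations, p, k=2.0):
    """Single-pass leader-style clustering with a gray zone between p and k*p."""
    exemplars = []
    skipped = 0
    for x in activations:
        if not exemplars:
            exemplars.append(x)  # first input always becomes an exemplar
            continue
        # Full distance vector to all current exemplars.
        dists = [math.dist(x, e) for e in exemplars]
        nearest = min(dists)
        if nearest <= p:
            continue             # joins an existing region
        if nearest > k * p:
            exemplars.append(x)  # far from everything: new exemplar
        else:
            skipped += 1         # gray zone: ignored while building
    return exemplars, skipped

def assign(x, exemplars):
    """At inference, assign any input (gray zone included) to its nearest exemplar."""
    dists = [math.dist(x, e) for e in exemplars]
    i = min(range(len(dists)), key=dists.__getitem__)
    return i, dists[i]
```

The only change from plain leader clustering is the `nearest > k * p` check, which reuses the distance vector that is already computed; the obvious cost is that gray-zone inputs never get their own exemplar, so coverage may drop as k grows.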
Awesome! Is there a place where people are discussing EP? Maybe an OSMI Slack channel would be good for this.