Riya Tyagi

Karma: 94

Riya Tyagi 18 May 2026 9:40 UTC
7 points
0
in reply to: Jessica Rumbelow’s comment on: An Introduction to Exemplar Partitioning for Mechanistic Interpretability
Thanks so much for the detailed response! I certainly agree that human-interpretable features don’t necessarily describe what the model is doing, which is the point of using unsupervised methods that impose fewer assumptions. I think EP is an awesome new one!
> Your happiness example is a really good one – consider that it could just actually be the case that the model has learned two different representations of happiness, and that enforcing sparsity (i.e. pushing for a discrete happiness subspace, because that is what makes sense to us) is imposing an assumption that makes the feature look better to us, but might not represent what the model is actually doing.
I agree the model could have two distinct representations that should be in separate regions. I think sparsity doesn’t mean imposing human-interpretable structure; it means “each input activates few features”, so we’re perhaps more likely to find what the model is actually doing. We don’t want to push for a happiness subspace in particular, just for separable subspaces in general, even if they represent very complex concepts.
I’d be curious about trying to make each input close to only a few exemplars, since an input could otherwise be roughly equidistant from many. This has many tradeoffs, but maybe you could do something like keeping the join threshold at p, but requiring new exemplars to be > kp from all existing exemplars (leaders) for some k > 1? Then each input activation is close to at most one exemplar and far from the rest. Activations in the gray zone, whose nearest exemplar is between p and kp away, would be skipped when building the dictionary but could be assigned at inference. Since EP already computes the full distance vector to decide whether an input should be an exemplar, this wouldn’t be much more expensive.
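To make the proposal above concrete, here is a minimal sketch of what I have in mind. It is not EP's actual implementation: the function names, the Euclidean distance, and the nearest-exemplar rule at inference are all my assumptions. One caveat I'm fairly sure of: if the distance is a true metric, the triangle inequality only guarantees "within p of at most one exemplar" when k ≥ 2, since exemplars that are more than kp apart leave any point within p of one of them more than (k − 1)p from the other.

```python
import math

def build_gated_exemplars(acts, p, k):
    """One streaming pass with join threshold p and creation threshold k*p.

    Hypothetical sketch: Euclidean distance is an assumption, not
    necessarily the distance EP uses.
    Returns (exemplars, labels); labels[i] == -1 marks a skipped
    gray-zone activation.
    """
    exemplars, labels = [], []
    for x in acts:
        # Full distance vector to every existing exemplar.
        dists = [math.dist(x, e) for e in exemplars]
        if not dists:
            # The first activation always becomes an exemplar.
            exemplars.append(x)
            labels.append(0)
            continue
        j = min(range(len(dists)), key=dists.__getitem__)
        if dists[j] <= p:
            labels.append(j)                   # join the nearest region
        elif dists[j] > k * p:
            exemplars.append(x)                # far from everything: new exemplar
            labels.append(len(exemplars) - 1)
        else:
            labels.append(-1)                  # gray zone: skip while building
    return exemplars, labels

def assign_nearest(x, exemplars):
    """Inference-time assignment (assumed rule): index of and distance to the nearest exemplar."""
    dists = [math.dist(x, e) for e in exemplars]
    j = min(range(len(dists)), key=dists.__getitem__)
    return j, dists[j]

# Toy 2-D activations with p = 1 and k = 2.
acts = [(0.0, 0.0), (0.5, 0.0), (1.5, 0.0), (3.0, 0.0)]
exemplars, labels = build_gated_exemplars(acts, p=1.0, k=2.0)
print(exemplars, labels)
```

In the toy run, the third point sits 1.5 from the only exemplar, so it is skipped while building but can still be handed to `assign_nearest` afterwards.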
> These are really excellent questions that I don’t know the answer to yet. I am right this moment working on making the GitHub repo as friendly as possible to other researchers, so would encourage you to get stuck in if you’re keen!
Awesome! Is there a place where people are discussing EP? Maybe an OSMI Slack channel would be good for this.

Riya Tyagi 17 May 2026 1:42 UTC
7 points
0
on: An Introduction to Exemplar Partitioning for Mechanistic Interpretability
Very cool! Mainly I worry about whether EP regions (or activation clustering in general) can actually find feature subspaces, in the sense that for concept X, every activation expressing X clusters here, and nothing in here fails to express X.
For retrieving representative examples of a concept or identifying causal interventions, we already have cheap supervised methods, e.g. contrastive steering vectors. Do we necessarily care about unsupervised methods here if they don’t outperform those? To me, the value of unsupervised clustering seems to be in finding feature subspaces.
I thus worry about not encouraging sparsity / some equivalent. Without sparsity, you might get one region capturing happiness ∧ yellow ∧ sunshine and another capturing happiness ∧ delicious ∧ tasty. Both separate happiness from sadness, but neither is a “happiness subspace” since the concept is split across regions.
Things I’m curious about:
- Would love to just inspect the regions and see what’s clustered together
- How good is the mean activation / exemplar of an arbitrary region at separating positive and contrastive examples of a concept? Currently the AxBench results select the best region, which only shows that a good region exists, not that every region is good
- For each best EP region selected for an AxBench concept, collect a ton of new activations related to the concept and see if they are < 2p from each activation in the region
- What is the TPR/TNR on AxBench using the threshold p? Since region membership is hard yes/no when actually seeing which region a new activation falls into, AUROC feels less relevant
- Curious about if/how these regions line up with stable regions of activation space: https://arxiv.org/abs/2409.17113
- Do regions with lower average pairwise distance have other nice properties, e.g. higher purity of a concept? Do regions with higher average pairwise distance have higher coverage of a concept?
- How do other clustering methods perform on the purity/coverage axis? E.g. https://arxiv.org/pdf/2510.15987v1 tried k-means and found some interpretable higher-level features; previous work I was part of tried clustering semantic embeddings and found a nice clustering pipeline (albeit a much more complex one)
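A few of the checks in the list above are simple enough to sketch. This is a rough illustration under my own assumptions, not anything taken from the EP codebase: region membership is taken to mean "within p of the region's exemplar", distance is Euclidean, and all function names are hypothetical. The 2p figure comes from the triangle inequality: two activations that are each within p of the same exemplar are at most 2p apart, assuming the distance is a true metric.

```python
import math
from itertools import combinations

def tpr_tnr(exemplar, positives, negatives, p):
    """TPR/TNR when membership is the hard rule dist(x, exemplar) <= p.

    Assumes both lists are non-empty.
    """
    tp = sum(math.dist(x, exemplar) <= p for x in positives)
    tn = sum(math.dist(x, exemplar) > p for x in negatives)
    return tp / len(positives), tn / len(negatives)

def frac_within_2p_of_all(members, new_acts, p):
    """Fraction of new activations that are < 2p from every member of the region."""
    ok = sum(all(math.dist(x, m) < 2 * p for m in members) for x in new_acts)
    return ok / len(new_acts)

def mean_pairwise_distance(members):
    """Average distance over all unordered member pairs (needs at least two members)."""
    pairs = list(combinations(members, 2))
    return sum(math.dist(a, b) for a, b in pairs) / len(pairs)

# Toy 2-D example with p = 1.
exemplar = (0.0, 0.0)
positives = [(0.5, 0.0), (0.0, 0.9), (2.0, 0.0)]
negatives = [(3.0, 0.0), (0.2, 0.0)]
print(tpr_tnr(exemplar, positives, negatives, p=1.0))
print(frac_within_2p_of_all([(0.0, 0.0), (0.5, 0.0)], [(1.0, 0.0), (3.0, 0.0)], p=1.0))
print(mean_pairwise_distance([(0.0, 0.0), (3.0, 4.0), (0.0, 4.0)]))
```

Running `tpr_tnr` over every region rather than only the best one per concept would speak to the "every region is good" question, and `mean_pairwise_distance` could then be plotted against purity and coverage.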
Thanks so much for writing this up! Please correct me on anything I’ve gotten wrong here; it’s entirely plausible I missed something or am making obvious conceptual mistakes.

Test your best methods on our hard CoT interp tasks

daria, Riya Tyagi, Josh Engels and Neel Nanda

26 Mar 2026 19:24 UTC

59 points

2 comments · 19 min read · LW link

Global CoT Analysis: Initial attempts to uncover patterns across many chains of thought

Riya Tyagi, daria, Arthur Conmy and Neel Nanda

13 Jan 2026 20:40 UTC

52 points

0 comments · 18 min read · LW link
