I adjusted the text about the Goodfire discovery a bit, so it hopefully addresses your main concern, which is likely valid.
But more generally, the entire point of that discovery is that the manifold is compact and can be detected, not an abstract realization that the model somehow represents distances. I think it is evidence for the idea that the model does use these representations. In any case, that’s a finding about how foundation models organize features geometrically, and it has methodological implications for interpretability under any interpretation of what the features themselves represent.
Regardless, the Goodfire result is in my post more for storytelling than as load-bearing evidence.
I agree this is evidence that the model uses these representations[1], I just think it is weak evidence, because you’d expect the same thing to show up even if the model doesn’t use these representations. (And given that it is weak evidence, we roughly speaking have to ignore it and stick with our priors from other considerations when judging whether the model uses these representations.) I don’t just think that the manifold existing isn’t that meaningful; I also think the manifold being “compact”[2] or detectable isn’t that meaningful either. In my toy naive DNA embedding example, I think one can read off phylogenetic distances from projections of points onto a completely random low-dimensional subspace.[3] When I’m playing a sufficiently high-dimensional video game, I’d guess that there is a simple probe on RAM activations which distinguishes with decent accuracy whether I’m really tired vs. not at all tired. I’m not that familiar with how RAM positions are typically allocated to variables, but I’d pretty strongly guess that you can usually do this with a linear probe, i.e. that there is a one-dimensional subspace of my video game program’s activations which captures this property. I would also bet that such a subspace is findable by gradient descent from a random seed.[4] And yet: the naive DNA representation doesn’t use phylogeny at all, and my video game program doesn’t think about whether I’m tired at all.
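To illustrate the linear-probe claim, here is a minimal toy sketch (the setup, dimensions, and the index of the leaking coordinate are all invented for illustration, not taken from any real program): a high-dimensional "state" vector in which one coordinate happens to correlate with a binary property the program never reasons about, and a linear probe trained by plain gradient descent from a random seed that finds it anyway.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n = 200, 2000                               # toy dimensions, made up
labels = rng.integers(0, 2, size=n)            # the hidden binary property
X = rng.normal(size=(n, d))                    # state unrelated to the property
X[:, 7] += 2.0 * (2 * labels - 1)              # one coordinate happens to leak it

# Logistic-loss gradient descent from a random initialization.
w, b = rng.normal(size=d) * 0.01, 0.0
lr = 0.1
for _ in range(500):
    p = 1 / (1 + np.exp(-(X @ w + b)))         # sigmoid predictions
    w -= lr * (X.T @ (p - labels) / n)
    b -= lr * np.mean(p - labels)

acc = np.mean(((X @ w + b) > 0) == labels)
print(f"probe accuracy: {acc:.2f}")
```

The probe ends up nearly as accurate as the noise allows, even though nothing in the "program" ever used the property: the one-dimensional subspace exists and is findable by gradient descent, which is the point.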
[1] Well, because if the method gave a result worse than this, that would be evidence that the model doesn’t use these representations.
[2] By the way, I’m guessing you’re just pointing at the manifold being a low-dimensional subspace?
[3] Roughly speaking, as long as one can read off phylogenetic distances among the n points from distances in the big space of dimension d, one will also be able to read off distances in a random subspace of dimension O(log n), by Johnson–Lindenstrauss. One could claim that the dimension of the subspace one finds in practice is smaller than what one would expect from mechanisms of the sort I’m proposing, but I strongly doubt anyone in this epistemic pipeline has thought this sort of thing through.
[4] It isn’t quite linear regression, but it’s close to it. And linear regression is a convex optimization problem, so (roughly speaking) gradient descent works.
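The Johnson–Lindenstrauss point can be checked numerically on toy data (all numbers below are made-up illustrations, not from the post): project random points from a big ambient space onto a random subspace of dimension roughly log-of-the-point-count, and the pairwise distances survive up to a small distortion.

```python
import numpy as np

rng = np.random.default_rng(1)
n, D, k = 100, 10_000, 200                     # points, ambient dim, random subspace dim
X = rng.normal(size=(n, D))

P = rng.normal(size=(D, k)) / np.sqrt(k)       # random projection, scaled to preserve norms
Y = X @ P

def pairwise(Z):
    # Pairwise Euclidean distances via the Gram-matrix identity
    # |x - y|^2 = |x|^2 + |y|^2 - 2 x.y  (avoids a huge 3-D array).
    sq = (Z ** 2).sum(axis=1)
    d2 = sq[:, None] + sq[None, :] - 2 * Z @ Z.T
    return np.sqrt(np.maximum(d2, 0.0))

d_big, d_small = pairwise(X), pairwise(Y)
mask = ~np.eye(n, dtype=bool)                  # ignore the zero diagonal
ratios = d_small[mask] / d_big[mask]
print(f"distance ratios: {ratios.min():.2f} .. {ratios.max():.2f}")
```

All projected/original distance ratios land close to 1, i.e. the random subspace preserves the distance structure, so "distances are readable from a low-dimensional subspace" comes essentially for free.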