[Question] Does the Universal Geometry of Embeddings paper have big implications for interpretability?

Evan R. Murphy26 May 2025 18:20 UTC

43 points

AI Interpretability (ML & AI)Language Models (LLMs)

Rishi Jha, Collin Zhang, Vitaly Shmatikov and John X. Morris published a new paper last week called Harnessing the Universal Geometry of Embeddings.

Abstract of the paper (bold was added by me):

We introduce the first method for translating text embeddings from one vector space to another without any paired data, encoders, or predefined sets of matches. Our unsupervised approach translates any embedding to and from a universal latent representation (i.e., a universal semantic structure conjectured by the Platonic Representation Hypothesis). Our translations achieve high cosine similarity across model pairs with different architectures, parameter counts, and training datasets.
The ability to translate unknown embeddings into a different space while preserving their geometry has serious implications for the security of vector databases. An adversary with access only to embedding vectors can extract sensitive information about the underlying documents, sufficient for classification and attribute inference.

They focus on security implications of their research, but I am trying to understand: Do these findings have major implications for interpretability research?

It seems like discovering a sort of universal structure that is shared among all LLMs would help a lot for understanding the internals of these models. But I may be misunderstanding the nature of the patterns they are translating and corresponding.

What links here?

Evan R. Murphy's comment on E.G. Blee-Goldman’s Shortform by E.G. Blee-Goldman (26 May 2025 18:21 UTC; 3 points)

Evan R. Murphy26 May 2025 18:20 UTC

43 points

6 comments1 min readLW link

AI Interpretability (ML & AI)Language Models (LLMs)

RogerDearnaley 28 May 2025 6:09 UTC
11 points
0
Their results are for document embeddings (which are often derived from LLMs), not internal activation spaces in LLMs. But I suspect if we tested their method for internal activation spaces of different LLMs, at least ones of similar sizes and architectures, then we might find similar results. Someone really should test this, and publish the paper: it should be pretty easy to replicate what they did and plug various LLM embeddings in.

If that turns out to be true, to a significant extent, this seems like it should be quite useful for:
a) understanding why jailbreaks often transfer fairly well between models
b) supporting ideas around natural representations
c) letting you do various forms of interpretability in one model and then searching for similar circuits/embeddings/SAE features in other models
d) extending technique like the logit lens
e) comparing and translating between LLM’s internal embedding spaces and the latent space inherent in human language (their result clearly demonstrates that the is a latent space inherent in human language). This is a significant chunk of the entire interpretability problem: it lest us see inside the black box, so that’s a pretty key capability.
f) if you have a translation between two models (say of their activation vectors at their midpoint layer), then by comparing roundtripping from model A to model B and back to just roundtripping from model A to the shared latent space and back, you can identify what concepts model A understands that model B doesn’t. Similarly in the other direction. That seems like a very useful ability.
Of course, their approach requires zero information about which embeddings for model A correspond to or are similar to which embeddings for model B: their translation model learns all that from patterns in the data — rather well, according to their results. However, it shouldn’t be hard to supplement their approach, given that you often do have partial information about this, and have it also make use of the structures inherent in the data.
Bogdan Ionut Cirstea 27 May 2025 18:42 UTC
8 points
2
Yes, I do think this should be a big deal, and even more so for monitoring (than for understanding model internals). It should also have been at least somewhat predictable, based on theoretical results like those in I Predict Therefore I Am: Is Next Token Prediction Enough to Learn Human-Interpretable Concepts from Data? and in All or None: Identifiable Linear Properties of Next-token Predictors in Language Modeling.
quetzal_rainbow 28 May 2025 10:12 UTC
4 points
0
The problem here is that sequence embeddings should have tons of side-channels which should convey non-semantic information (like, say, frequencies of tokens in sequence) and you can come a long way with this sort of information.

What would be really interesting is to train embedding models in different languages and check whether you can translate highly metaphorical sentences with no correspondence other than semantic, or train embedding models on different representations of the same math (for example, matrix mechanics vs wave mechanics formulations of quantum mechanics) and see if they recognize equivalent theorems.
TristanTrim 21 Sep 2025 3:09 UTC
1 point
0
I really want to dive into this paper, but it feels to me like it is a big deal, and is approaching from a valuable different direction what Mingwei found in Toward Comparing DNNs with UMAP Tour, that it is the dataset that determines the latent structures.

I think it is promising for interpretability, but I worry it may lead to dangerous capabilities progress if the ability to work with and design these structures manually is developed. One particular worry is what could be done with chain of though in universal latent space.

No comments.