Their results are for document embeddings (which are often derived from LLMs), not internal activation spaces in LLMs. But I suspect that if we tested their method on the internal activation spaces of different LLMs, at least ones of similar sizes and architectures, then we might find similar results. Someone really should test this and publish the results: it should be pretty easy to replicate what they did and plug in internal activations from various LLMs.
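As a rough illustration of what "plugging LLM activations in" could look like, here is a minimal sketch (not the authors' code) that extracts mean-pooled midpoint-layer hidden states from two different LLMs via the Hugging Face transformers API. The model names, the midpoint-layer choice, and mean pooling are all my own illustrative assumptions; the resulting vectors would simply stand in for the document embeddings in their setup.

```python
# Sketch: mid-layer activation vectors from two LLMs, as a stand-in for
# the document embeddings used in the paper. Assumptions: mean pooling,
# the midpoint layer, and these particular (small) model choices.
import torch
from transformers import AutoModel, AutoTokenizer

def midlayer_activations(model_name: str, texts: list[str]) -> torch.Tensor:
    tok = AutoTokenizer.from_pretrained(model_name)
    if tok.pad_token is None:
        tok.pad_token = tok.eos_token  # GPT-style tokenizers lack a pad token
    model = AutoModel.from_pretrained(model_name, output_hidden_states=True)
    model.eval()
    batch = tok(texts, return_tensors="pt", padding=True, truncation=True)
    with torch.no_grad():
        out = model(**batch)
    mid = len(out.hidden_states) // 2           # pick the midpoint layer
    hidden = out.hidden_states[mid]             # (batch, seq, d_model)
    mask = batch["attention_mask"].unsqueeze(-1)
    return (hidden * mask).sum(1) / mask.sum(1)  # mean-pool over real tokens

# Two unpaired corpora, one per model -- no correspondence information needed.
corpus_a = ["The cat sat on the mat.", "Paris is the capital of France."]
corpus_b = ["Stock prices fell sharply today.", "Water boils at 100 degrees."]
acts_a = midlayer_activations("gpt2", corpus_a)
acts_b = midlayer_activations("EleutherAI/pythia-160m", corpus_b)
# acts_a and acts_b would then play the role of the two embedding sets
# fed to a vec2vec-style unsupervised translator.
```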
If that turns out to be true to a significant extent, then this seems like it should be quite useful for:
a) understanding why jailbreaks often transfer fairly well between models
b) supporting ideas around natural representations
c) letting you do various forms of interpretability in one model and then searching for similar circuits/embeddings/SAE features in other models
d) extending techniques like the logit lens
e) comparing and translating between LLMs' internal embedding spaces and the latent space inherent in human language (their result clearly demonstrates that there is a latent space inherent in human language). This is a significant chunk of the entire interpretability problem: it lets us see inside the black box, so that's a pretty key capability.
f) if you have a translation between two models (say, of their activation vectors at their midpoint layers), then by comparing roundtripping from model A to model B and back against roundtripping from model A to the shared latent space and back, you can identify which concepts model A understands that model B doesn't, and similarly in the other direction (a sketch of this comparison follows below). That seems like a very useful ability.
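A hypothetical sketch of the comparison in (f). It assumes you already have trained maps a_to_b / b_to_a between the two models' activation spaces and a_to_latent / latent_to_a into the shared latent space; none of these names come from the paper, they stand in for whatever translator you trained.

```python
# Sketch: flag concepts model A represents that model B seems to lack, by
# comparing the A->B->A roundtrip error against the A->latent->A roundtrip
# error for each activation vector. All translator functions are assumed
# placeholders, not anything from the paper.
import torch
import torch.nn.functional as F

def roundtrip_gap(acts_a, a_to_b, b_to_a, a_to_latent, latent_to_a):
    """Per-example gap between the two roundtrip errors. Large positive
    values mark activations that survive A's own latent roundtrip but are
    mangled when forced through model B's space -- candidate concepts that
    A understands and B does not."""
    via_b = b_to_a(a_to_b(acts_a))
    via_latent = latent_to_a(a_to_latent(acts_a))
    err_b = F.mse_loss(via_b, acts_a, reduction="none").mean(dim=-1)
    err_latent = F.mse_loss(via_latent, acts_a, reduction="none").mean(dim=-1)
    return err_b - err_latent

# gap = roundtrip_gap(acts_a, a_to_b, b_to_a, a_to_latent, latent_to_a)
# Then inspect the inputs with the largest gap values.
```

Swapping the roles of A and B gives you the other direction.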
Of course, their approach requires zero information about which embeddings for model A correspond to or are similar to which embeddings for model B: their translation model learns all that from patterns in the data, and rather well, according to their results. However, since in practice you often do have partial information about which embeddings correspond, it shouldn't be hard to supplement their approach so that it makes use of that information in addition to the structures inherent in the data.
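One simple way such a supplement could look, sketched below: add a supervised term for the small subset of pairs you do know correspond, on top of an unsupervised cycle-consistency term. This is only an illustrative stand-in for their actual training objective (which also involves adversarial and geometry-preserving terms); translator_ab, translator_ba, and the loss weight are assumed placeholders.

```python
# Sketch: combining an unsupervised cycle-consistency loss with a small
# amount of partial pairing supervision. Not the paper's training code.
import torch
import torch.nn.functional as F

def combined_loss(translator_ab, translator_ba,
                  acts_a, acts_b,        # unpaired batches from A and B
                  paired_a, paired_b,    # the few pairs known to correspond
                  paired_weight: float = 1.0):
    # Unsupervised signal: A->B->A (and B->A->B) should roundtrip cleanly.
    cycle = (F.mse_loss(translator_ba(translator_ab(acts_a)), acts_a) +
             F.mse_loss(translator_ab(translator_ba(acts_b)), acts_b))
    # Partial supervision: where correspondence is known, match it directly.
    paired = F.mse_loss(translator_ab(paired_a), paired_b)
    return cycle + paired_weight * paired
```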