links 4/4/25: https://roamresearch.com/#/app/srcpublic/page/04-04-2025
https://www.science.org/doi/10.1126/science.adr3675 [[neuroscience]]
thalamic nuclei are involved in [[consciousness]], active in a task that distinguishes conscious from unconscious visual perception
https://www.precigenetics.com/ nondestructive image-based epigenetics of single cells. how do they do it??? founded by [[Parmita Mishra]]
https://transformer-circuits.pub/2025/attribution-graphs/methods.html [[mechanistic interpretability]] [[AI]]
individual neurons are polysemantic; not great for interpretability. we need larger interpretable “circuits” or chunks that have a definite purpose.
“cross-layer transcoders” (CLTs) read from the residual stream at one layer and write to all following layers; they are trained to replicate the outputs of the original model’s MLPs. the CLT’s feature activations at layer l come from multiplying the layer-l encoder matrix by the residual stream activations at layer l (and applying a nonlinearity). but we also have per-layer decoder matrices, so we can run this in reverse: the same features give an estimate of the original model’s MLP output at layer l and at every later layer.
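a minimal numpy sketch of that read/write pattern (all shapes, variable names, and the plain-ReLU nonlinearity are illustrative stand-ins, not the paper’s actual setup):

```python
import numpy as np

# toy dimensions (illustrative only)
n_layers, d_model, n_features = 4, 16, 64
rng = np.random.default_rng(0)

# one encoder per layer: reads the residual stream at layer l
W_enc = rng.normal(size=(n_layers, n_features, d_model)) * 0.1
# one decoder per (source layer, target layer >= source layer) pair:
# a feature read at layer l writes to the MLP output of every later layer
W_dec = rng.normal(size=(n_layers, n_layers, d_model, n_features)) * 0.1

def clt_features(resid, layer):
    """Feature activations for one token: nonlinearity(W_enc^l @ x^l)."""
    return np.maximum(W_enc[layer] @ resid, 0.0)   # plain ReLU stand-in

def clt_mlp_estimate(all_feats, target_layer):
    """Estimated MLP output at target_layer: sum of decoder writes
    from features read at every layer l <= target_layer."""
    out = np.zeros(d_model)
    for l in range(target_layer + 1):
        out += W_dec[l, target_layer] @ all_feats[l]
    return out

resid_streams = rng.normal(size=(n_layers, d_model))               # x^l per layer
feats = [clt_features(resid_streams[l], l) for l in range(n_layers)]
mlp_hat = [clt_mlp_estimate(feats, l) for l in range(n_layers)]    # estimated MLP outputs
```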
then you train the CLTs (the underlying model stays frozen), minimizing a loss given by the squared error between the estimated and actual MLP outputs at each layer, plus a sparsity penalty on the CLT feature activations.
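roughly, the per-token objective looks like this (a plain L1 penalty stands in for the paper’s actual sparsity function, and mlp_true stands for the frozen model’s real MLP outputs):

```python
import numpy as np

def clt_loss(mlp_hat, mlp_true, feats, sparsity_coeff=1e-3):
    """Reconstruction + sparsity objective for training the transcoders.
    mlp_hat / mlp_true: per-layer estimated vs. actual MLP outputs.
    feats: per-layer CLT feature activations."""
    recon = sum(np.sum((h - t) ** 2) for h, t in zip(mlp_hat, mlp_true))
    sparsity = sparsity_coeff * sum(np.sum(np.abs(a)) for a in feats)
    return recon + sparsity
```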
we can then replace the original network with a network made of these features (rather than the original neurons), which are sparsely activated because we trained them to be. the sparsity means we’re getting rid of (much of the) polysemanticity, and the new units “mean” more uniquely defined things.
new network:
Its input is the concatenated set of one-hot vectors for each token in the prompt.
Its neurons are the union of the CLT features active at every token position.
Its weights are the summed interactions over all the linear paths from one feature to another, including via the residual stream and through attention, but not passing through MLP or CLT layers. Because attention patterns and normalization denominators are frozen, the impact of a source feature’s activation on a target feature’s pre-activation via each path is linear in the activation of the source feature. We sometimes refer to these as “virtual weights” because they are not instantiated in the underlying model (a sketch of the residual-stream component follows this list).
Additionally, it has bias-like nodes corresponding to error terms, with a connection from each bias to each downstream neuron in the model.
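for the residual-stream-only component of those virtual weights, the picture is just a dot product between one feature’s decoder writes and another feature’s encoder read. the sketch below reuses the made-up shapes from above and ignores attention-mediated paths and layernorm scaling, so it is only an illustration:

```python
import numpy as np

def virtual_weight_resid_path(enc_row_tgt, dec_cols_src, src_layer, tgt_layer):
    """Residual-stream component of the virtual weight from a source
    feature (read at src_layer) to a target feature (read at tgt_layer).
    enc_row_tgt:  (d_model,) encoder row of the target feature.
    dec_cols_src: (n_layers, d_model) the source feature's decoder columns,
                  indexed by the layer whose MLP output they write into."""
    # writes at layers src_layer..tgt_layer-1 have entered the residual
    # stream by the time the target feature's encoder reads it
    write = dec_cols_src[src_layer:tgt_layer].sum(axis=0)
    return float(enc_row_tgt @ write)

# the attribution-graph edge is then (source activation) * (virtual weight):
# with attention patterns and normalization denominators frozen, the effect
# is linear in the source feature's activation
```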
from this they build an “attribution graph”: the input string is encoded into tokens, each token’s activation can “flow” toward one or more feature nodes, the edge weights tell you which paths from input to output are important, and you can observe that features do things like “continue this acronym” or “say d”
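one toy way to see the “which paths are important” part: if the graph’s nodes are topologically ordered (edges only point forward), the total direct-plus-indirect influence of any node on the output is a finite sum of powers of the adjacency matrix. this is just an illustration of path-summing, not the paper’s pruning procedure:

```python
import numpy as np

# toy attribution graph: nodes 0-1 are token embeddings, 2-3 are features,
# node 4 is the output logit; A[j, i] is the direct edge weight i -> j
A = np.zeros((5, 5))
A[2, 0] = 0.8   # token 0 -> feature 2
A[3, 2] = 0.6   # feature 2 -> feature 3
A[4, 3] = 0.9   # feature 3 -> output
A[4, 1] = 0.1   # token 1 -> output directly

# influence summed over paths of every length: A + A^2 + A^3 + ...
# for a DAG this series is finite and equals (I - A)^-1 - I
total = np.linalg.inv(np.eye(5) - A) - np.eye(5)
print(total[4, 0])   # token 0's total influence on the output: 0.8*0.6*0.9
```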
features can be associated with words, with textual patterns like “numbers ending in 6” or “first two letters of an acronym”, or with concepts (like sports)
there are also “say [string]” features that activate immediately before the model outputs [string]
you can test interpretations by replacing a layer’s MLP output with its feature-decoder reconstruction and suppressing given features; this changes what the LLM outputs in predictable ways.
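a toy version of that kind of intervention, reusing the made-up CLT shapes from the sketches above (the real experiments run these substitutions inside the actual transformer and look at the resulting outputs):

```python
import numpy as np

def mlp_output_with_intervention(feats, W_dec, target_layer, suppress=None):
    """Reconstructed MLP output at target_layer from CLT features,
    with an optional list of (layer, feature_index) pairs zeroed out."""
    feats = [f.copy() for f in feats]
    for (l, i) in (suppress or []):
        feats[l][i] = 0.0
    out = np.zeros(W_dec.shape[2])            # shape[2] = d_model in the sketch above
    for l in range(target_layer + 1):
        out += W_dec[l, target_layer] @ feats[l]
    return out

# compare behavior with and without a feature, e.g.:
#   baseline = mlp_output_with_intervention(feats, W_dec, layer)
#   ablated  = mlp_output_with_intervention(feats, W_dec, layer, suppress=[(2, 17)])
# the difference propagates through the rest of the frozen model and should
# shift the output in the direction the feature's interpretation predicts
```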