What are you picturing a “lesion study on GPT” looking like? Naively I imagine something like “train an SAE on the activations at some layer, determine how often features activate together, turn that into a distance metric, do clustering/dimensionality reduction, then ablate clusters of features and see how behavior changes”. But I don’t know that I’d particularly expect that to show the GPT as being made of many more parts than I actually think said GPT is made of. And I don’t have a super clear mental model of how many “parts” a GPT is “made of”, except at the raw mechanical level of layers / attention heads / MLPs / whatever (though I’d expect ablating a particular layer of a transformer is probably more analogous to ablating one particular layer of all cortical columns than to lesioning one particular region of the brain).
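To make the naive pipeline concrete, here's a minimal sketch of the middle steps (co-activation → distance → clustering → cluster ablation), assuming you've already run the SAE and have a matrix of feature activations per token. The data here is a random toy stand-in, the Jaccard-style distance and threshold clustering are just one arbitrary choice among many, and "ablation" is simply zeroing the cluster's features:

```python
import numpy as np

rng = np.random.default_rng(0)
# Toy stand-in for SAE feature activations: (n_tokens, n_features),
# sparse and nonnegative, as SAE features typically are.
acts = rng.random((1000, 32)) * (rng.random((1000, 32)) < 0.2)

# Which features fire on which tokens.
fires = acts > 0
marginal = fires.mean(axis=0)                 # firing rate per feature

# Co-activation frequency: fraction of tokens where both features fire.
co = (fires.T @ fires) / len(fires)           # (n_features, n_features)

# Turn co-activation into a distance (Jaccard-style: features that
# fire together are "close"). One of many possible choices.
union = marginal[:, None] + marginal[None, :] - co
jaccard = co / np.maximum(union, 1e-12)
dist = 1.0 - jaccard
np.fill_diagonal(dist, 0.0)

def cluster_by_threshold(dist, thresh):
    """Single-linkage-style clustering: connected components of the
    graph whose edges are feature pairs with distance < thresh."""
    n = len(dist)
    labels = np.full(n, -1)
    cur = 0
    for i in range(n):
        if labels[i] != -1:
            continue
        stack = [i]
        labels[i] = cur
        while stack:
            j = stack.pop()
            for k in np.where((dist[j] < thresh) & (labels == -1))[0]:
                labels[k] = cur
                stack.append(k)
        cur += 1
    return labels

labels = cluster_by_threshold(dist, thresh=0.9)

def ablate_cluster(acts, labels, cluster_id):
    """The 'lesion': zero out every feature in one cluster. Downstream,
    you'd decode these ablated activations back into the residual
    stream and measure the behavioral change."""
    out = acts.copy()
    out[:, labels == cluster_id] = 0.0
    return out

lesioned = ablate_cluster(acts, labels, cluster_id=0)
```

Whether the resulting clusters would look anything like anatomically separable "parts" (rather than an artifact of the distance metric and threshold) is exactly the open question.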