Micro-experiment: how is gender represented in Gemma-2-2B?
I’ve been playing this week with Neuronpedia’s terrific interface for building attribution graphs, and got curious about how gender is handled by Gemma-2-2B. I built attribution graphs (which you can see and manipulate yourself) for John was a young… and Mary was a young… As expected, the most likely completions are gendered nouns: for the former it’s man, boy, lad, and child; for the latter it’s woman, girl, lady, and mother[1].
Examining the most significant features contributing to these conclusions, many are gender-related[2]. There are two points of note:
Gender features play a much stronger role in the ‘woman’ output than the ‘man’ output. As a quick metric for this, examining all features which increase the activation of the output feature by at least 1.0, 73.8% of the (non-error) activation for ‘woman’ is driven by gender features, compared to 28.8% for ‘man’[3].
Steering of as little as −0.1x on the group of female-related features switches the output from ‘woman’ to ‘man’, whereas steering of −0.5x or −1.0x on the group of male-related features fails to switch the model to a female-noun output[4].
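As a rough sketch of how the first metric is computed (the feature labels and attribution values below are invented for illustration; the real numbers come from the attribution graph’s edges into the output node):

```python
# Toy sketch: fraction of (non-error) attribution driven by gender features.
# Feature names and values are made up; in practice they come from the
# attribution graph's incoming edges to the output feature.

def gender_fraction(attributions, gender_features, threshold=1.0):
    """Attribution from gender-labeled features, as a fraction of the total
    attribution from all features at or above the threshold."""
    strong = {f: a for f, a in attributions.items() if a >= threshold}
    total = sum(strong.values())
    gendered = sum(a for f, a in strong.items() if f in gender_features)
    return gendered / total

# Invented numbers, chosen to mirror the asymmetry described above.
woman_attr = {"female-related": 5.9, "young": 1.1, "people": 1.0}
man_attr = {"male-related": 2.3, "young": 3.2, "people": 2.5}
gender = {"female-related", "male-related"}

print(gender_fraction(woman_attr, gender))  # 0.7375 (~73.8%)
print(gender_fraction(man_attr, gender))    # 0.2875 (~28.8%)
```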
In combination, these two points suggest to me that maleness is the default for humans in Gemma-2-2B, whereas the activation of female-related features is required to change that to femaleness. [Update: based on a quick spot check, most likely output gender seems to vary by language, with German and French defaulting to male but Spanish defaulting to female; it would be interesting to trace circuits for those as well. See replies for details.]
Note: I seem to vaguely recall seeing a mech interp paper which found something similar, that maleness was the default in the LLM they looked at, but if so, a quick search has failed to find it. I’m not claiming any originality to this result; it just seemed like a fairly quick and interesting thing to look at with the attribution graph interface.
Thanks to everyone involved in developing the attribution graph technique, the recent open source library, and the lovely Neuronpedia interface. I recommend trying it yourself; it’s surprisingly accessible (but read the user guide and/or watch the intro first)! Links to a few others I created are here.
Scattered among these are a couple of irrelevant non-noun completions, “,” and “ and”.
Others relate to ‘young’ (which I used to push it toward a gendered noun, since ‘young woman’ / ‘young man’ are such common collocations) and general ‘people’ features.
I’m not sure that this is the best metric to use here, but it seems reasonable enough for a small informal experiment. As an alternate metric, the raw activation increase from gender features on ‘woman’ is 2.4x the activation increase on ‘man’. Input welcome on better metrics to use.
Oddly enough, steering of positive 0.5x on maleness causes the LLM to say that John is a young, energetic cat.
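For what it’s worth, the steering operation itself is mechanically simple: assuming “−0.1x” means replacing each selected feature’s activation with −0.1 times its original value, the net effect on the residual stream is a rescaled decoder direction. A toy numpy sketch (all shapes and values invented):

```python
import numpy as np

def steering_delta(feature_acts, decoder, steer_idx, scale):
    """Residual-stream change from rescaling selected feature activations.
    decoder maps features -> residual stream (feature_acts @ decoder)."""
    steered = feature_acts.copy()
    steered[steer_idx] *= scale
    return (steered - feature_acts) @ decoder

rng = np.random.default_rng(0)
decoder = rng.normal(size=(4, 3))      # 4 toy features, residual dim 3
acts = np.array([2.0, 0.5, 1.0, 0.0])  # feature 0 plays the gender role

delta = steering_delta(acts, decoder, steer_idx=[0], scale=-0.1)
# Scaling by -0.1x moves the feature from 2.0 to -0.2, so the residual
# stream shifts by -2.2 times feature 0's decoder direction.
assert np.allclose(delta, -2.2 * decoder[0])
```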
As I understand it, the idea that male==default (in modern Western society) has been a part of feminist theory since the early days. Interesting that LLMs have internalized this pattern.
It’s vaguely reminiscent of how “I’m just telling it like it is”==”right wing dog whistle” was internalized by Grok, leading to the MechaHitler incident.
For what it’s worth, I’m not particularly trying to make a point about society (I’ve rarely found it useful to talk about potentially controversial social issues online); it just seemed like an interesting and relatively straightforward thing to look at. I would guess that the reason it’s represented this way in Gemma is just that in the training data, when texts invoked gender, it was more often male (and that it’s more computationally efficient to treat one gender or the other as the default). There are presumably explanations for why men appeared more often than women in the training data, but those are thoroughly out of scope for me.
I don’t find it surprising. For example, IIRC in German 1⁄2 of nouns are male, 1⁄3 are female, and 1⁄6 are neuter. I’d expect similar correlations/frequencies in English and other European languages, but they’d be harder to spot if the language doesn’t have gendered nouns.
After checking random words I noticed the bias is the other way around and female is more likely. Google gave me the same. Now I am confused.
That is, checking the gender of randomly chosen German words?
I queried my brain (I am German) and noticed my claim doesn’t predict the result. Then I checked online: I had male and female backwards from what I read in a dictionary once.
Oh, got it! Now I’m curious whether LLMs make different default gender assumptions in different languages. We know that much of the circuitry in LLMs isn’t language-specific, but there are language-specific bits at the earliest and latest layers. My guess is that LLMs tend to learn a non-language-specific gender assumption which is expressed mostly in the non-language-specific circuitry, with bits at the end to fill in the appropriate pronouns and endings for that gender. But I also find it plausible that gender assumptions are handled in a much more language-specific way, in which case I’d expect more of that circuitry to be in the very late layers. Or, of course, it could be a complicated muddle of both, as is so often the case.
Most of the maleness/femaleness features I found were in the final 4-6 layers, which perhaps lends some credence to the second hypothesis there, that gender is handled in a language-specific way — although Gemma-2-2B is only a 26-layer model, so (I suspect) that’s less of a giveaway than it would be in a larger model.
[EDIT — looks like my high school language knowledge failed me pretty badly; I would probably ignore this subthread]
Quick spot check on three gendered languages that I’m at least slightly familiar with:
German: ‘Die Person war ein junger’. 81% male (Mann, Mensch, Herr) vs 7% ungendered, 0% female.
Spanish: ‘Esa persona era mi’. 25% female (madre, hermana) vs 16% male (padre, hermano).
French: ‘Cette enveloppe contient une lettre de notre’. 16% male (ami, correspondant, directeur) vs 7% ungendered, 0% female.
So that tentatively suggests to me that it does vary by language (and even more tentatively that many/most gender features might be language-specific circuitry). That said, my knowledge of German and French is poor; those sentences might tend to suggest a particular gender in ways I’m not aware of, different sentences might cause different proportions, or we could be encountering purely grammatical defaults (in the same way that in English, male forms are often the grammatical default, e.g. waiter vs waitress). So this is at best suggestive.
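The bookkeeping behind these percentages is just summing next-token probabilities over hand-labeled groups of completions. A sketch with stubbed probabilities for the Spanish prompt (the stub values are invented to mirror the numbers above; the real ones would come from Gemma-2-2B’s softmax over the vocabulary):

```python
# Stubbed next-token probabilities for 'Esa persona era mi'; values are
# invented so the category sums match the 25% female vs 16% male above.
probs = {"madre": 0.15, "hermana": 0.10,
         "padre": 0.10, "hermano": 0.06,
         "casa": 0.04, ",": 0.02}

categories = {"female": {"madre", "hermana"},
              "male": {"padre", "hermano"}}

def category_mass(probs, tokens):
    """Total probability mass the model puts on a set of completions."""
    return sum(probs.get(t, 0.0) for t in tokens)

for name, toks in categories.items():
    print(f"{name}: {category_mass(probs, toks):.0%}")
# female: 25%
# male: 16%
```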
In French if you wanted to say e.g. “This person is my dad”, you would say “Cette personne est mon père”, so I think using “ma” here would be strongly biasing the model towards female categories of people.
Oh, of course. A long ago year of French in high school is failing me pretty badly here...
Can you think of a good sentence prefix in French that wouldn’t itself give away gender, but whose next word would clearly indicate an actual (not just grammatical) gender?
Edited (with a bit of help from people with better French) to:
French: ‘Cette enveloppe contient une lettre de notre’. 16% male (ami, correspondant, directeur) vs 7% ungendered, 0% female.
(feel free to let me know if that still seems wrong)
Your German also gives away the gender. You should probably use a language model to double-check your sentences.
Damn. Sadly, that’s after running them through both GPT-5-Thinking and Claude-Opus-4.1.
Can you suggest a better German sentence prefix?
It is hard to do as a prefix in German, I think. It sounds a bit antiquated to me, but you could try “Jung war X”. But yes, in general, I think you are going to run into problems here because German inflects a lot of words based on the gender.
Come to think of it, I suspect the Spanish may have the same problem.