Gwern’s theories make sense to me. The data was roughly 50/50 on ≤ 30 vs > 30, so that’s where I split it (and I’m only asking the model to pick one of those two options). Sex in the dataset is just male/female; they must have added the other options later (35,829 male, 24,117 female, and 2 blanks, which I ignored). Agreed that this is very much a lower bound, also because I applied zero optimization to the system prompt and user prompts. This is ‘if you do the simplest possible thing, how good is it?’
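For concreteness, the binary split described above can be sketched like this — a minimal, hypothetical illustration with invented sample ages, not the actual pipeline:

```python
# Hypothetical sketch: bucket profiles at the 30-year mark, mirroring the
# roughly 50/50 split described above. The ages here are invented sample data.
ages = [22, 27, 30, 31, 45, 29, 33, 30]

def age_bucket(age: int) -> str:
    """Return the label the model is asked to pick between."""
    return "<=30" if age <= 30 else ">30"

buckets = [age_bucket(a) for a in ages]
counts = {label: buckets.count(label) for label in ("<=30", ">30")}
print(counts)  # {'<=30': 5, '>3 0': 3} would be wrong; actual: {'<=30': 5, '>30': 3}
```

The point of the two-option framing is that the model never sees a free-form age question; it only ever chooses between the two labels.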
No, unfortunately it’s all lowercased already in the dataset.
I agree! Dating-site data is somewhat easy mode. I compared gender accuracy on the Persuade 2.0 corpus of students writing essays on a fixed topic, which I consider very much hard mode, and it was still 80% accurate. So I do think it’s getting some advantage from being in easy mode, but not that much. I’ll also note that I removed a bunch of words that are giveaways for gender, and accuracy only dropped 2 percentage points. So I do think it’s mostly working from implicit cues and distributional differences here rather than easy giveaways. Staab et al. (thanks @gwern for pointing that paper out to me) focuses more on explicit cues, comparing the model against human investigators looking for them, so you may find that interesting as well.
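The giveaway-word removal could look something like the sketch below — the word list and placeholder token are invented for illustration, not the actual list used in the experiment:

```python
import re

# Hypothetical sketch of stripping gender-giveaway words before classification.
# The giveaway list here is invented; the real experiment's list may differ.
GIVEAWAYS = ["husband", "wife", "boyfriend", "girlfriend"]

def mask_giveaways(text: str) -> str:
    """Replace each giveaway word with a neutral placeholder token."""
    pattern = re.compile(r"\b(" + "|".join(GIVEAWAYS) + r")\b")
    return pattern.sub("[removed]", text)

print(mask_giveaways("my wife and i love hiking"))
```

Since the dataset is already lowercased, a case-sensitive match on lowercase words suffices here.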