I think the RLHF might impede identification of specific named authors, but not group inferences. That’s the sort of distinction that safety training might impose, particularly anti-‘deepfake’ measures: generating a specific author from a text is the inverse of generating a text from a specific author, after all.
You can see in the paper I linked that group inference scales with model capability in a standard-looking way, with the largest/most-capable models doing best and smallest worst, and no inversions which correlate with RLHF/instruction-tuning. RLHF’d GPT-4 is just the best, by a substantial margin, and approaching the ground-truth labels. And so since a specific author is just an especially small group, identifying specific authors ought to work well. And I recall even the early GPT-3s being uncanny in guessing that I was the author from a few paragraphs, and obviously GPT-4 should be even better (as it is smarter, and I’ve continued writing publicly).
But in the past, whenever I’ve tried to get Claude-2 or GPT-4 to ‘write like Gwern’, they usually balk or refuse. Trying an author identification right now in ChatGPT-4 by pasting in the entirety of my most recent ML proposal (SVG generative models), which would not be in the training datasets of anything yet, ChatGPT-4 just spits out a list of ‘famous ML people’ like ‘Ilya Sutskever’ or ‘Daphne Koller’ or ‘Geoffrey Hinton’, most of whom are obviously incorrect as they write nothing like me! (Asking for more candidates doesn’t help much, nor does asking for ‘bloggers’; when I eventually asked it the leading question of whether I wrote it, it agreed I’m a plausible author and explained correctly why, but given acquiescence bias & a leading question, that’s not impressive.)
Of course, this might just reflect the prompts or sampling variability. (The paper is using specific prompts for classification, and also reports low refusal rates, which doesn’t match my experience.) Still, it’s worth keeping in mind that safety training might make models balk at stylometric tasks even if the underlying capability is there.
‘And so since a specific author is just an especially small group’
That’s nicely said.
Another current MATS scholar is modeling this group identification very abstractly as: given a pool of token-generating finite-state automata, how quickly (as it receives more tokens) can a transformer trained on the output of those processes point with confidence to the one of those processes that’s producing the current token stream? I’ve been finding that a very useful mental model.
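To make that mental model concrete, here is a minimal sketch of the idealized version of the question. It is not the scholar's actual setup and involves no transformer: it assumes the pool is made of first-order Markov chains (a simple special case of finite-state processes) and uses an exact Bayesian observer, which is roughly the ceiling a trained predictor could approach. The pool `POOL`, the process names, and the 0.99 confidence threshold are all made up for illustration.

```python
import math
import random

# Hypothetical pool: each process is a first-order Markov chain over tokens
# "a"/"b", written as {current_token: {next_token: probability}}.
# All probabilities are nonzero so log-likelihoods stay finite.
POOL = {
    "A": {"a": {"a": 0.9, "b": 0.1}, "b": {"a": 0.5, "b": 0.5}},
    "B": {"a": {"a": 0.5, "b": 0.5}, "b": {"a": 0.1, "b": 0.9}},
    "C": {"a": {"a": 0.2, "b": 0.8}, "b": {"a": 0.8, "b": 0.2}},
}

def sample_stream(chain, length, rng):
    """Sample `length` tokens from one chain, starting from a uniformly random token."""
    state = rng.choice(sorted(chain))
    tokens = [state]
    for _ in range(length - 1):
        nxt = chain[state]
        state = rng.choices(list(nxt), weights=list(nxt.values()))[0]
        tokens.append(state)
    return tokens

def posterior_trace(pool, tokens):
    """Posterior over processes after each observed transition, from a uniform prior."""
    log_post = {name: -math.log(len(pool)) for name in pool}
    trace = []
    for prev, cur in zip(tokens, tokens[1:]):
        # Bayes update: add each process's log-likelihood of this transition.
        for name, chain in pool.items():
            log_post[name] += math.log(chain[prev][cur])
        # Normalize in log space to avoid underflow on long streams.
        top = max(log_post.values())
        z = sum(math.exp(v - top) for v in log_post.values())
        trace.append({n: math.exp(v - top) / z for n, v in log_post.items()})
    return trace

def transitions_until_confident(trace, name, threshold=0.99):
    """Number of transitions seen when `name` first reaches `threshold`, else None."""
    for i, post in enumerate(trace, start=1):
        if post[name] >= threshold:
            return i
    return None

if __name__ == "__main__":
    rng = random.Random(0)
    for true_name, chain in POOL.items():
        stream = sample_stream(chain, 200, rng)
        n = transitions_until_confident(posterior_trace(POOL, stream), true_name)
        print(true_name, "identified after", n, "transitions")
```

The number of tokens needed depends on how distinguishable the processes are from one another, which is the same reason a group inference can get by on less text than picking one author out of many similar ones.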