You’re totally right that we’ll probably have low quality on those prompts. But we’re defining quality with respect to the overall prompt distribution, and so as long as prompts that can’t be realistically completed non-injuriously are rare, our average quality won’t take that big a hit.
I was confused by Buck’s response here because I thought we were going for worst-case quality until I realised:
- The model will have low quality on those prompts almost by definition—that’s the goal.
- Given that, we also want to have a generally useful model—for which the relevant distribution is ‘all fanfiction’, not ‘prompts that are especially likely to have a violent continuation’.
- In between those two cases is ‘snippets that were completed injuriously in the original fanfic … but could plausibly have non-violent completions’, which seems like the interesting case to me.
I suppose one possibility is to construct a human-labelled dataset of specifically these cases to evaluate on.
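The averaging argument in the first comment can be made concrete with a toy calculation. All numbers below are illustrative assumptions, not measured values:

```python
# Toy sketch: why rare hard prompts barely move average quality.
# p_hard, q_hard, q_easy are made-up illustrative numbers.

def average_quality(p_hard: float, q_hard: float, q_easy: float) -> float:
    """Expected quality over the whole prompt distribution, where a
    fraction p_hard of prompts can't realistically be completed
    non-injuriously (and so score q_hard there)."""
    return p_hard * q_hard + (1 - p_hard) * q_easy

# Suppose 1% of prompts are 'unavoidably violent' and the model scores
# 0 quality on them but 0.9 everywhere else:
print(average_quality(0.01, 0.0, 0.9))  # 0.891 — a small hit

# Worst-case quality, by contrast, is dragged to the floor by that
# same 1%: min(0.0, 0.9) = 0.0.
```

This is just the distinction between the mean and the minimum of the per-prompt quality: the mean degrades in proportion to how common the hard prompts are, while the worst case does not care how rare they are.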