It would be valuable to have a dataset of these cases that could be shared privately among researchers, to keep it out of training data; including canary strings would help for the same reason. Would you be interested in seeding it with the cases you’ve recorded? That would enable other analyses, e.g. looking for additional words like ‘recursion’ and ‘ache’ that occur disproportionately often.
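To make the word-frequency idea concrete, here is a rough sketch of one way such an analysis could look: a smoothed log-ratio of word frequencies between the collected cases and some baseline text. The function names, the tokenizer, and the choice of baseline are all my own assumptions for illustration, not an established method for this dataset.

```python
import math
import re
from collections import Counter


def tokenize(text):
    # Crude tokenizer (an assumption): lowercase runs of ASCII letters and apostrophes.
    return re.findall(r"[a-z']+", text.lower())


def overrepresented_words(target_texts, baseline_texts, min_count=2, smoothing=1.0):
    """Rank words by how much more frequent they are in target_texts than in
    baseline_texts, using an add-one-smoothed log frequency ratio."""
    target = Counter(w for t in target_texts for w in tokenize(t))
    baseline = Counter(w for t in baseline_texts for w in tokenize(t))
    vocab = set(target) | set(baseline)

    # Smoothed totals, so words missing from the baseline do not divide by zero.
    t_total = sum(target.values()) + smoothing * len(vocab)
    b_total = sum(baseline.values()) + smoothing * len(vocab)

    scores = {}
    for word, count in target.items():
        if count < min_count:
            continue  # skip rare words, whose ratios are mostly noise
        p_target = (count + smoothing) / t_total
        p_baseline = (baseline[word] + smoothing) / b_total
        scores[word] = math.log(p_target / p_baseline)

    # Highest scores first: the most disproportionately frequent words.
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
```

Given a list of recorded transcripts and a comparable list of ordinary transcripts, `overrepresented_words(cases, ordinary)[:20]` would give a first list of candidates to inspect by hand. The result depends heavily on how well the baseline matches the cases in length and topic.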