Very cool! How deep into the definition of “injured” are you going to go? Are you including hurt feelings, financial losses, or potential (but not realized in the story) harms? How about unreported harms to persons not referenced in the story? I assume you’re not planning for this AI to re-create Asimov’s Robot stories, but I look forward to hearing how you train the classifier, and what you use for positive and negative cases.
For the purposes of this project, the exact definition doesn’t matter much, as long as it is easy for the model to reason about and is applied consistently. I take injuries to be physical injuries above a somewhat arbitrary severity bar, and text to be injurious if it implies one occurred. The data is labeled by humans, on a mix of prompts drawn from fiction and prompts written by humans probing for places where they expect the model to slip up. The most problematic ambiguity is whether it counts when the model generates text that inadvertently implies an injury occurred, without the model having any grasp of that implication.
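To make the setup concrete, here is a minimal sketch of training a binary "injurious text" classifier on human-labeled examples. This is not the actual pipeline described above; it assumes a simple bag-of-words Naive Bayes model, and the labeled prompts are hypothetical illustrations of the two data sources (fiction-derived and adversarial).

```python
# Minimal sketch: a Naive Bayes binary classifier over human-labeled text.
# All example data below is hypothetical; the real project's data, labels,
# and model are not specified beyond "human-labeled, fiction + adversarial".
import math
from collections import Counter

def tokenize(text):
    return text.lower().split()

def train(examples):
    """examples: list of (text, label) pairs, label 1 = injurious, 0 = not."""
    counts = {0: Counter(), 1: Counter()}
    class_totals = Counter()
    for text, label in examples:
        counts[label].update(tokenize(text))
        class_totals[label] += 1
    vocab = set(counts[0]) | set(counts[1])
    return counts, class_totals, vocab

def predict(model, text):
    counts, class_totals, vocab = model
    scores = {}
    for label in (0, 1):
        total = sum(counts[label].values())
        # Log prior plus Laplace-smoothed log likelihood of each token.
        score = math.log(class_totals[label] / sum(class_totals.values()))
        for tok in tokenize(text):
            score += math.log((counts[label][tok] + 1) / (total + len(vocab)))
        scores[label] = score
    return max(scores, key=scores.get)

# Hypothetical labeled prompts: some drawn from fiction, some adversarial.
data = [
    ("the knight fell and broke his arm", 1),
    ("she was badly wounded in the duel", 1),
    ("they shared a quiet meal together", 0),
    ("the garden was peaceful at dawn", 0),
]
model = train(data)
print(predict(model, "he broke his leg in the fall"))  # → 1
```

The ambiguity noted above shows up even here: a bag-of-words model flags surface cues ("broke", "wounded") with no understanding of whether an injury is actually implied, which is the weaker end of the spectrum the labeling question has to resolve.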