Restricting “comment space” to what a prompted LLM approves worries me a little: I imagine a user tweaking their comment (perhaps flagged as a false positive) until it fits the mold of the LLM, commenters gradually internalizing what the LLM likes and dislikes, and the comment section ending up filtered through the lens of whatever LLM is doing the moderation. The thought of such a comment section does not bring joy.
Is there a post that reviews prior art on LLM moderation and its impacts? I think that would be useful before making a decision.
I mean, there is ~no prior art here because humanity just invented LLMs last ~Tuesday.
Okay, j/k, there may be some. But I think you’re imagining “the LLM is judging whether the content is good” as opposed to “the LLM is given formulaic rules to evaluate posts against, and it returns ‘yes/no/maybe’ for each of those evaluations.”
The question here is more “is it possible to construct rules that are useful?”
(In the conversation that generated this idea, one person noted “on my YouTube channel, it’d be pretty great if I could just identify any comment that mentions someone’s appearance and have it automoderated as ‘off topic’.” If we were trying this on a LessWrong-like community, the rules I’d want to implement would probably be subtler, and I don’t know whether LLMs could actually pull them off. A rough sketch of the per-rule setup is below.)
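To make the “formulaic rules” framing concrete, here is a minimal sketch of what per-rule yes/no/maybe evaluation could look like. This assumes the `openai` Python client; the rule wording, model name, prompt text, and example comment are all illustrative placeholders, not an actual moderation setup anyone has built.

```python
# Hypothetical sketch: check a comment against a list of formulaic
# moderation rules, getting a yes/no/maybe verdict for each rule,
# rather than asking the LLM "is this comment good?".
from openai import OpenAI

client = OpenAI()

# Illustrative rules only; real ones would be community-specific.
RULES = [
    "Does the comment mention anyone's physical appearance?",
    "Does the comment attack the author rather than the argument?",
]

def evaluate_comment(comment: str, rules: list[str] = RULES) -> dict[str, str]:
    """Return a verdict ('yes', 'no', or 'maybe') for each rule."""
    verdicts = {}
    for rule in rules:
        response = client.chat.completions.create(
            model="gpt-4o-mini",  # placeholder model choice
            temperature=0,
            messages=[
                {"role": "system",
                 "content": "You are a moderation assistant. Answer with "
                            "exactly one word: yes, no, or maybe."},
                {"role": "user",
                 "content": f"Rule: {rule}\n\nComment:\n{comment}"},
            ],
        )
        answer = response.choices[0].message.content.strip().lower()
        # Anything outside the expected vocabulary gets escalated to "maybe"
        # so a human can look at it.
        verdicts[rule] = answer if answer in {"yes", "no", "maybe"} else "maybe"
    return verdicts

# Example: a comment that might get flagged under the "appearance" rule
# from the YouTube example above.
print(evaluate_comment("Nice video, but why is your hair like that?"))
```

The point of structuring it this way is that the moderator, not the model, decides what the rules are; the LLM is only asked narrow classification questions, and “maybe” verdicts can be routed to a human.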