LLM Automoderation Idea (we could try this on LessWrong, but it feels like something that’s more naturally part of a forum designed from the ground up around it)
Authors can create moderation guidelines, which get enforced by LLMs that read new comments and have access to some user metadata. Comments get deleted / etc. by the LLM. (You can also have tools other than deletion, like collapsing comments by default.)
The moderation guidelines are public. It’s the commenter’s job to write comments that pass.
Authors pay the fees for the LLMs doing the review.
(I’m currently thinking about this more like “I am interested in seeing what equilibria a setup like this would end up with” than like “this is a good idea”)
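A minimal sketch of what that review step could look like, assuming a placeholder call_llm(prompt) helper standing in for whatever model API gets used; the metadata fields and the JSON verdict format are invented for illustration, not a spec:

```python
# Hypothetical sketch of the automoderation flow described above.
# `call_llm` is a stand-in for whatever chat-completion API the forum uses;
# the metadata fields and verdict schema are made up for illustration.
import json
from dataclasses import dataclass

@dataclass
class Comment:
    body: str
    author_karma: int       # example of "some user metadata"
    account_age_days: int

def moderate(guidelines: str, comment: Comment, call_llm) -> dict:
    prompt = (
        "You are an automoderator. Apply ONLY the author's public guidelines below.\n\n"
        f"Guidelines:\n{guidelines}\n\n"
        f"Commenter metadata: karma={comment.author_karma}, "
        f"account_age_days={comment.account_age_days}\n\n"
        f"Comment:\n{comment.body}\n\n"
        'Reply with JSON: {"action": "approve" | "collapse" | "delete", "reason": "..."}'
    )
    raw = call_llm(prompt)
    try:
        return json.loads(raw)
    except json.JSONDecodeError:
        # If the model's output doesn't parse, fail open (or queue for a human).
        return {"action": "approve", "reason": "unparseable moderator output"}
```

Since the guidelines string is public, a commenter can read exactly what the moderator is being asked to enforce.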
rate limits seem like a must
but maybe they can be quite lax if a cheap model is used
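Rough arithmetic behind “quite lax if a cheap model is used”: a single review costs a small fraction of a cent, so even generous limits are affordable. The prices and token counts below are placeholder assumptions, not quotes for any particular model:

```python
# Back-of-envelope cost per moderated comment; every number here is an assumption.
PRICE_PER_M_INPUT_TOKENS = 0.30   # assumed $/1M input tokens for a cheap model
PRICE_PER_M_OUTPUT_TOKENS = 1.00  # assumed $/1M output tokens

tokens_in = 1500   # guidelines + metadata + comment (assumed)
tokens_out = 50    # short verdict (assumed)

cost = (tokens_in * PRICE_PER_M_INPUT_TOKENS
        + tokens_out * PRICE_PER_M_OUTPUT_TOKENS) / 1_000_000
print(f"~${cost:.6f} per review")  # ~$0.0005, i.e. roughly 2,000 reviews per dollar
```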
So you now need to pay before you post?
privacy concerns
and comments are disabled when you’re out of funds? natural consequence but lol.
Long comments that don’t pass on the first try will likely motivate the comment author to (a) jailbreak the moderator or (b) have another LLM rewrite the comment, rather than drafting a new one.
I would like it more if the forum paid for the model and got the cost back from elsewhere.
> and comments are disabled when you’re out of funds? natural consequence but lol.
There are a few ways you could do it. It occurs to me now that it could actually be the commenter’s job to pay via microtransactions, and maybe the author can tip back if they like it, Flattr-style. This also maybe solves the rate limits.
You could also just set it to “when you run out of money, everyone can comment without restriction.”
You could also have, like, everyone just pays a monthly subscription to participate. I think the above ideas are kinda cute tho.
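For concreteness, a toy version of the commenter-pays / tip-back accounting from the microtransaction option above; the fee, names, and balances are all invented:

```python
# Toy ledger for "commenter pays per review, author can tip back".
# All names and amounts are invented for illustration.
balances = {"some_commenter": 1.00, "post_author": 5.00}

REVIEW_FEE = 0.001  # assumed per-review cost passed on to the commenter

def charge_for_review(commenter: str) -> bool:
    """Debit the commenter for one LLM review; False means out of funds (no review, no comment)."""
    if balances[commenter] < REVIEW_FEE:
        return False
    balances[commenter] -= REVIEW_FEE  # paid whether or not the comment passes
    return True

def tip_back(author: str, commenter: str, amount: float) -> None:
    """Flattr-style: the post author rewards a comment they liked."""
    balances[author] -= amount
    balances[commenter] += amount
```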
> privacy concerns
I was imagining this for public-ish internet where I’d expect it to be digested for the next round of LLM training anyway.
> the commenter’s job to pay via microtransactions, and maybe the author can tip back if they like it, Flattr-style
Yes, I feel like it is worse than author or forum paying though, because of incentives. There are other possible ways like the commenter paying for failed comments and author/forum paying for those that passed.
Monthly subscription is also possible yeah. I had this in mind and swept it under “the forum paying for the model and getting the cost back from elsewhere”.
> privacy concerns
You misunderstood; I meant that some people probably don’t want their account to be traceable to their real identity, so any monetary transaction is problematic unless it’s crypto.
Restricting “comment space” to what a prompted LLM approves slightly worries me: I imagine a user tweaking their comment (which may have been flagged as a false positive) so that it fits the mold of the LLM; then commenters internalize what the LLM likes and doesn’t like, and the comment section ends up filtered through the lens of whatever LLM is doing moderation. The thought of such a comment section does not bring joy.
Is there a post that reviews prior art on the topic of LLM moderation and its impacts? I think that would be useful before taking a decision.
I mean there is ~no prior art here because humanity just invented LLMs last ~tuesday.
Okay, j/k, there may be some. But I think you’re imagining “the LLM is judging whether the content is good” as opposed to “the LLM is given formulaic rules to evaluate posts against, and it returns ‘yes/no/maybe’ for each of those evaluations.”
The question here is more “is it possible to construct rules that are useful?”
(in the conversation that generated this idea, one person noted “on my youtube channel, it’d be pretty great if I could just identify any comment that mentions someone’s appearance and have it automoderated as ‘off topic’”. If we were trying this on a LessWrong-like community, the rules I might want to try to implement would probably be subtler and I don’t know if LLMs could actually pull them off).
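To make “formulaic rules, each answered yes/no/maybe” concrete, here is a sketch built around the appearance-comment example above; the rule wording and the call_llm placeholder are assumptions, not a real moderation API:

```python
# Sketch of per-rule evaluation: each rule gets an independent yes/no/maybe verdict.
# The rules and the `call_llm` helper are placeholders.
RULES = {
    "off_topic_appearance": "Does the comment remark on anyone's physical appearance?",
    "no_name_calling": "Does the comment insult another user directly?",
}

def evaluate_rules(comment_body: str, call_llm) -> dict:
    verdicts = {}
    for rule_id, question in RULES.items():
        prompt = (
            f"Comment:\n{comment_body}\n\n"
            f"Question: {question}\n"
            "Answer with exactly one word: yes, no, or maybe."
        )
        answer = call_llm(prompt).strip().lower()
        verdicts[rule_id] = answer if answer in ("yes", "no", "maybe") else "maybe"
    return verdicts

# A moderation policy then maps verdicts to actions, e.g. collapse on "maybe",
# mark as off topic on a "yes" for the appearance rule.
```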
Kimi-K2 is probably a good model to try this with, it’s both cheap (pareto frontier of LMSYS ELO x cost) and relatively conducive to sanity (which matches my personal experience with it vs Claude models—the other main LLMs I use).
There exists at least one subscription API service for it (featherless.ai, though it’s a bit flaky), which may make cost considerations easier.