I think the current LLM detection system is flawed, so I propose a bet. If the proposal is accepted, I will write ten three-paragraph LessWrong-style posts: 5 AI-generated and 5 normal. If the LessWrong LLM detection system correctly assesses 8/10 or more, I lose the bet. If the detection system correctly assesses fewer than 8/10, I win the bet. The LLM posts will be fully AI-generated, but I will be able to provide the model a style sheet and instruct it to mimic my style. Additionally, I can give up to 2 prompts of feedback to modify the post. The human posts would all be written by me, but would try to follow the same style sheet and use Grammarly for spell-checking. Finally, the LLM detection system must run without human intervention. Even if a human judge can “feel” that something is LLM-written, this is not what the experiment is designed to measure. I would be willing to wager up to $500 on this.
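For a rough sense of what the 8/10 threshold implies, here is a minimal sketch of the arithmetic. It assumes, as a simplification that is not part of the proposal, that the detector judges each post independently with the same per-post accuracy; `prob_at_least` is a hypothetical helper name, and the accuracy values are illustrative guesses, not measurements of the real system.

```python
from math import comb


def prob_at_least(k: int, n: int, p: float) -> float:
    """P(X >= k) for X ~ Binomial(n, p): chance of at least k correct calls out of n."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))


if __name__ == "__main__":
    # Assumed per-post accuracies (illustrative only).
    for accuracy in (0.5, 0.7, 0.8, 0.9):
        detector_wins = prob_at_least(8, 10, accuracy)
        print(f"per-post accuracy {accuracy:.0%}: "
              f"detector gets 8/10 or more with probability {detector_wins:.3f}")
```

Under these assumptions, a detector that is guessing at random (50% per post) would clear the 8/10 bar about 5.5% of the time (56 of 1024 equally likely outcomes), so the bet is not a pure test of whether the detector beats chance.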
I think you could probably succeed at this, but you’re testing something different from where the system’s value comes from. While the anti-LLM policy is trying to prevent all LLM-written posts, most of the value comes from preventing low-effort LLM-written posts. If you can get a high-effort LLM-written post past the filter, then that’s sort of bad, but it doesn’t mean the filter isn’t doing its job.
Also, I’d be worried about the human side sandbagging by deliberately sounding LLMish.
I think you could easily fix this by ensuring the human posts adhere to the same non-LLM style guide and (if both sides are paranoid) having a third party judge whether sandbagging occurred.
Then the policy should be changed to target what can actually be enforced. Unenforceable policies on morally neutral behaviors only punish good actors and promote bad ones.
It’s not a neutral behavior though. It’s much easier to spam the site with LLM-written posts, and obviously LLM-written posts are bad for reasons that are related to why it’s obvious they’re LLM-written. If you prevent people from spamming the site with low-effort slop, and some people still use LLMs in a high-effort way to write better posts (even though this is still against the rules), then that’s still a win.
I think we agree that low-effort, unreviewed LLM posts are bad, but the policy should target those rather than high-effort cyborg writing. I don’t like that posts which lightly use LLMs need to use these weird LLM boxes.