Mostly because I’m not sure whether he’s reinforcing people not to answer with LLM output, or reinforcing them not to tell him when an answer comes from an LLM.
Which behavior he’s reinforcing isn’t entirely up to him; it depends on the learner as well. Take an analogy: Alice tells Bob “don’t steal”, and Bob interprets it as “don’t get caught stealing”. Who’s in the wrong here? Bob, of course. He’s the one choosing to ignore the intent of the request, like a misaligned AI. The same goes for people who interpret “don’t post AI slop” as “get better at passing off AI slop as human”. How such people can become genuinely aligned is a good question, and I’m not sure it can be done reliably with reinforcement, because all reinforcement has this kind of problem.