This seems like a reasonable approach. My main worry is that if the probability of sampling good advice is very low (say, less than 1 in a billion), then good advice will almost certainly never be sampled during training, so training would produce the same agent as if the advice were never reported. This is essentially the “high variance” problem that you point out. In the case of plagiarism, it’s not clear how to come up with a proposal distribution that assigns non-negligible probability to good advice (i.e. a list of references to the plagiarized works).
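To put rough numbers on this worry, here is a minimal sketch of how likely it is that good advice shows up even once during training. It assumes each training step draws advice independently from a fixed proposal distribution that hits good advice with probability `p`; the function name and the sample counts are hypothetical choices for illustration, not part of the proposal being discussed.

```python
import math


def prob_good_advice_seen(p: float, n_samples: int) -> float:
    """Probability that at least one of n_samples independent draws is good advice.

    Assumes every draw succeeds independently with the same probability p.
    """
    if p >= 1.0:
        return 1.0
    # 1 - (1 - p)^n, written with log1p/expm1 so it stays accurate for tiny p.
    return -math.expm1(n_samples * math.log1p(-p))


if __name__ == "__main__":
    p = 1e-9  # the "1 in a billion" case from the comment above
    for n in (10**3, 10**6, 10**9):
        print(f"n = {n:>13,d}  P(good advice seen at least once) = "
              f"{prob_good_advice_seen(p, n):.3g}")
```

Under these assumptions the chance of ever seeing good advice stays close to `n_samples * p` until the number of samples is on the order of `1 / p`, which is why a better proposal distribution matters more than simply training longer.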