My first thought on “filter that for accuracy somehow” was to generate, say, 5000 of them, sit down and laboriously read them all, then throw away (or even edit) the obviously wrong ones. Not exactly an easy technique for others to replicate, but often a reasonably good way to get an effective training set: prompt engineer, generate, filter manually, retrain smaller/simpler model. Sometimes you can even rinse and repeat on this approach, if your simpler model is generalizing usefully, or at least take a second look at cases where the trained model disagrees with your manual choice, and see if you were wrong and it’s right — though obviously that gets riskier the more you let earlier versions of the model have input into the training data of later ones.
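Concretely, that "filter manually, retrain, check disagreements" loop might look something like the rough sketch below; the toy samples, the labels, and the TF-IDF + logistic regression stand-in for the "smaller/simpler model" are all placeholders, not anything from an actual workflow:

```python
# Hypothetical sketch of the "generate, filter by hand, retrain, re-check" loop.
# The samples, labels, and model choice below are placeholders.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_predict

# Stand-ins for generated samples and the keep/discard decisions from a manual pass.
samples = [
    "The capital of France is Paris.",
    "The capital of France is Lyon.",
    "Water boils at 100 degrees Celsius at sea level.",
    "Water boils at 50 degrees Celsius at sea level.",
]
keep = [1, 0, 1, 0]  # 1 = kept after manual review, 0 = thrown away

# Train the simpler model on the manual decisions.
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(samples)

# Use held-out (cross-validated) predictions so disagreements are informative.
predicted = cross_val_predict(LogisticRegression(), X, keep, cv=2)

# Second look: wherever the model disagrees with the manual call, re-read the sample.
for text, human, model in zip(samples, keep, predicted):
    if human != model:
        print(f"re-review (human={human}, model={model}): {text!r}")
```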
We spent a lot of time doing exactly that in an automated manner. It’s too time-consuming to filter that many samples manually, but we did several iterations of the following loop (the rating step is sketched in code after the list):
comprehensively define quality criteria
generate a dataset
run a second instance of Claude to rate samples by quality criteria
drop the bad ones
manually read a subset of what’s left and identify remaining failure modes
update the quality criteria for the next iteration
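In code, the rating-and-dropping steps might look something like the minimal sketch below, assuming the Anthropic Python SDK; the criteria text, model name, prompt wording, and 1-5 cutoff are illustrative placeholders rather than the actual setup:

```python
# A minimal sketch of the "rate with a second Claude instance, drop the bad ones"
# steps, assuming the Anthropic Python SDK. The criteria text, model name, prompt
# wording, and score threshold are illustrative placeholders, not the real setup.
from anthropic import Anthropic

client = Anthropic()  # expects ANTHROPIC_API_KEY in the environment

QUALITY_CRITERIA = """\
1. Factually accurate.
2. Follows the requested output format.
3. No refusals or hedging where a direct answer is expected."""

def rate_sample(sample: str) -> int:
    """Ask a separate Claude instance to score one generated sample from 1 to 5."""
    response = client.messages.create(
        model="claude-sonnet-4-20250514",  # placeholder model name
        max_tokens=5,
        messages=[{
            "role": "user",
            "content": (
                "Rate the following sample from 1 to 5 against these criteria:\n"
                f"{QUALITY_CRITERIA}\n\nSample:\n{sample}\n\n"
                "Reply with a single digit and nothing else."
            ),
        }],
    )
    return int(response.content[0].text.strip())

def filter_dataset(samples: list[str], threshold: int = 4) -> list[str]:
    """Keep only the samples the rater scores at or above the threshold."""
    return [s for s in samples if rate_sample(s) >= threshold]
```

The "manually read a subset" step then happens on whatever the filter keeps, and whatever failure modes turn up feed back into the criteria for the next iteration.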
I completely agree with you — I personally have filtered a thousand samples manually, and it takes a while. Finding a good human-LLM centaur solution is very helpful. Sounds like you know all the tricks at least as well as I do.
Thank you!