We spent a lot of time doing exactly that in an automated manner. It's too time-consuming to filter that many samples manually, but we did several iterations of the following loop:
1. Comprehensively define quality criteria.
2. Generate a dataset.
3. Run a second instance of Claude to rate samples against the quality criteria.
4. Drop the bad ones.
5. Manually read a subset of what's left and identify remaining failure modes.
6. Update the quality criteria for the next iteration.
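The rate-and-drop steps of that loop can be sketched roughly as below. The criteria list, the `rate_sample` stub, and the threshold are illustrative assumptions; in a real pipeline `rate_sample` would send the sample plus the criteria to a second grader model and parse its verdict.

```python
# Hypothetical sketch of the filter loop described above.
QUALITY_CRITERIA = [
    "answers the question asked",
    "factually consistent with the source",
    "no refusal or meta-commentary",
]

def rate_sample(sample: str) -> float:
    """Stand-in for the LLM judge: return a 0-1 quality score.

    A real implementation would prompt a grader model with the
    sample and QUALITY_CRITERIA, then parse a numeric score from
    its response. Here, a toy heuristic keeps the sketch runnable:
    very short samples score low.
    """
    return min(len(sample) / 100, 1.0)

def filter_dataset(samples: list[str], threshold: float = 0.5) -> list[str]:
    """Drop samples whose judged quality falls below the threshold."""
    return [s for s in samples if rate_sample(s) >= threshold]
```

After each filtering pass, the surviving subset is what you read manually to spot failure modes the criteria missed, which then feed back into `QUALITY_CRITERIA` for the next iteration.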
I completely agree with you — I personally have filtered a thousand samples manually, and it takes a while. Finding a good human-LLM centaur solution is very helpful. Sounds like you know all the tricks at least as well as I do.
Thank you!