Yeah, “try it and see” is the gold standard. I do know that for stuff which boils down to “monitor for patterns in text data that is too large to plausibly be examined by a team of humans we could afford to hire,” I’ve been favoring this approach:
Grab 100 random data points, run the most obvious possible prompt on them to get reasoning + label(s) + confidence
Spot-check the high-confidence ones to make sure you’re not getting confident BS out of the model (alternatively, start by writing two very different prompts for the same labeling task and see where the answers differ; that works too)
Look at the low-confidence ones, see if the issue is your labeling scheme / unclear prompt / whatever—usually it’s pretty obvious where the model is getting confused
Tweak your prompt, comparing new labels to old. Examine any data points whose labels changed: often the prompt change fixed the original problems but caused new ones to surface. For this step you want each iteration to take under 10 minutes, and ideally under 10 seconds from “hit enter key” to “results show up on screen”. Any of the major LLMs can trivially vibe code you an acceptable spreadsheet-like interface for this, including hooking up the tool-calling or structured-output API to get structured data out of your prompts for easy inspection (rough sketch of that call below this list)
Once you’re reasonably happy with the performance on 100 samples, bump to 1000: run all 1000 data points against every prompt you’ve iterated on so far, and focus on the data points that got inconsistent labels between prompts or a low-confidence answer from the latest one (sketch of that triage step below as well)
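To make the first few steps concrete, here’s a minimal sketch of the labeling call: one prompt, one sample of texts, reasoning + label + confidence dumped to JSONL so you can eyeball it. I’m assuming the OpenAI Python SDK with JSON-mode output, and the escalation-risk task, the model name, and the file names are all hypothetical stand-ins for whatever you’re actually monitoring:

```python
# Rough sketch: run one prompt over a sample and record reasoning + label +
# confidence as JSONL, so you can spot-check it in a spreadsheet-ish view.
# Assumes the OpenAI Python SDK; any provider with structured output works.
import json
from openai import OpenAI

client = OpenAI()

# Hypothetical labeling task; swap in whatever you're actually monitoring for.
PROMPT_V1 = """You are labeling customer messages for escalation risk.
Respond in JSON with keys "reasoning", "label" (one of "escalate", "ignore"),
and "confidence" (a number from 0 to 1).

Message:
{text}"""

def label_one(text: str, prompt_template: str) -> dict:
    resp = client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder model name
        messages=[{"role": "user", "content": prompt_template.format(text=text)}],
        response_format={"type": "json_object"},  # JSON mode; a tool call also works
    )
    return json.loads(resp.choices[0].message.content)

def label_sample(texts: list[str], prompt_template: str, out_path: str) -> list[dict]:
    rows = []
    with open(out_path, "w") as f:
        for t in texts:
            row = {"text": t, **label_one(t, prompt_template)}
            rows.append(row)
            f.write(json.dumps(row) + "\n")
    return rows

# e.g. rows = label_sample(my_100_random_texts, PROMPT_V1, "labels_v1.jsonl")
```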
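And a sketch of the triage step from the last two items: given saved labels from two prompt versions over the same data points, pull out only the rows a human should actually look at, i.e. anything where the prompts disagree or the newest confidence is low. The file names and the 0.7 threshold are made up:

```python
# Rough sketch: compare labels across prompt versions and surface the data
# points worth a human look (label changed, or latest confidence is low).
import json

def load_rows(path: str) -> dict[str, dict]:
    # keyed by the text itself; use a stable id field if your data has one
    with open(path) as f:
        return {row["text"]: row for row in map(json.loads, f)}

def needs_review(old_path: str, new_path: str, conf_threshold: float = 0.7) -> list[dict]:
    old, new = load_rows(old_path), load_rows(new_path)
    flagged = []
    for text, new_row in new.items():
        old_row = old.get(text)
        changed = old_row is not None and old_row["label"] != new_row["label"]
        unsure = new_row["confidence"] < conf_threshold
        if changed or unsure:
            flagged.append({
                "text": text,
                "old_label": old_row["label"] if old_row else None,
                "new_label": new_row["label"],
                "confidence": new_row["confidence"],
            })
    return flagged

# e.g. for row in needs_review("labels_v1.jsonl", "labels_v2.jsonl"): print(row)
```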
Once I’m happy with the performance on a sample of 1000, I rarely encounter major issues with the prompt, other than ones I was already aware of and couldn’t be bothered to fix. The usual case there is realizing that the data I’m asking the model to label doesn’t contain all the decision-relevant information, that when I label by hand I sometimes have to fetch extra data, and that I don’t really want to build that infrastructure right now, so I either call it “good enough” and ship it or “not good enough” and abandon it.
TBH I think most of the reason this method works for me is that it’s very effective at shoving the edge cases in front of me early, while not wasting a bunch of my attention on the stuff that is always easy.
Once you know what you’re looking for, you can read the published research all day about whether fine-tuning or ICL or your favorite flavor of policy optimization is best, but in my experience most of the alpha comes from making sure I’m asking the right question in the first place. Once I am asking the right question, performance is quite good no matter what approach I’m taking.