iirc @habryka had a prompt that worked pretty well to get GPT (I forget which version) to become a lesswronger that could yell at you for bad takes
I also have a prompt I’m pretty happy with, would be enthusiastic to trade. Prompts aren’t on the same level as a well-chosen tuning dataset, though. turns out “well chosen” is a pretty high bar, and tuning is more expensive than I realized.
I wonder how hard it would be to iteratively write a prompt that can consistently mimic your judgment in distinguishing between good takes and bad takes for the purposes of curating a well-chosen tuning dataset. I’d expect not that hard.
yeah, preprocessing with an AI’s help will be my next go at it. likely a bit more than just curation, also various data cleaning and reformatting. I might try very low epoch count, very high learning rate, minimum lora size, to see if that can get me anywhere. Also going to try local tuning.
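roughly the shape of run I have in mind, as a sketch with Hugging Face peft/transformers; the model name and every number here are placeholders to sweep, not settings I'm endorsing:

```python
# Sketch: "very low epoch count, very high learning rate, minimum lora size" local tune.
# Assumes a causal LM plus an already-curated text dataset; every value is a placeholder.
from transformers import AutoModelForCausalLM, AutoTokenizer, TrainingArguments
from peft import LoraConfig, get_peft_model

model_name = "meta-llama/Llama-3.1-8B"  # placeholder: whatever fits on local hardware
model = AutoModelForCausalLM.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)

lora = LoraConfig(
    r=2,                                  # minimum lora size: rank 2 keeps the update tiny
    lora_alpha=4,
    target_modules=["q_proj", "v_proj"],  # attention-only adapters, smaller still
    lora_dropout=0.0,
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora)
model.print_trainable_parameters()  # sanity check that the trainable update really is tiny

args = TrainingArguments(
    output_dir="lw-tune",
    num_train_epochs=1,      # very low epoch count
    learning_rate=1e-3,      # very high relative to the usual ~1e-4 for LoRA
    per_device_train_batch_size=1,
    gradient_accumulation_steps=8,
    logging_steps=10,
)
# training loop omitted: hand model, args, and the tokenized dataset to
# transformers.Trainer (or trl's SFTTrainer if installed)
```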
my suspicion is that in-context learning is stronger, but I currently expect to prefer fine-tuning if it works at all, because it can force lower dimensional update (only in the low millions rather than gbs of kv cache) and I’d expect that to generalize better. probably there are like 30 papers on this, actually, brb
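back-of-envelope for that "low millions vs gbs" claim, assuming a generic 7B-class dense model (32 layers, hidden size 4096, fp16); the exact architecture shifts the numbers but not the orders of magnitude:

```python
# Rough sizes: a rank-4 LoRA update vs. the KV cache for a long prompt.
# Generic 7B-class assumptions (32 layers, hidden 4096, fp16); GQA models cache
# several times less K/V, but it stays GB-scale for long contexts.
layers, hidden, bytes_per_value = 32, 4096, 2

# LoRA on q_proj and v_proj: each adapted matrix gets A (hidden x r) and B (r x hidden).
r = 4
lora_params = layers * 2 * 2 * hidden * r
print(f"LoRA params: {lora_params / 1e6:.1f}M "
      f"(~{lora_params * bytes_per_value / 1e6:.0f} MB)")   # ~2.1M params, ~4 MB

# KV cache: 2 (K and V) * hidden values per layer, per token of context.
tokens = 100_000
kv_bytes = 2 * layers * hidden * bytes_per_value * tokens
print(f"KV cache for {tokens:,} tokens: {kv_bytes / 1e9:.1f} GB")  # ~52 GB
```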
edit: hmmm...
hmmmmmmmmn
hmmm
wat that’s not alignment
ok looks tasty, read me further?
conclusion: We compare OPT models (Zhang et al., 2022) ranging from 125M to 30B parameters on three classification datasets across two tasks. We find that for both approaches, performance improves as models become larger. For the largest models we experiment with (OPT-30B), we find that FT outperforms ICL on both in-domain and OOD performance and even improves further as we train on more data. However, our results also demonstrate that the performance of both FT and ICL exhibits high variance...
pac bounds for icl?? what’s the gotcha. but in the intro they say
both Min et al. (2022) and ? showed that replacing the labels provided in the in-context training examples with random task labels barely affects the performance of in-context learning, implying that the in-context learning mechanism is more about identifying the task than about learning it. Similarly, both Webson & Pavlick (2022) and Lampinen et al. (2022) studied whether LMs truly understand the text of the in-context examples, and found that irrelevant text that mimic the task can have a similar effect as learning with true training examples
okay so yeah my intuition is updated a bit from this paper browsing session but the main thing I’m getting so far is “lol idk try it and see what sticks, both of these things are unreliable and sometimes one is better than the other”
it can do what now?
okay, so icl is good and then bad, whereas sft is bad and then good? thanks for clearing that up for me. looking at their charts… weird, they’re clustering multiple-choice questions by answer letter? their charts of clustering and intrinsic dimension sure would be cool if I was reading closely enough to be sure they meant anything
https://arxiv.org/abs/2502.04580 “Our findings reveal a striking dichotomy: while ICL initially matches the efficiency of a Bayes optimal estimator, its efficiency significantly deteriorates in long context” sure yes makes sense, this feels overwrought but probably true
I think the thing I want out of SFT is probably just that it works to change implicit priors in the model without the model having verbal reference to the things changed, so it can bypass a bunch of verbal tics and go directly for imitation
Yeah “try it and see” is the gold standard. I do know that for stuff which boils down to “monitor for patterns in text data that is too large to plausibly be examined by a team of humans we could afford to hire” I’ve been favoring the approach of
Grab 100 random data points, run the most obvious possible prompt on them to get reasoning + label(s) + confidence
Spot check the high confidence ones to make sure you’re not getting confident BS out of the model (you can alternatively start by writing two very different prompts for the same labeling task and see where the answers differ, that will also work)
Look at the low-confidence ones, see if the issue is your labeling scheme / unclear prompt / whatever—usually it’s pretty obvious where the model is getting confused
Tweak your prompt, comparing new labels to old. Examine any data points that have changed—often your prompt change fixed the original problems but caused new ones to surface. Note that for this step you want your iteration time to be under 10 minutes per iteration and ideally under 10 seconds from “hit enter key” to “results show up on screen”. Any of the major LLMs can trivially vibe code you an acceptable spreadsheet-like interface for this, including hooking up the tool calling API to get structured data out of your prompts for easy inspection.
Once you’re reasonably happy with the performance on 100 samples, bump to 1000, run all 1000 datapoints against all of the prompts you’ve iterated on so far, and focus on the datapoints which got inconsistent results between prompts or had a low-confidence answer on the last one (a rough sketch of this loop is below)
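A rough sketch of what that loop can look like, assuming the OpenAI Python client as the backend and the tool calling API for structured output; the model name, label set, and confidence threshold are all just illustrative:

```python
# Sketch of the labeling loop: one structured label per (prompt, datapoint) pair,
# then surface disagreements and low-confidence answers for human review.
# Assumes the OpenAI Python client; model, labels, and threshold are illustrative.
import json
from openai import OpenAI

client = OpenAI()

LABEL_TOOL = {
    "type": "function",
    "function": {
        "name": "record_label",
        "description": "Record reasoning, label, and confidence for one data point.",
        "parameters": {
            "type": "object",
            "properties": {
                "reasoning": {"type": "string"},
                "label": {"type": "string", "enum": ["good_take", "bad_take"]},
                "confidence": {"type": "number", "minimum": 0, "maximum": 1},
            },
            "required": ["reasoning", "label", "confidence"],
        },
    },
}

def label_one(prompt: str, datapoint: str) -> dict:
    """Run one prompt on one data point, forcing structured output via tool calling."""
    resp = client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder
        messages=[
            {"role": "system", "content": prompt},
            {"role": "user", "content": datapoint},
        ],
        tools=[LABEL_TOOL],
        tool_choice={"type": "function", "function": {"name": "record_label"}},
    )
    return json.loads(resp.choices[0].message.tool_calls[0].function.arguments)

def review_queue(datapoints: list[str], prompts: list[str], low_conf: float = 0.7) -> list[dict]:
    """Label every point with every prompt; return the ones worth human attention:
    the prompts disagree, or the newest prompt is unsure."""
    flagged = []
    for dp in datapoints:
        results = [label_one(p, dp) for p in prompts]
        if len({r["label"] for r in results}) > 1 or results[-1]["confidence"] < low_conf:
            flagged.append({"datapoint": dp, "results": results})
    return flagged
```

The same review_queue call works unchanged when you bump from 100 to 1000 samples or add another prompt variant; only the inputs grow.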
Once I’m happy with the performance on a sample of 1000 I rarely encounter major issues with the prompt, other than ones I was already aware of and couldn’t be bothered to fix (the usual case for that is “I realize that the data I’m asking the model to label doesn’t contain all decision-relevant information, and that when I’m labeling I sometimes have to fetch extra data, and I don’t really want to build that infrastructure right now”, so I’ll call it “good enough” and ship it, or “not good enough” and abandon it).
TBH I think that most of the reason this method works for me is that it’s very effective at shoving the edge cases to me early while not wasting a bunch of my attention on the stuff that is always easy.
Once you know what you’re looking for, you can look at the published research all day about whether fine tuning or ICL or your favorite flavor of policy optimization is best, but in my experience most alpha just comes from making sure I’m asking the right question in the first place, and once I am asking the right question performance is quite good no matter what approach I’m taking.
willing to share prompts?