iirc @habryka had a prompt that worked pretty well to get GPT (I forget which version) to become a lesswronger that could yell at you for bad takes
I also have a prompt I’m pretty happy with, would be enthusiastic to trade. Prompts aren’t on the same level as a well-chosen tuning dataset, though. turns out “well chosen” is a pretty high bar, and tuning is more expensive than I realized.
I wonder how hard it would be to iteratively write a prompt that can consistently mimic your judgment in distinguishing between good takes and bad takes for the purposes of curating a well-chosen tuning dataset. I’d expect not that hard.
yeah, preprocessing with an AI’s help will be my next go at it. likely a bit more than just curation, also various data cleaning and reformatting. I might try very low epoch count, very high learning rate, minimum lora size, to see if that can get me anywhere. Also going to try local tuning.
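roughly the shape of run I have in mind, as a sketch with Hugging Face peft/transformers; the model name and every number here are placeholders to sweep, not settings I'm endorsing:

```python
# Sketch: "very low epoch count, very high learning rate, minimum lora size" local tune.
# Assumes a causal LM plus an already-curated text dataset; every value is a placeholder.
from transformers import AutoModelForCausalLM, AutoTokenizer, TrainingArguments
from peft import LoraConfig, get_peft_model

model_name = "meta-llama/Llama-3.1-8B"  # placeholder: whatever fits on local hardware
model = AutoModelForCausalLM.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)

lora = LoraConfig(
    r=2,                                  # minimum lora size: rank 2 keeps the update tiny
    lora_alpha=4,
    target_modules=["q_proj", "v_proj"],  # attention-only adapters, smaller still
    lora_dropout=0.0,
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora)
model.print_trainable_parameters()  # sanity check that the trainable update really is tiny

args = TrainingArguments(
    output_dir="lw-tune",
    num_train_epochs=1,      # very low epoch count
    learning_rate=1e-3,      # very high relative to the usual ~1e-4 for LoRA
    per_device_train_batch_size=1,
    gradient_accumulation_steps=8,
    logging_steps=10,
)
# training loop omitted: hand model, args, and the tokenized dataset to
# transformers.Trainer (or trl's SFTTrainer if installed)
```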
my suspicion is that in-context learning is stronger, but I currently expect to prefer fine-tuning if it works at all, because it can force lower dimensional update (only in the low millions rather than gbs of kv cache) and I’d expect that to generalize better. probably there are like 30 papers on this, actually, brb
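back-of-envelope for that "low millions vs gbs" claim, assuming a generic 7B-class dense model (32 layers, hidden size 4096, fp16); the exact architecture shifts the numbers but not the orders of magnitude:

```python
# Rough sizes: a rank-4 LoRA update vs. the KV cache for a long prompt.
# Generic 7B-class assumptions (32 layers, hidden 4096, fp16); GQA models cache
# several times less K/V, but it stays GB-scale for long contexts.
layers, hidden, bytes_per_value = 32, 4096, 2

# LoRA on q_proj and v_proj: each adapted matrix gets A (hidden x r) and B (r x hidden).
r = 4
lora_params = layers * 2 * 2 * hidden * r
print(f"LoRA params: {lora_params / 1e6:.1f}M "
      f"(~{lora_params * bytes_per_value / 1e6:.0f} MB)")   # ~2.1M params, ~4 MB

# KV cache: 2 (K and V) * hidden values per layer, per token of context.
tokens = 100_000
kv_bytes = 2 * layers * hidden * bytes_per_value * tokens
print(f"KV cache for {tokens:,} tokens: {kv_bytes / 1e9:.1f} GB")  # ~52 GB
```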
edit: hmmm...
hmmmmmmmmn
hmmm
wat that’s not alignment
ok looks tasty, read me further?
conclusion: We compare OPT models (Zhang et al., 2022) ranging from 125M to 30B parameters on three classification datasets across two tasks. We find that for both approaches, performance improves as models become larger. For the largest models we experiment with (OPT-30B), we find that FT outperforms ICL on both in-domain and OOD performance and even improves further as we train on more data. However, our results also demonstrate that the performance of both FT and ICL exhibits high variance...
pac bounds for icl?? what’s the gotcha. but in the intro they say
both Min et al. (2022) and ? showed that replacing the labels provided in the in-context training examples with random task labels barely affects the performance of in-context learning, implying that the in-context learning mechanism is more about identifying the task than about learning it. Similarly, both Webson & Pavlick (2022) and Lampinen et al. (2022) studied whether LMs truly understand the text of the in-context examples, and found that irrelevant text that mimic the task can have a similar effect as learning with true training examples
okay so yeah my intuition is updated a bit from this paper browsing session but the main thing I’m getting so far is “lol idk try it and see what sticks, both of these things are unreliable and sometimes one is better than the other”
it can do what now?
okay, so icl is good and then bad, whereas sft is bad and then good? thanks for clearing that up for me. looking at their charts… weird, they’re clustering multiple-choice questions by answer letter? their charts of clustering and intrinsic dimension sure would be cool if I was reading closely enough to be sure they meant anything
https://arxiv.org/abs/2502.04580 “Our findings reveal a striking dichotomy: while ICL initially matches the efficiency of a Bayes optimal estimator, its efficiency significantly deteriorates in long context” sure yes makes sense, this feels overwrought but probably true
I think the thing I want out of SFT is probably just that it works to change implicit priors in the model without the model having verbal reference to the things changed, so it can bypass a bunch of verbal tics and go directly for imitation
Yeah “try it and see” is the gold standard. I do know that for stuff which boils down to “monitor for patterns in text data that is too large to plausibly be examined by a team of humans we could afford to hire” I’ve been favoring the approach of
Grab 100 random data points, run the most obvious possible prompt on them to get reasoning + label(s) + confidence
Spot check the high confidence ones to make sure you’re not getting confident BS out of the model (you can alternatively start by writing two very different prompts for the same labeling task and see where the answers differ, that will also work)
Look at the low-confidence ones, see if the issue is your labeling scheme / unclear prompt / whatever—usually it’s pretty obvious where the model is getting confused
Tweak your prompt, comparing new labels to old. Examine any data points that have changed—often your prompt change fixed the original problems but caused new ones to surface. Note that for this step you want your iteration time to be under 10 minutes per iteration and ideally under 10 seconds from “hit enter key” to “results show up on screen”. Any of the major LLMs can trivially vibe code you an acceptable spreadsheet-like interface for this, including hooking up the tool calling API to get structured data out of your prompts for easy inspection.
Once you’re reasonably happy with the performance on 100 samples, bump to 1000, run all 1000 datapoints against all of the prompts you’ve iterated on so far, and focus on the datapoints which got inconsistent results between prompts or had a low-confidence answer on the last one (a rough sketch of this loop is below)
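A rough sketch of what that loop can look like, assuming the OpenAI Python client as the backend and the tool calling API for structured output; the model name, label set, and confidence threshold are all just illustrative:

```python
# Sketch of the labeling loop: one structured label per (prompt, datapoint) pair,
# then surface disagreements and low-confidence answers for human review.
# Assumes the OpenAI Python client; model, labels, and threshold are illustrative.
import json
from openai import OpenAI

client = OpenAI()

LABEL_TOOL = {
    "type": "function",
    "function": {
        "name": "record_label",
        "description": "Record reasoning, label, and confidence for one data point.",
        "parameters": {
            "type": "object",
            "properties": {
                "reasoning": {"type": "string"},
                "label": {"type": "string", "enum": ["good_take", "bad_take"]},
                "confidence": {"type": "number", "minimum": 0, "maximum": 1},
            },
            "required": ["reasoning", "label", "confidence"],
        },
    },
}

def label_one(prompt: str, datapoint: str) -> dict:
    """Run one prompt on one data point, forcing structured output via tool calling."""
    resp = client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder
        messages=[
            {"role": "system", "content": prompt},
            {"role": "user", "content": datapoint},
        ],
        tools=[LABEL_TOOL],
        tool_choice={"type": "function", "function": {"name": "record_label"}},
    )
    return json.loads(resp.choices[0].message.tool_calls[0].function.arguments)

def review_queue(datapoints: list[str], prompts: list[str], low_conf: float = 0.7) -> list[dict]:
    """Label every point with every prompt; return the ones worth human attention:
    the prompts disagree, or the newest prompt is unsure."""
    flagged = []
    for dp in datapoints:
        results = [label_one(p, dp) for p in prompts]
        if len({r["label"] for r in results}) > 1 or results[-1]["confidence"] < low_conf:
            flagged.append({"datapoint": dp, "results": results})
    return flagged
```

The same review_queue call works unchanged when you bump from 100 to 1000 samples or add another prompt variant; only the inputs grow.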
Once I’m happy with the performance on a sample of 1000 I rarely encounter major issues with the prompt, other than ones I was already aware of and couldn’t be bothered to fix (the usual case for that is “I realize that the data I’m asking the model to label doesn’t contain all decision-relevant information, and that when I’m labeling I sometimes have to fetch extra data, and I don’t really want to build that infrastructure right now”, so I’ll call it “good enough” and ship it, or “not good enough” and abandon it).
TBH I think that most of the reason this method works for me is that it’s very effective at shoving the edge cases to me early while not wasting a bunch of my attention on the stuff that is always easy.
Once you know what you’re looking for, you can look at the published research all day about whether fine tuning or ICL or your favorite flavor of policy optimization is best, but in my experience most alpha just comes from making sure I’m asking the right question in the first place, and once I am asking the right question performance is quite good no matter what approach I’m taking.
willing to share prompts?