When fine-tuning a model, creating a dataset, or running an eval, save and look at some of your data to understand what the model is learning / should learn.
If your infra (e.g. wandb) makes this mildly annoying, don’t rely on wandb alone; just store json/jsonl locally.
Check that your results and metrics make sense. You should have information-rich plots (e.g. training curves, not just the final results), and make sure they are not “crazy” (e.g. does training degrade IID performance? Is the model shockingly bad? Do the metrics you are using have the right units? Do the metrics have the right behavior in some simple situations?).
Cache LLM responses! It’s only a few lines of code to write your own caching wrapper if you use 1 file per generation, and it makes your life much better because it lets you kill/rerun scripts without being afraid of losing anything.
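A minimal sketch of such a wrapper, using one file per generation keyed by a hash of the request (here `call_llm` and the `llm_cache` directory are placeholders for whatever client and location you use):

```python
import hashlib
import json
from pathlib import Path

CACHE_DIR = Path("llm_cache")

def cached_generate(prompt: str, call_llm, **params) -> str:
    """One file per generation: hash the request, reuse the file if it exists."""
    CACHE_DIR.mkdir(exist_ok=True)
    key = hashlib.md5(
        json.dumps({"prompt": prompt, **params}, sort_keys=True).encode()
    ).hexdigest()
    path = CACHE_DIR / f"{key}.json"
    if path.exists():
        return json.loads(path.read_text())["response"]
    response = call_llm(prompt, **params)  # the actual (slow, paid) API call
    path.write_text(json.dumps({"prompt": prompt, "response": response}))
    return response
```

Because each generation is its own file, a killed script loses at most one in-flight response, and reruns skip everything already on disk.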
Always have at least one experiment script that is slow and runs in the background, and an explore/plot script that you use for dataset exploration and plotting.
If you have n slow steps (e.g. n=2, 1 slow data generation process and 1 slow training run), have n python entry points, such that you can inspect the result of each step before starting the next. Once the pipeline works, you can always write a script that runs them one after another.
In general, use plotting / printing results as the main debugging tool: the evolution of train/eval loss during training should look fine, the transcripts should look reasonable, the distribution of probabilities should look reasonable, etc. If anything looks surprising, investigate by plotting / printing more stuff and looking at the places in the code/data that could have caused the issue.
Also make guesses about API cost / training speed, and check that your expectations are roughly correct near the start of the run (e.g. use tqdm on any slow for loop / gather). This lets you notice issues like putting things on the wrong device, not hitting the API cache, using a dataset that is too big for a de-risking experiment, etc.
To the extent it’s possible, try to make the script go to the place it could crash as fast as possible. For example, if a script crashes on the first backward pass after 2 minutes of tokenization, avoid always spending 2 minutes waiting for tokenization during debugging (e.g. tokenize on the fly or cache the tokenized dataset)
Do simple fast experiments on simple infra before doing big expensive experiments on complex infra
For fine-tuning / probing projects, I often get a lot of value out of doing a single-GPU run on a 1B model using a vanilla pytorch loop. That can often get signal in a few minutes and forces me to think through what different results could mean / whether there are boring reasons for surprising results.
If things go wrong, simplify even further: e.g. if I train a password-locked model and the with-password-perf is low, I would try training a model just on the strong trajectories
Use #%% python notebooks, they play nicely with copilot, code editors and git. And you need notebooks to do printing / plotting.
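For reference, a `#%%` notebook is just a plain `.py` file with cell markers, which editors like VS Code can run cell by cell while the file stays diffable in git (the data below is made up for illustration; in a real project the last cell would be a matplotlib plot):

```python
# %% generate fake results — each "# %%" marker starts a cell that an
# editor can run interactively, but the file is still ordinary python
records = [{"step": i, "loss": 2.0 * 0.97 ** i} for i in range(50)]

# %% look at a few records before plotting anything
for r in records[:3]:
    print(r)

# %% quick sanity check on the curve
final_loss = records[-1]["loss"]
print(f"final loss: {final_loss:.3f}")
```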
Use prompting experiments before doing SL. Do SL before doing RL. And you often don’t need SL or RL (e.g. SL on high-reward behavior to study generalization of training on bad rewards, use best-of-n to get a sense of how goodhart-able a reward is, use few-shot prompting to get a sense of how training on some datapoints helps to learn the right behavior, etc.)
Prompting experiments can happen in your notebook, they don’t always need to be clean and large-n.
Use as few libraries you don’t understand as possible. I like just using torch+huggingface+tqdm+matplotlib. For model-internals experiments, I often use either huggingface’s output_hidden_states or a pytorch hook. For saving data, I often just use json.
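A minimal sketch of the pytorch-hook approach, using a toy `nn.Sequential` as a stand-in for a real model (with a huggingface model you would register the hook on the relevant layer module, whose name varies by architecture):

```python
import torch
from torch import nn

# Toy stand-in for a model; replace with e.g. a specific transformer block.
model = nn.Sequential(nn.Linear(8, 8), nn.ReLU(), nn.Linear(8, 8))

activations = {}

def save_hook(name):
    def hook(module, inputs, output):
        activations[name] = output.detach()
    return hook

handle = model[1].register_forward_hook(save_hook("post_relu"))
with torch.no_grad():
    _ = model(torch.randn(4, 8))
handle.remove()  # always remove hooks when done, or they accumulate

print(activations["post_relu"].shape)  # torch.Size([4, 8])
```

For many experiments, `output_hidden_states=True` on a huggingface forward pass gets you the same residual-stream activations with no hook at all; hooks are mainly useful for intermediate modules or for interventions.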
Mistakes to avoid:
Not sweeping learning rates when fine-tuning a model (I would usually sweep around the expected lr. For example, Adam fine-tuning often wants something like lr=1e-4, so I would at least try 1e-5, 3e-5, 1e-4, and 3e-4, then adjust in the next experiment based on train loss and val accuracy, using smaller gaps if I want the best performance and not just a reasonable lr)
Not shuffling your dataset (it’s free, don’t rely on someone else shuffling your dataset)
This is important not just for training datasets, but also when taking n elements from an evaluation dataset—datasets like Alpaca are not shuffled, so taking the first n elements will not be the same as taking n random points from the dataset.
When distilling one model into another, not sampling at temperature 1 (sampling at temperature 1 and training on it is what should copy the original model probs)
When evaluating a model, only sampling at temperature 1 (sampling at a lower temperature often works better)
Using an instruction-tuned model without using the official prompt template (usually tokenizer.apply_chat_template)
Not including the end-of-sequence token when fine-tuning on an instruction-tuning dataset
Not sweeping in logspace when sweeping hyperparams like temperature, lr, etc.
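A logspace sweep just means the values are evenly spaced multiplicatively rather than additively; a stdlib-only helper (equivalent to numpy's geomspace) makes this concrete:

```python
def logspace(lo: float, hi: float, n: int) -> list[float]:
    """n points evenly spaced in log space between lo and hi (inclusive)."""
    return [lo * (hi / lo) ** (i / (n - 1)) for i in range(n)]

# e.g. a 4-point lr sweep: each value is the same constant ratio (~3.1x)
# above the previous one, instead of linear steps that waste budget
# near one end of the range.
lrs = logspace(1e-5, 3e-4, 4)
print(lrs)
```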
Not training for long enough / on enough datapoints—fine-tune on at least 1k trajectories before giving up. Runs with fewer than 1k trajectories are not that informative.
Using a huggingface model forward pass with padding side left without prepare_inputs_for_generation
Using await in a for loop (in particular when awaiting LLM calls, always use gather)
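To illustrate the difference (the `fake_llm_call` below is a placeholder for a real async API call): awaiting inside the loop serializes the calls, while `asyncio.gather` runs them concurrently.

```python
import asyncio

async def fake_llm_call(prompt: str) -> str:
    await asyncio.sleep(0.1)  # stand-in for real API latency
    return prompt.upper()

async def main(prompts):
    # Bad: `[await fake_llm_call(p) for p in prompts]` takes n * 0.1s.
    # Good: gather fires all requests at once, ~0.1s total.
    return await asyncio.gather(*(fake_llm_call(p) for p in prompts))

results = asyncio.run(main(["a", "b", "c"]))
print(results)  # ['A', 'B', 'C']
```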
If your LLM requests benefit from prefix caching, not using a semaphore to make sure queries with a shared prefix happen within 5 minutes of each other
Using python’s built-in hash instead of md5 when you want a hash that is consistent over runs
Using random.Random(some object that is not a str or an int) (it sometimes uses the object’s id as seed, which varies between runs)
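A sketch of the fix for both pitfalls at once: derive an integer seed from an md5 digest (stable across runs and machines, unlike built-in `hash`, which is salted per process for strings) and pass that to `random.Random`:

```python
import hashlib
import random

def stable_seed(obj) -> int:
    """Deterministic across runs; assumes repr(obj) is itself deterministic
    (true for tuples of strs/ints, not for arbitrary objects)."""
    return int.from_bytes(hashlib.md5(repr(obj).encode()).digest()[:8], "big")

rng = random.Random(stable_seed(("split", "train", 42)))
sample = rng.sample(range(100), 5)
print(sample)  # same 5 indices on every run
```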
Not using vLLM or another specialized inference library instead of huggingface’s model.generate (which can be 30x slower)
Using model.generate is fine for small mid-run evals. If you do this, at least batch your generations.