At the start of my Ph.D. 6 months ago, I was generally wedded to writing “good code”, the kind you learn in school and in standard software engineering these days: object-oriented, DRY, extensible, well-commented, and unit tested.
I’m doing a physics PhD, and you’re making me feel better about my coding practices. I appreciate your explicit example as well, since I’m interested in trying my hand at ML research and curious what it looks like in terms of toolsets and the typical sort of thing one works on. I want to chime in down here in the comments to assure people that at least one horrible coder in a field which (most of the time) has nothing to do with machine learning thinks the sentiment of this post is true. I admit that I’m biased by having very little formal CS training, so proper functional programming is more difficult for me than writing whatever has worked for me in the past, in the style of ad-hoc Bash scripts. My sister is a professional software developer, and she winces horribly at my code. However, as you point out, any particular piece of research code usually has a linear set of tasks to achieve, and so:
You don’t need to worry much about resilient code which handles weird edge cases.
It is often better to have everything in one place where you can see it than to have a bunch of broken up functions scattered across a folder full of files.
Nobody else will need to use the code later, including yourself, so legibility is less important.
As an example of the Good/Good-enough divide, here’s a project I’m working on. I’m doing something which requires speed, so I’m using C++ code built on top of old code someone else wrote. I’m extremely happy that the previous researcher did not follow your advice, at least when they cleaned up the code for publishing, because it makes life easier for me to have most of the mechanics of my code hidden away out of view. Their code defines a bunch of custom types which rather intuitively match certain physical objects, and they wrote a function which parses arg files so that you don’t need to recompile the code to rerun a calculation with different physical parameters.

Then there’s my code which uses all of that machinery. My main function is obviously a nest of loops over discrete tasks which could easily be separate functions, but I just throw them all together into one file, and I rewrite the whole file for different research questions, so I have a pile of “main” files which reuse a ton of structure. As an example of a really ugly thing I did, I hard-coded the indices corresponding to the momenta I want to study into the front of my program, instead of writing a momentum-parsing function and providing an argument file listing the sets I want. I might have done the latter for the sake of prettiness, but I needed a structure which lets me easily find momenta of opposite parity. Hard-coding the momenta kept that structure at front of mind when I created the four other subtasks in the code which exploit it to find objects of opposite parity.
thanks for the detailed (non-ML) example! exactly the kind of thing I’m trying to get at
also see Cognitive Load Is What Matters
still trying to figure out the “optimal” config setup. The “clean code” method is roughly to have dedicated config files for different components that can be composed, overridden, etc. (see for example https://github.com/oliveradk/measurement-pred). But I don’t like how far away these configs are from the main code. On the other hand, as the experimental setup gets more mature I often want to toggle across config groups. Maybe the solution is making a “mode” an optional config itself, with overrides within the main script, e.g. something like the sketch below.
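A minimal sketch of that “mode” idea (hypothetical names and values, just to show the shape: a base config next to the main script, and each mode as a named set of overrides):

```python
from dataclasses import dataclass, replace

@dataclass
class Config:
    lr: float = 3e-4
    batch_size: int = 64
    n_layers: int = 4

# Each "mode" is a named set of overrides, kept in the main script
# so it stays close to the code that uses it.
MODES = {
    "debug": dict(batch_size=8, n_layers=1),
    "full": dict(batch_size=256),
}

def get_config(mode: str | None = None) -> Config:
    cfg = Config()
    if mode is not None:
        cfg = replace(cfg, **MODES[mode])  # apply the mode's overrides
    return cfg

if __name__ == "__main__":
    cfg = get_config("debug")
    print(cfg)  # Config(lr=0.0003, batch_size=8, n_layers=1)
```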
I think you’d like Casey Muratori’s advice. He’s a software dev who argues that “clean code” as taught is actually bad, and that the way to write good code efficiently is more like the way you did it intuitively before you were taught OOP and such. He advises “Semantic Compression” instead: essentially, you just straightforwardly write code that works, then pull out and reuse the parts that get repeated.
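A toy illustration of the idea (my own example, not Muratori’s):

```python
import matplotlib.pyplot as plt

# First pass: write it straight, duplication and all.
def plot_train_loss(history):
    fig, ax = plt.subplots()
    ax.plot(history["train_loss"])
    ax.set_xlabel("step")
    ax.set_ylabel("train_loss")
    fig.savefig("train_loss.png")

def plot_val_loss(history):
    fig, ax = plt.subplots()
    ax.plot(history["val_loss"])
    ax.set_xlabel("step")
    ax.set_ylabel("val_loss")
    fig.savefig("val_loss.png")

# Second pass: the repetition is now staring at you, so compress it.
def plot_metric(history, key):
    fig, ax = plt.subplots()
    ax.plot(history[key])
    ax.set_xlabel("step")
    ax.set_ylabel(key)
    fig.savefig(f"{key}.png")
```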
just read both posts and they’re great (as is The Witness). It’s funny though, part of me wants to defend OOP: I do think there’s something to finding really good abstractions (even preemptively), but it’s typically not worth it for self-contained projects with small teams and fixed time horizons (e.g. ML research projects, but also maybe indie games).
Couldn’t agree more with this post! I used to be afraid of long notebooks, but they are powerful in allowing me to just think.
Although while creating a script I tend to use VS Code’s “# %%” cell markers to run cells inside the script and test stuff. My notebooks usually contain a bunch of analysis code that doesn’t need to be rerun, but should stay.
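Concretely, it looks something like this (toy example; the file and variable names are made up):

```python
# In VS Code, a line starting with "# %%" marks a cell boundary; the
# Python extension lets you run each cell in the Interactive Window
# (Shift+Enter) without executing the whole script.

# %%
import numpy as np
data = np.load("results.npy")  # hypothetical results file

# %% quick sanity check, rerun freely while iterating
print(data.shape, data.mean())

# %% analysis that rarely needs rerunning, but should stay in the file
per_run_means = data.mean(axis=0)
```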
Thanks! huh yeah, the Python interactive window seems like a much cleaner approach, I’ll give it a try
This is a great post, and I like the research process. Do you know if the LLM code completion in Cursor is compatible with .ipynb notebooks?
thanks! yup, Cursor is notebook compatible