Research engineering tips for SWEs. Starting from a more SWE-based paradigm on writing ‘good’ code, I’ve had to unlearn some stuff in order to hyper-optimise for research engineering speed. Here’s some stuff I now do that I wish I’d done starting out.
Use monorepos.
As far as possible, put all code in the same repository. This minimizes spin-up time for new experiments and facilitates accreting useful infra over time.
A SWE’s instinct may be to spin up a new repo for every new project, e.g. to keep dependencies separate. But that will rarely be an issue (90+% of projects), and you pay the setup cost upfront, which is bad.
Experiment code as a journal.
By default, code for experiments should start off in an ‘experiments’ folder, with each sub-folder running one experiment.
I like structuring this as a journal / logbook, e.g. sub-folders titled YYYY-MM-DD-{experiment-name}. This facilitates subsequent lookup (see the sketch after this list).
If you present / track your work in research slides, this creates a 1-1 correspondence between your results and the code that produced them, which is great for later reproducibility.
Each sub-folder should have a single responsibility, i.e. running ONE experiment. Don’t be afraid to duplicate code between sub-folders.
Different people can have different experiment folders.
I think this is fairly unintuitive for a typical SWE, and would have benefited from knowing / adopting this earlier in my career.
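For concreteness, a minimal sketch of scaffolding one of these dated experiment sub-folders might look something like the following; the helper name and stub files are my own illustrative choices, not prescribed by anything above:

```python
# Illustrative sketch only: scaffold a new dated experiment sub-folder.
from datetime import date
from pathlib import Path

def new_experiment(name: str, root: str = "experiments") -> Path:
    """Create experiments/YYYY-MM-DD-{name}/ with a couple of stub files."""
    folder = Path(root) / f"{date.today():%Y-%m-%d}-{name}"
    folder.mkdir(parents=True, exist_ok=True)
    (folder / "run.py").touch()    # single entry point: runs ONE experiment
    (folder / "notes.md").touch()  # notes linking results back to slides
    return folder

if __name__ == "__main__":
    print(new_experiment("probe-layer-sweep"))
```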
Refactor less (or not at all).
Stick to simple design patterns. For one-off experiments, I use functions fairly frequently, and almost never use custom classes or more advanced design patterns.
Implement only the minimal necessary functionality. Learn to enjoy the simplicity of hardcoding things (see the sketch after this list). YAGNI.
Refactor when—and only when—you need to or can think of a clear reason.
Being OCD about code style / aesthetic is not a good reason.
Adding functionality you don’t need right this moment is not a good reason.
Most of the time, your code will not be used more than once. Writing a good version doesn’t matter.
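To make that concrete, the kind of one-off script this points at looks roughly like the sketch below; everything in it (the constants, the toy computation) is made up purely for illustration:

```python
# Illustrative one-off experiment script: constants hardcoded at the top,
# plain functions, no config system or custom classes. All values are toy.
import random

SEED = 0
N_SAMPLES = 1_000   # tweak by editing, not by adding CLI flags
NOISE_STD = 0.1

def run() -> float:
    """Correlation between a toy signal and a noisy copy of it."""
    rng = random.Random(SEED)
    xs = [rng.gauss(0.0, 1.0) for _ in range(N_SAMPLES)]
    ys = [x + rng.gauss(0.0, NOISE_STD) for x in xs]
    mean_x, mean_y = sum(xs) / N_SAMPLES, sum(ys) / N_SAMPLES
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / N_SAMPLES
    var_x = sum((x - mean_x) ** 2 for x in xs) / N_SAMPLES
    var_y = sum((y - mean_y) ** 2 for y in ys) / N_SAMPLES
    return cov / (var_x * var_y) ** 0.5

if __name__ == "__main__":
    print(f"correlation: {run():.3f}")
```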
Good SWE practices. There are still a lot of things that SWEs do that I think researchers should do, namely:
Use modern IDEs (Cursor). Use linters to check code style (Ruff, Pylance) and fix where necessary. The future-you who has to read your code will thank you.
Write functions with descriptive names, type hints, docstrings. Again, the future-you who has to read your code will thank you.
Unit tests for critical components. If you use a utility a lot, and it’s pretty complex, it’s worth refactoring it out and testing it (see the sketch after this list). The future-you who has to debug your code will thank you.
Gold star if you also use GitHub Actions to run the unit tests each time new code is committed, ensuring main always has working code.
Caveat: SWEs probably over-test code for weird edge cases. There are fewer edge cases in research since you’re the primary user of your own code.
Pull requests. Useful to group a bunch of messy commits into a single high-level purpose and commit that to main. Makes your commit history easier to read.
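As a rough illustration of the “descriptive names, type hints, docstrings” and unit-test points above, a typed, documented utility with a couple of focused tests might look like this (the function and its tests are invented for the example, and the tests assume pytest is available):

```python
# A sketch of the kind of utility worth typing, documenting, and testing.
from collections.abc import Sequence

import pytest

def moving_average(values: Sequence[float], window: int) -> list[float]:
    """Return the simple moving average of `values` with the given window size."""
    if window <= 0:
        raise ValueError("window must be positive")
    return [
        sum(values[i : i + window]) / window
        for i in range(len(values) - window + 1)
    ]

# A couple of tests for the critical behaviour only, not every edge case.
def test_moving_average_basic() -> None:
    assert moving_average([1.0, 2.0, 3.0, 4.0], window=2) == [1.5, 2.5, 3.5]

def test_moving_average_rejects_bad_window() -> None:
    with pytest.raises(ValueError):
        moving_average([1.0], window=0)
```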
My current setup
Cursor + Claude for writing code quickly
Ruff + Pyright as Cursor extensions for on-the-go linting.
PDM + UV for Python dependency management
Collaborate via PRs. Sometimes you’ll need to work with other people in the same codebase. Here, only make commits through PRs and ask for review before merging. It’s more important here to apply ‘Good SWE practices’ as described above.
I think “refactor less” is bad advice for substantial shared infrastructure. It’s good advice only for your personal experiment code.
This seems likely to depend on your preferred style of research, so what is your preferred style of research?
Good question! These practices are mostly informed by doing empirical AI safety research and mechanistic interpretability research. These projects emphasize fast initial exploratory sprints, with later periods of ‘scaling up’ to improve rigor. Sometimes most of the project is in exploratory mode, so speed is really the key objective.
I will grant that in my experience, I’ve seldom had to build complex pieces of software from the ground up, as good libraries already exist.
That said, I think my practices here are still compatible with projects that require more infra. In these projects, some of the work is building the infra, and some of the work is doing experiments using the infra. My practices will apply to the second kind of work, and typical SWE practices / product management practices will apply to the first kind of work.
I’m a bit skeptical; there’s a reasonable amount of passed-down wisdom I’ve heard claiming (I think justifiably) that:
1. If you write messy code and say “I’ll clean it later”, you probably won’t. So insofar as you eventually want to discover something others build upon, you should write it clean from the start.
2. Clean code leads to easier extensibility, which seems pretty important, e.g. if you want to try a bunch of different small variations on the same experiment.
3. Clean code decreases the number of bugs and the time spent debugging. This seems especially useful insofar as you are trying to rule out hypotheses with high confidence, or prove hypotheses with high confidence.
4. Generally (this may be double-counting 2 and 3), writing clean code is, paradoxically, faster than writing dirty code.
You say you came from a more SWE based paradigm though, so you probably know all this already.
Yeah, I agree with all this. My main differences are:
I think it’s fine to write a messy version initially and then clean it up when you need to share it with someone else.
By default I write “pretty clean” code, insofar as this can be measured with linters, because this increases readability-by-future-me.
Generally, I think there may be a Law of Opposite Advice type effect going on here, so I’ll clarify where I expect this advice to be useful:
You’re working on a personal project and don’t expect to need to share much code with other people.
You started from a place of knowing how to write good code, and could benefit from relaxing your standards slightly to optimise for ‘hacking’. (It’s hard to realise this by yourself—pair programming was how I discovered this)
Pull requests. Useful to group a bunch of messy commits into a single high-level purpose and commit that to main. Makes your commit history easier to read.
You can also squash multiple commits without using PRs. In fact, if someone has meticulously edited their commit history so that the PR is easy to follow, with each commit grouping one logical unit of change, squashing their commits can be actively bad: you destroy that structure and make the history less readable by collapsing it into a single messy commit.
When I try to get most SWEs to create nicer commit histories, I get pushback. Sure, not knowing the tools (mostly git add -p and git rebase -i, tbh) can be a barrier, but showing them nice commit histories does not motivate them to learn the tools used to create them. They don’t seem to see the value in a nice commit history.[1]
Which makes me wonder: why do you advocate for putting any effort into the git history for research projects (saying that “It’s more important here to apply ‘Good SWE practices’”), when even 99% of SWEs don’t follow good practices here? (Is looking back at the history maybe more important for research than for SWE, as you describe research code being more journal-like?)
[1] Which could maybe be because they also don’t know the tools that can extract value from a nice commit history? E.g. using git blame or git bisect is a much more pleasant experience with a nice history.
IMO it’s mainly useful when collaborating with people on critical code, since it helps you clearly communicate the intent of the changes. Also you can separate out anything which wasn’t strictly necessary. And having it in a PR to main makes it easy to revert later if the change turned out to be bad.
If you’re working by yourself, or if the code you’re changing isn’t very critical, it’s probably not as important.