Two Designs

Here’s a concrete example of two approaches to a software problem, each with different advantages, and circumstances that would lead me to choose one over the other.

Recently I wrote a job to create and save datasets that are used for downstream work. The code is run through a scheduler that will spin up a worker and execute it, so it has to be decoupled from any upstream or downstream jobs.

The first design I considered is pretty standard: we tell the job which datasets we want, validate that the parameters needed for the dataset are present, and then build those datasets.

def main(make_dataset1: bool, make_dataset2: bool, make_dataset3: bool, a=None, b=None, c=None, d=None):
	if make_dataset1:
		assert a is not None and b is not None
		build_dataset1(a, b)
	if make_dataset2:
		assert c is not None and d is not None
		build_dataset2(c, d)
	if make_dataset3:
		assert b is not None and d is not None
		build_dataset3(b, d)

In the second approach, instead of telling the job which datasets to build, we pass it all the parameters we have, and the job builds whatever datasets it can.

def main(params: dict):
	for dataset_builder in [build_dataset1, build_dataset2, build_dataset3]:
		try:
			dataset_builder(**params)
		except TypeError:  # raised when required arguments are missing
			print(f"Skipping job {dataset_builder.__name__} due to missing parameters")
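To see the pass-through pattern end to end, here's a hypothetical, self-contained sketch (stub builders standing in for the real ones). One detail worth noting: for `**params` to work when we pass all the parameters we have, each builder needs to tolerate extra keyword arguments (here via `**_`), while a genuinely missing argument still raises TypeError and gets skipped.

```python
# Stub builders standing in for the real build_dataset functions.
# Each accepts **_ so extra parameters are ignored rather than rejected.

def build_dataset1(a, b, **_):
	return f"dataset1({a}, {b})"

def build_dataset2(c, d, **_):
	return f"dataset2({c}, {d})"

def build_dataset3(b, d, **_):
	return f"dataset3({b}, {d})"

def main(params: dict):
	built = []
	for dataset_builder in [build_dataset1, build_dataset2, build_dataset3]:
		try:
			built.append(dataset_builder(**params))
		except TypeError:  # raised when a required argument is missing
			print(f"Skipping job {dataset_builder.__name__} due to missing parameters")
	return built

main({"a": 1, "b": 2, "d": 4})  # builds datasets 1 and 3; dataset 2 is skipped (no "c")
```

One caveat with this pattern: a TypeError raised inside a builder's body (a genuine bug) is caught and silently treated as "missing parameters", so it trades some debuggability for decoupling.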

These are pretty different approaches, representing two poles of design – let’s call them explicit and pass-through. (In-between versions exist too, but let’s ignore them for now.) What are some tradeoffs between them?

  • In the explicit version, code calling the job needs to know about the different datasets and parameters. This makes it tightly coupled to the details of the build_dataset functions, or at least their signatures. In the pass-through approach, the calling code can just submit the job to a scheduler – it doesn’t know anything about those functions, and doesn’t need to change when we modify them.

  • Similarly, the build_dataset functions all have different signatures, making it hard to abstract over them and requiring us to change the job code if the functions ever change. In the pass-through approach, we can call them all in the same way, and only need to change the job code if we add or remove a dataset.

  • In the first version, it’s easier to be explicit about exactly which datasets we want, and to fail loudly if one or more can’t be created. In the second, our intentions for which datasets to make aren’t found anywhere in the code, and it’s easier to quietly fail, leading to a downstream error.
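To make the last point concrete, here's a hypothetical sketch (stub names, not the real pipeline) of the quiet-failure mode: the pass-through job swallows the missing parameter and exits cleanly, and the error only surfaces when a downstream consumer goes looking for the dataset.

```python
saved = {}  # stand-in for wherever the job saves its datasets

def build_dataset2(c, d, **_):
	saved["dataset2"] = (c, d)

params = {"d": 4}  # "c" was accidentally dropped upstream

# Pass-through job: the missing parameter is swallowed here...
try:
	build_dataset2(**params)
except TypeError:
	print("Skipping build_dataset2 due to missing parameters")

# ...the job exits cleanly, and only a downstream consumer blows up:
# saved["dataset2"]  # KeyError: 'dataset2', minutes later in another job

# Explicit job: the same mistake fails loudly inside the job itself:
# assert "c" in params and "d" in params  # AssertionError, right away
```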

In short, the explicit version is a bit safer, but suffers from excess coupling – we have to update many more parts of the code when things change.

I went with the pass-through approach. Why?

First, the safety offered by the explicit approach consists entirely of failing earlier. This is a good thing, but in my use case the pipeline takes about 10 minutes for the workers to start up, but then each job runs relatively quickly. In the explicit approach, we’ll get a loud failure right around when the workers spin up, which means 10 minutes in. In the pass-through approach, we get that failure about 12 minutes in. So the extra safety is pretty marginal in this case.

(If we were submitting to an always-running cluster, the extra safety would be much more valuable, because the failure times might look more like 30 seconds and 150 seconds, meaning the first gives us a 5x faster feedback loop instead of just a 20% improvement.)

Next, I expected the explicit version to give us a bunch of extra failures – since a common change would have to modify 3 parts of the code instead of 1, the odds that I would make a mistake were pretty high, and this mistake would result in a loud failure 10 minutes in.

(If this could be easily statically type checked and give us a compile-time error instead of a runtime error, that would be a big point in favor of the explicit approach.)
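For instance – a hypothetical sketch, not the actual code – if each dataset had its own small typed entry point, a static checker like mypy could flag a missing parameter before the job is ever submitted:

```python
# Hypothetical sketch: per-dataset typed entry points let a static
# checker such as mypy catch a missing argument before anything runs.

def build_dataset3(b: int, d: int) -> None:
	...  # stand-in for the real builder

def main_dataset3(b: int, d: int) -> None:
	build_dataset3(b=b, d=d)

main_dataset3(b=2, d=4)   # fine
# main_dataset3(b=2)      # mypy: error: Missing named argument "d"
```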

Last, the cost of having smart calling code was quite high for us – since our scheduler (AWS Batch) could only handle logic of the form “run job 2 after job 1”, it meant that I would essentially need to run another coordinating process that would duplicate some but not all of the work of the scheduler. This adds another moving part that can fail, and ties up a development machine.

(If we had already been using a more capable scheduling framework like Airflow, this wouldn’t be a big deal. But the one-time cost of setting up such a framework is pretty high, and I didn’t think it was worth it.)