This reminds me that not everyone knows what I know.
If you work with distributed systems, by which I mean any system that must pass information between multiple, tightly integrated subsystems, there is a well-understood concept of maximum sustainable load, and that number turns out to be roughly 60% of maximum possible load for all systems.
I don’t have a link handy to show you the math, but the basic idea is that the probability that one subsystem will have to wait on another rises steeply with the total load on the system, and the load level that maximizes throughput (the total amount of work done by the system over some period of time) comes in just above 60%. If you do less work, you are wasting capacity (in terms of throughput); if you do more work, you will gum up the works and waste time waiting even if all the subsystems are always busy.
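As a concrete illustration, here is a minimal sketch simulating the simplest textbook queueing model (an M/M/1 queue: Poisson arrivals, exponential service times); the helper name mm1_avg_time is made up for this example:

```python
import random

def mm1_avg_time(load, n_jobs=200_000, seed=0):
    """Simulate an M/M/1 queue at the given load (utilization, 0 < load < 1)
    and return the average time a job spends in the system, measured in
    multiples of the mean service time."""
    rng = random.Random(seed)
    service_rate = 1.0                  # mean service time = 1
    arrival_rate = load * service_rate
    t = 0.0                             # arrival time of the current job
    server_free_at = 0.0                # when the server clears its backlog
    total_time = 0.0
    for _ in range(n_jobs):
        t += rng.expovariate(arrival_rate)   # next arrival
        start = max(t, server_free_at)       # wait if the server is busy
        server_free_at = start + rng.expovariate(service_rate)
        total_time += server_free_at - t     # waiting + service
    return total_time / n_jobs

for load in (0.3, 0.6, 0.8, 0.9, 0.95):
    print(f"load {load:.2f}: avg time in system ≈ {mm1_avg_time(load):.1f}× a job's service time")
```

At 60% load a job spends roughly 2.5× its bare service time in the system; by 90% it is around 10×, and the curve blows up as load approaches 100%.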
We normally deal with this in engineering contexts, but, as is so often the case, this property holds for basically anything that looks sufficiently like a distributed system. Thus the “operate at 60% capacity” rule of thumb will maximize throughput in lots of scenarios: assembly lines, service-oriented architecture software, coordinated work within any organization, an individual’s work (since it is normally made up of many tasks that information must be passed between, with the topology spread out over time rather than space), and, perhaps most surprisingly, an individual’s mind-body.
“Slack” is a decent way of putting this, but we can be pretty precise and say you need ~40% slack to optimize throughput: more and you tip into being “lazy”, less and you become “overworked”.
This comment is still real good. Some thoughts I’ve been reflecting on re: 60% rule:
I’ve roughly been living my life at 60% for the past year (after a few years of doing a lot of over-extension and burnout). It has… been generally good.
I do notice that, say, at the last NYC Winter Solstice, I was operating at like 105% capacity. And the result was a more polished, better thing than would have existed if I’d been operating at 60%. Locally, you do get more results if you burn your reserves, esp if you are working on something with a small number of moving parts. This is why it’s so easy to fall into the trap.
But, like, you can’t just burn your reserves all the time. It just doesn’t work.
And if you’re building a thing that is bigger than you and requires a bunch of people coordinating at once, I think it’s especially important to keep yourself _and_ your allies working at 60%. This is what maximizes the awesomeness-under-the-curve.
The main worry I have about my current paradigm, where I think I’m lacking: most of the 60%-worth of effort I spend these days doesn’t really go into _improvement_ of my overall capacity, or especially into building new skills. A year-ish ago, when I dialed back my output to something sustainable, I think I cut into my self-improvement focus. (Before moving to Berkeley, I HAD focused efforts on raising the capacity of people around me, which had a similar effect, but I don’t think I’ve done that _much_ since moving here.)
My good action of the day is to have fallen down the rabbit hole of discovering the justification behind your comment.
First, it’s more queueing theory than distributed systems theory (slightly pedantic, but I’m more used to the latter, which explained my lack of knowledge of this result).
Second, even if you look through queueing theory resources, it’s not that obvious where to look. I’ve finally found a helpful blog post which basically explains how, under basic models of queues, the average latency behaves like 1/(1 − load); the post shows the corresponding graph (utilization is used instead of load, but AFAIK these are the same thing).
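For reference, the standard single-queue (M/M/1) result behind that curve, from textbook queueing theory rather than the linked post: with arrival rate $\lambda$, service rate $\mu$, and utilization $\rho = \lambda/\mu$, the average time a job spends in the system is

$$W = \frac{1}{\mu - \lambda} = \frac{1}{\mu} \cdot \frac{1}{1 - \rho},$$

which stays finite but grows without bound as $\rho \to 1$.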
This post and a bunch of other places mention 80% rather than 60%, but that’s not that important for the point IMO.
One thing I wonder is how this result changes with more complex queueing models, but I don’t have the time to look into it. Maybe this PDF (which also includes the maths for the mentioned derivation) has the answer.
The “helpful blog post” is down; here it is on the Wayback Machine.
This sounds very interesting. I’m curious if anyone has a reference for the 60% number? (a few simple google searches for “maximum sustainable load distributed system” and variants didn’t turn up anything)
This was a brand-new concept to me, and I suspect it’s going to immediately be useful to me. Thank you very much for taking the time to explain clearly.
I’ve come across this under the name “queueing theory”, but I suspect it has been discovered many times.
Huh. That’s for a well-defined concrete goal outside the system that doesn’t require substantial revision by the system you’re evaluating. This suggests that if you see yourself operating at 60% of capacity, you’re at neutral (neither accumulating debt nor making progress) with respect to metacognitive slack. Fortunately, if your metacognition budget is very small, even a very small investment—say, the occasional sabbath—can boost the resources devoted to it by one or more orders of magnitude, which should show substantial results even with diminishing returns.
I saw this explained well in a book called The Phoenix Project. The book talks about what software development can learn from decades of manufacturing process improvements.
This blog post shows the graph presented by the book, which makes a point similar to yours but more general, and further formalizes the Slack concept.
Seems it’s hard to pin down the source of this concept, but it apparently follows from Little’s Law.
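For reference, Little’s Law is the identity

$$L = \lambda W,$$

where $L$ is the average number of items in the system, $\lambda$ the average arrival rate, and $W$ the average time an item spends in the system. For example, if requests arrive at 10 per second and each spends 0.5 seconds in the system, then on average 5 requests are in flight. On its own it doesn’t single out a sweet spot; that comes from how waiting time grows with load, as in the $1/(1-\text{load})$ curve above.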
Could you elaborate on this bit? Or maybe give an example of what you mean?
I don’t know of a great way to phrase this so it doesn’t get mixed up with notions of personal productivity, but the basic idea here is that you are yourself a complex system made of many distributed parts that pass information around, and so you should expect that if you try to operate above 60% of your maximum capacity, you’ll experience problems.
Take physical exercise, for example. If you are doing about 60% of what you are capable of doing, you’ll probably be able to do it for a long time, because you are staying safely below the point of exhausting energy reserves and damaging cells/fibers faster than they can be repaired. Of course this is a bit of a simplification: different subsystems have different limits, and you’ll run into problems if you work any subsystem over 60% of its capacity, so your limiting factor is probably not respiration but something related to repair and replacement, and you may have to operate overall at less than 60% capacity to keep the constraining subsystem from being overworked. Thus you can, say, walk 20 miles at a slow pace with no problem and no need to rest, but will tire out and need to rest after running 5 miles at top speed.
The same sort of thing happens with mental activities: if you concentrate too hard for too long, you lose the ability to concentrate for a while, but if you concentrate lightly, you can keep it up for a long time (think trying to read in a noisy room vs. trying to read in a quiet room). It doesn’t really matter exactly how this happens (ego depletion? homeostatic regulation encouraging you to meet other needs?); the point is that something is getting overworked and needs time to clear and recover.
To sneak in an extra observation here, it’s notable where this doesn’t happen, or only rarely happens. Systems that need high uptime tend to be so highly integrated that information doesn’t have to be shared between parts; instead the parts contain a lot of mutual information. For example, the respiration system doesn’t so much pass information between its parts as act as a tightly integrated whole that immediately responds to changes in any of its parts, because it can afford zero downtime: it has to keep working even while parts are degraded and being repaired. But high integration introduces quadratic complexity scaling with system size, so tight integration is limited by how much complexity the system can manage without being subject to unexpected complex failures. Beyond some scaling limit of what evolution/engineering can manage, systems can only get more complex by being less integrated than their subsystems, and so you run into this capacity issue everywhere we are not so unlucky as to need 100% uptime.
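One way to read the “quadratic complexity” point: in a fully integrated system of $n$ parts where every part may need to respond to every other, there are $\binom{n}{2} = \frac{n(n-1)}{2}$ pairwise couplings to keep coherent, which grows quadratically in $n$.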
Comparing peak to sustainable running speeds: the average speed of the world marathon record is 55% of that of the world 100m record. For the Olympic men’s qualifying times, the ratio is 54%. Both are quite close to 60%.
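A quick back-of-the-envelope check of that ratio (a sketch; the figures are assumed approximate world records around the time of writing: 2:01:39 for the marathon and 9.58 s for 100m):

```python
# Rough sanity check of the marathon-vs-sprint speed ratio.
# Assumed figures: marathon world record ~2:01:39, 100m world record 9.58 s.
marathon_m = 42_195                           # marathon distance in meters
marathon_s = 2 * 3600 + 1 * 60 + 39           # 2:01:39 in seconds
marathon_speed = marathon_m / marathon_s      # ~5.78 m/s
sprint_speed = 100 / 9.58                     # ~10.44 m/s
print(f"marathon/sprint speed ratio ≈ {marathon_speed / sprint_speed:.0%}")  # ≈ 55%
```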