This is really, really hard to internalize. The default is to pay uniformly less attention to everything, e.g. switching to skimming every PR rather than randomly reviewing a few in detail. But that default means you lose a valuable feedback loop, while spot-checking even 10% sustains it.
SatvikBeri
I’ve seen this scale to 100-person companies, and I think it scales much further.
I would describe the problem as follows: to reliably hire people who are 90th percentile at a skill, you (or someone you trust) need to be at least 75th percentile at that skill. To hire 99th percentile, you need to be at least 90th percentile. To avoid hiring total charlatans, you need to be at least 25th percentile. And so on.
25th percentile is a relatively small investment! 75th is more, but still often 1-2 weeks depending on the domain. It’s almost certainly worth it to make sure your team has at least one person you trust who’s 75th percentile at any task you repeatedly hire for, even if you’re hiring contractors. And if it’s a key competency for the company, it can be worth the work to get to 90th percentile so you can hire really outstanding people in that domain.
I’d say the vast majority of cases I know of, within the US, involve firing too late rather than too early. The reason is straightforward: it sucks to fire people, knowing that you’ll potentially have a severe negative impact on their life, so most managers will put it off as long as possible.
Broadly agree. One thing I’ll add is that you should structure a “piece” around your points of highest uncertainty, and a common mistake I see is for companies to iterate on the wrong thing. Real examples from my career:
If you’re trying to make a large, offline data pipeline faster, then focusing on deploying to users is much less useful and typically slower than testing different performance improvements internally (and eventually pushing the best design).
Conversely, if you’re uncertain whether people will want a product in the first place, try selling preorders first
If your major uncertainty is whether certain libraries will do what you want, start by building out those components
If you’re trying to quickly get better at public speaking, practice a lot on your own! There are a lot of obvious errors and habits you can iron out. External feedback is more valuable, but also a lot more expensive – you can practice 10 5-minute talks in an hour by yourself, while you’ll probably only be able to do one in a Toastmasters group
If you’re a hedge fund, it’s worth quickly backtesting a lot of algorithms written in a non-production way (e.g. they run slowly and use a lot of memory) rather than insisting that everything be written to production standards before ever testing it
Even in real assembly lines at physical factories, I get the impression that generalization has often worked well because it gives you the ability to change your process. Toyota is/was considered best-in-class, and a major innovation of theirs was having workers rotate across different areas, becoming more generalized and more able to suggest improvements to the overall system, with some groups rotating every 2 hours[1].
Tesla famously reduced automation around 2018 even where the robots’ marginal costs were lower than human operators’, again because the lost flexibility wasn’t worth it.[2] Though it’s worth noting they started investing more in robots again in recent years, presumably once their process was more solidified[3].
[1]: https://michelbaudin.com/2024/02/07/toyotas-job-rotation-policy/
[2]: https://theconversation.com/teslas-problem-overestimating-automation-underestimating-humans-95388
[3]: https://newo.ai/tesla-optimus-robots-revolutionize-manufacturing/
When I write code, I try to make most of it data-to-data transformations xor code that only takes in a piece of data and produces some effect (such as writing to a database). This significantly narrows the search space of a lot of bugs: either the data is wrong, or the do-things-with-data code is wrong.
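A minimal sketch of that split, with hypothetical names and a plain list standing in for the database:

```python
from dataclasses import dataclass

@dataclass
class Order:
    price: float
    quantity: int

# Pure data-to-data transformation: no hidden state, trivial to test.
def total_revenue(orders: list[Order]) -> float:
    return sum(o.price * o.quantity for o in orders)

# Effectful code: takes data in, produces an effect, holds no business logic.
def write_report(revenue: float, sink: list[str]) -> None:
    sink.append(f"revenue={revenue:.2f}")

orders = [Order(10.0, 3), Order(2.5, 4)]
db: list[str] = []
write_report(total_revenue(orders), db)
print(db)  # ['revenue=40.00']
```

If a number comes out wrong, the bug is in total_revenue; if the write is wrong, it’s in write_report. The two failure modes don’t mix.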
There are a lot of tricks in this reference class, where you try to structure your code to constrain the spaces where possible bugs can appear. Another example: when dealing with concurrency/parallelism, write the majority of your functions to operate on a single thread. Then have a separate piece of logic that coordinates workers/parallelism/etc. This is much easier to deal with than code that mixes parallelism and nontrivial logic.
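A sketch of the concurrency version, assuming a hypothetical normalize step; all of the coordination lives in one small function:

```python
from concurrent.futures import ThreadPoolExecutor

# Nontrivial logic lives in plain single-threaded functions...
def normalize(record: dict) -> dict:
    return {k.lower(): v for k, v in record.items()}

# ...and one small coordinator owns all of the parallelism.
def run_parallel(records, fn, max_workers=4):
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(fn, records))  # map preserves input order

print(run_parallel([{"Name": "a"}, {"NAME": "b"}], normalize))
# [{'name': 'a'}, {'name': 'b'}]
```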
Based on what you described, writing code that constrains the bug surface area to begin with sounds like the next step – and related, figuring out places where your codebase already does that, or places where it doesn’t do that but really should.
I think this is the Sanderson post: https://wob.coppermind.net/events/529/#e16670
For personal relationships, mitigating my worst days has been more important than improving the average.
For work, all that’s really mattered is my really good days, and it’s been more productive to try and invest time in having more great days or using them well than to bother with even the average days.
I really enjoyed this study. I wish it weren’t so darn expensive, because I would love to see a dozen variations of this.
I still think I’m more productive with LLMs since Claude Code + Opus 4.0 (and have reasonably strong data points), but this does push me further in the direction of using LLMs only surgically rather than for everything, and towards recommending relatively restricted LLM use at my company.
It’s really useful to ask the simple question “what tests could have caught the most costly bugs we’ve had?”
At one job, our code had a lot of math, and the worst bugs were when our data pipelines ran without crashing but gave the wrong numbers, sometimes due to weird stuff like “a bug in our vendor’s code caused them to send us numbers denominated in pounds instead of dollars”. This is pretty hard to catch with unit tests, but we ended up applying a layer of statistical checks that ran every hour or so and raised an alert if something was anomalous, and those alerts probably saved us more money than all other tests combined.
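The exact checks we ran aren’t something I can reproduce here, but the flavor is a simple threshold on deviation from recent history, along these lines (all numbers made up):

```python
import statistics

def is_anomalous(history: list[float], latest: float, n_sigma: float = 4.0) -> bool:
    """Flag `latest` if it is more than n_sigma standard deviations from
    the mean of recent history. Crude, but it catches the
    'pipeline ran fine, numbers are wildly off' failure mode."""
    mu = statistics.mean(history)
    sigma = statistics.stdev(history)
    return abs(latest - mu) > n_sigma * sigma

hourly_revenue = [100.2, 99.8, 101.1, 100.5, 99.6]
print(is_anomalous(hourly_revenue, 100.3))  # False: in line with history
print(is_anomalous(hourly_revenue, 45.9))   # True: e.g. pounds instead of dollars
```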
There was a serious bug in this post that invalidated the results, so I took it down for a while. The bug has now been fixed and the posted results should be correct.
One sort-of counterexample would be The Unreasonable Effectiveness of Mathematics in the Natural Sciences, where a lot of math has been surprisingly accurate even when the assumptions were violated.
The Mathematical Theory of Communication by Shannon and Weaver. It’s an extended version of Shannon’s original paper that established Information Theory, with some extra explanations and background. 144 pages.
Atiyah & Macdonald’s Introduction to Commutative Algebra fits. It’s 125 pages long, and it’s possible to do all the exercises in 2-3 weeks – I did them over winter break in preparation for a course.
Lang’s Algebra and Eisenbud’s Commutative Algebra are both supersets of Atiyah & Macdonald. I’ve studied each of those as well and thought A&M was significantly better.
Unfortunately, I think it isn’t very compatible with the way management works at most companies. Normally there’s pressure to get your tickets done quickly, which leaves less time for “refactor as you go”.
I’ve heard this a lot, but I’ve worked at 8 companies so far, and none of them have had this kind of time pressure. Is there a specific industry or location where this is more common?
A big piece is that companies are extremely siloed by default. It’s pretty easy for a team to improve things in their silo; it’s significantly harder to improve something that requires two teams; and it’s nearly impossible to reach beyond that.
Uber is particularly siloed: they have a huge number of microservices with small teams, at least according to their engineering talks on YouTube. Address validation is probably a separate service from anything related to maps, which in turn is separate from contacts.
Because of silos, companies have to make an extraordinary effort to actually end up with good UX. Apple was an example of this, where it was literally driven by the founder & CEO of the company. Tumblr was known for this as well. But from what I heard, Travis was more of a logistics person than a UX person, etc.
(I don’t think silos explain the bank validation issue)
Cooking:
Smelling ingredients & food is a good way to develop intuition about how things will taste when combined
Salt early is generally much better than salt late
Data Science:
Interactive environments like Jupyter notebooks are a huge productivity win, even with their disadvantages
Automatic code reloading makes Jupyter much more productive (e.g. autoreload for Python, or Revise for Julia)
Bootstrapping gives you fast, accurate statistics in a lot of areas without needing to be too precise about theory
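As an illustration of the bootstrapping point, a percentile bootstrap for the mean (data and resample count are arbitrary):

```python
import random
import statistics

def bootstrap_ci(data, stat=statistics.mean, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile-bootstrap confidence interval for an arbitrary statistic."""
    rng = random.Random(seed)
    estimates = sorted(
        stat([rng.choice(data) for _ in range(len(data))])
        for _ in range(n_resamples)
    )
    lo = estimates[int(n_resamples * alpha / 2)]
    hi = estimates[int(n_resamples * (1 - alpha / 2)) - 1]
    return lo, hi

data = [12, 15, 9, 14, 11, 13, 10, 16, 12, 14]
lo, hi = bootstrap_ci(data)
print(f"95% CI for the mean: ({lo:.1f}, {hi:.1f})")
```

The same function works for medians, ratios, trimmed means, etc. by swapping `stat`, which is exactly the “without being too precise about theory” appeal.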
Programming:
Do everything in a virtual environment or the equivalent for your language. Even if you use literally one environment on your machine, the tooling around these tends to be much better
Have some form of reasonably accurate, reasonably fast feedback loop(s). Types, tests, whatever – the best choice depends a lot on the problem domain. But the worst default is no feedback loop
Ping-pong:
People adapt to your style very rapidly, even within a single game. Learn 2-3 complementary styles and switch them up when somebody gets used to one
Friendship:
Set up easy, default ways to interact with your friends, such as getting weekly coffees, making it easy for them to visit, hosting board game nights etc.
Take notes on what your friends like
When your friends have persistent problems, take notes on what they’ve tried. When you hear something they haven’t tried, recommend it. This is practical, and the fact that you’ve customized the recommendation is generally appreciated.
Conversations:
Realize that small amounts of awkwardness, silence etc. are generally not a problem. I was implicitly following a strategy that tried to absolutely minimize awkwardness for a long time, which was a bad idea
using vector syntax is much faster than loops in Python
To generalize this slightly, using Python to call C/C++ is generally much faster than pure Python. For example, built-in operations in Pandas tend to be pretty fast, while using .apply() is usually pretty slow.
I didn’t know about that, thanks!
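To give the Pandas point above concrete shape (a self-contained sketch; the column names and 10,000-row size are made up):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
df = pd.DataFrame({"price": rng.random(10_000),
                   "qty": rng.integers(1, 10, 10_000)})

# Vectorized: a single call that runs in Pandas/NumPy's compiled internals.
fast = df["price"] * df["qty"]

# Row-wise .apply(): a Python-level function call per row; typically
# orders of magnitude slower on large frames.
slow = df.apply(lambda row: row["price"] * row["qty"], axis=1)

assert np.allclose(fast, slow)
```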
I found Loop Hero much better with higher speed, which you can fix by modifying a variables.ini file: https://www.pcinvasion.com/loop-hero-speed-mod/