I generally agree with the problem described, and I agree that “small amount of well-defined failure modes” is a necessary condition for the error codes to be useful. But that doesn’t really tell us how to come up with a good set of errors. I’ll suggest a more constructive error ontology.
When an error occurs, the programmer using the library mostly needs to know:
Is it my mistake, a bug in the library, or a hardware-level problem (e.g. connection issue)?
If it’s my mistake, what did I do wrong?
Why these questions? Because these are the questions which determine what the programmer needs to do next. If you really want to keep the list of errors absolutely minimal, then three errors is not a bad starting point: bad input, internal bug, hardware issue. Many libraries won’t even need all of these—e.g. non-network libraries probably don’t need to worry about hardware issues at all.
Which of the three categories can benefit from more info, and what kind of additional info?
First, it is almost never a good idea to give more info on internal bugs, other than logging it somewhere for the library’s maintainers to look at. Users of the library will very rarely care about why the library is broken; simply establish that it is indeed a bug and then move on.
For hardware problems, a bad connection is probably the most ubiquitous. The user mostly just needs to know whether it’s really a bad connection (e.g. Comcast having a bad day) or really the user’s mistake (e.g. entering the wrong credentials). Most libraries probably need at most one actual hardware error, but user mistakes masquerading as hardware problems are worth looking out for separately.
That just leaves user mistakes, a.k.a. bad inputs. This is the one category where it makes sense to give plenty of detail, because the user needs to know what to fix. Of course, communication is a central problem here: the whole point of this class of errors is to communicate to the programmer exactly how their input is flawed. So, undocumented numerical codes aren’t really going to help.
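A minimal sketch of what that three-way split could look like in Python. All names here (LibraryError, BadInputError, InternalBugError, HardwareError, set_timeout) are hypothetical and only illustrate the idea; they are not taken from any real library.

```python
class LibraryError(Exception):
    """Base class for every error the (hypothetical) library raises."""


class BadInputError(LibraryError):
    """The caller's mistake; the message should say exactly what to fix."""

    def __init__(self, parameter, problem, hint):
        self.parameter = parameter
        self.problem = problem
        self.hint = hint
        super().__init__(f"bad input for {parameter!r}: {problem}. {hint}")


class InternalBugError(LibraryError):
    """A bug in the library; details belong in the maintainers' log."""


class HardwareError(LibraryError):
    """A connection or other hardware-level problem outside anyone's code."""


def set_timeout(seconds):
    """Hypothetical library call, used only to show a detailed bad-input error."""
    if not isinstance(seconds, (int, float)) or seconds <= 0:
        raise BadInputError("seconds", f"got {seconds!r}", "Pass a positive number.")
    return float(seconds)
```

Only the bad-input class carries structured detail, since that is the one category where the caller can act on it; the other two are deliberately bare.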
(Amusingly, when I hit “submit” for this comment, I got “Network error: Failed to fetch”. This error did its job: I immediately knew what the problem was, and what I needed to do to fix it.)
The original piece continues where this post leaves off to discuss how this logic applies inside the firm. The main takeaway there is that most firms do not have competitive internal resource markets, so each part of the company usually optimizes for some imperfect metric. The better those metrics approximate profit in competitive markets, the closer the company comes to maximizing overall profit. This model is harder to quantify, but we can predict that e.g. deep production pipelines will be less efficient than broad pipelines.
I’m still writing the piece on non-equilibrium markets. The information we get on how the market is out of equilibrium is rather odd, and doesn’t neatly map to any other algorithm I know. The closest analogue would be message-passing algorithms for updating a Bayes net when new data comes in, but that analogy is more aesthetic than formal.
“Price = derivative” is certainly well-known. I haven’t seen anyone else extend the connection to backprop before, but there’s no way I’m the first person to think of it.
Ok, that sounds right.
At what point is the data used?
One hypothesis for why current hiring practices seem not-very-good: there’s usually no feedback mechanism. There are sometimes obvious cases, where a hire ended up being really good or really bad, but there’s no fine-grained way to measure how someone is doing—let alone how much value they add to the organization.
Any prediction market proposal to fix hiring first needs to solve that problem. You need a metric for performance, so you have a ground truth to use for determining bet pay-offs. And to work in practice, that metric also needs to get around Goodhart’s Law somehow. (See here for a mathy explanation of roughly this problem.)
Now for the flip side: if we had an accurate, Goodhart-proof metric for employee performance, then we probably wouldn’t need a fancy prediction market to utilize it. Don’t get me wrong, a prediction market would be a very fast and efficient way to incorporate all the relevant info. But even a traditional HR department can probably figure out what they need to do in order to improve their metric, once they have a metric to improve.
That sampling method sounds like it should work, assuming it’s all implemented correctly (not sure what method you’re using to sample from the posterior distribution of μ, σ).
Worst case in a million being dominated by parameter uncertainty definitely makes sense, given the small sample size and the rate at which those distributions fall off.
Having ~ten data points makes this way more interesting. That’s exactly the kind of problem that I specialize in.
For the log-normal distribution, it should be possible to do the integral for P(data|model) explicitly. The integral is tractable for the normal distribution—it comes out proportional to a power of the sample variance—so just log-transform the data and use that. If you write down the integral for normal-distributed data explicitly and plug it into wolframalpha or something, it should be able to handle it. That would circumvent needing to sample μ and σ.
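For reference, here is one way that integral works out. This is a sketch under assumptions of my own rather than anything fixed by the problem: a flat prior on μ and a 1/σ prior on σ (both improper, so the overall constant only means something if the same convention is used for every model being compared), y_i = log x_i, at least two data points, ȳ the sample mean of the y_i, and s² their sample variance with a 1/n normalization.

```latex
\begin{aligned}
P(\text{data} \mid \text{log-normal})
  &= \prod_{i=1}^{n} \frac{1}{x_i}
     \int_0^\infty \!\! \int_{-\infty}^{\infty}
     (2\pi\sigma^2)^{-n/2}
     \exp\!\left(-\frac{n s^2 + n(\bar{y}-\mu)^2}{2\sigma^2}\right)
     d\mu \, \frac{d\sigma}{\sigma} \\
  &= \prod_{i=1}^{n} \frac{1}{x_i} \cdot
     \frac{\Gamma\!\left(\frac{n-1}{2}\right)}{2\sqrt{n}}
     \left(\pi n s^2\right)^{-(n-1)/2}
\end{aligned}
```

The 1/x_i factors are the Jacobian of the log transform; they matter when comparing against models fit on the untransformed lifetimes.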
I don’t know if there’s a corresponding closed form for Birnbaum-Saunders; I had never even heard of it before this. The problem is still sufficiently low-dimensional that it would definitely be tractable computationally, but it would probably be a fair bit of work to code.
I’ll walk through how I’d analyze this problem; let me know if I haven’t answered your questions by the end.
First, problem structure. You have three unknowns, which you want to estimate from the data: a shape parameter, a scale parameter, and an indicator of which model is correct.
“Which model is correct” is probably easiest to start with. I’m not completely sure that I’ve followed what your spreadsheet is doing, but if I understand correctly, then that’s probably an overly-complicated way to tackle the problem. You want P(model|data), which will be determined via Bayes’ Rule by P(data|model) and your prior probability for each model being correct. The prior is unlikely to matter much unless your dataset is tiny, so P(data|model) is the important part. That’s an integral:
P(data|model) = ∫ P(data|μ,σ,model) dP(μ,σ)
In your case, you’re approximating that integral with a grid over μ and σ (dP(μ,σ) is a shorthand for ρ(μ,σ)dμdσ here). Rather than whatever you’re doing with timesteps, you can probably just take the product of P(x_i|μ,σ,model), where x_i is the lifetime of the ith component in your dataset, then sum over the grid. (If you are dealing with online data streaming in, then you would need to do the timestep thing.)
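A rough sketch of that grid sum, using only the standard library. The lifetimes, the grid ranges, and the uniform prior over grid points are all made up for illustration. For the Weibull model the two grid axes are scale and shape rather than μ and σ. The work is done in log space so the product over data points doesn’t underflow.

```python
import math


def lognormal_logpdf(x, mu, sigma):
    """Log density of a log-normal with log-mean mu and log-std sigma."""
    z = (math.log(x) - mu) / sigma
    return -math.log(x * sigma * math.sqrt(2 * math.pi)) - 0.5 * z * z


def weibull_logpdf(x, scale, shape):
    """Log density of a Weibull with the given scale and shape."""
    r = x / scale
    return math.log(shape / scale) + (shape - 1) * math.log(r) - r ** shape


def linspace(lo, hi, n):
    """n evenly spaced points from lo to hi inclusive (n >= 2)."""
    return [lo + (hi - lo) * i / (n - 1) for i in range(n)]


def log_evidence(data, logpdf, grid_a, grid_b):
    """Grid approximation of log P(data|model), uniform prior over grid points."""
    terms = []
    for a in grid_a:
        for b in grid_b:
            # Log of the product of P(x_i | a, b, model) over the dataset.
            terms.append(sum(logpdf(x, a, b) for x in data))
    top = max(terms)
    # Log-sum-exp over the grid, then average (uniform prior weights).
    return top + math.log(sum(math.exp(t - top) for t in terms)) - math.log(len(terms))


def model_posterior(log_evidences):
    """P(model|data) assuming equal prior probability for each model."""
    top = max(log_evidences.values())
    weights = {m: math.exp(v - top) for m, v in log_evidences.items()}
    total = sum(weights.values())
    return {m: w / total for m, w in weights.items()}


# Made-up lifetimes, purely for illustration.
data = [112.0, 95.0, 130.0, 101.0, 88.0, 143.0, 120.0, 99.0, 107.0, 125.0]
log_ev = {
    "lognormal": log_evidence(data, lognormal_logpdf,
                              linspace(4.0, 5.5, 40), linspace(0.05, 1.0, 40)),
    "weibull": log_evidence(data, weibull_logpdf,
                            linspace(50.0, 200.0, 40), linspace(0.5, 15.0, 40)),
}
posterior = model_posterior(log_ev)
```

With a dataset this small the grid ranges act as the prior and can visibly move the answer, so they deserve some thought rather than being copied from here.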
That takes care of the model part. Once that’s done, you’ll probably find that one model is like a gazillion times more likely than the other two, and you can just throw out the other two.
On to the 95% CI for the worst part in a million.
The distribution you’re interested in here is P(x_worst|model,data). x_worst is an order statistic. Its CDF is basically the CDF for any old point x raised to the power of 1000000; read up on it to see exactly what expression to use. So if we wanted to do this analytically, we’d first compute P(x|model,data) via Bayes’ Rule:
P(x|model,data) = P(x,data|model) / P(data|model)
… where both pieces on the right would involve our integral from earlier. Basically, you imagine adding one more point x to the dataset and see what that would do to P(data|model). If we had a closed-form expression for that distribution, then we could just raise the CDF to the millionth power, we’d get a closed-form expression for the millionth order statistic, and from there we’d get a 95% CI in the usual way.
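For reference, the standard order-statistic expressions, written under two assumptions of mine: the n = 10^6 points are treated as independent draws with a common CDF F(x) = P(X ≤ x | model, data), and “worst” means the shortest lifetime.

```latex
\begin{aligned}
F_{\max}(x) &= P\big(\max_i X_i \le x\big) = F(x)^{n} \\
F_{\min}(x) &= P\big(\min_i X_i \le x\big) = 1 - \big(1 - F(x)\big)^{n}
\end{aligned}
```

So for the shortest lifetime it is the survival function 1 − F(x), rather than the CDF itself, that gets raised to the millionth power.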
In practice, that’s probably difficult, so let’s talk about how to approximate it numerically.
First, the order statistic part. As long as we can sample from the posterior distribution P(x|model,data), that part’s easy: generate a million samples of x, take the worst, and you have a sample of x_worst. Repeat that process a bunch of times to compute the 95% CI. (This is not the same as the worst component in 20M, but it’s not any harder to code up.)
Next, the posterior distribution for x. This is going to be driven by two pieces: uncertainty in the parameters μ and σ, and random noise from the distribution itself. If the dataset is large enough, then the uncertainty in μ and σ will be small, so the distribution itself will be the dominant term. In that case, we can just find the best-fit (i.e. maximum a-posteriori) estimates of μ and σ, and then declare that P(x|model,data) is approximately the standard distribution (Weibull, log-normal, etc) with those exact parameters. Presumably we can sample from any of those distributions with known parameters, so we go do the order statistic part and we’re done.
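Here is a small sketch of the plug-in version just described, assuming the log-normal model won and treating “worst” as the shortest lifetime. The parameter values are hypothetical stand-ins for best-fit estimates, and n is kept far below a million so the demo runs quickly in pure Python.

```python
import random


def sample_worst(n, mu, sigma, rng):
    """One draw of the shortest lifetime among n log-normal components."""
    return min(rng.lognormvariate(mu, sigma) for _ in range(n))


def worst_interval(n, mu, sigma, repeats, rng, level=0.95):
    """Empirical central interval for the worst of n, from repeated draws."""
    draws = sorted(sample_worst(n, mu, sigma, rng) for _ in range(repeats))
    tail = (1 - level) / 2
    lo = draws[int(tail * (repeats - 1))]
    hi = draws[int((1 - tail) * (repeats - 1))]
    return lo, hi


rng = random.Random(0)  # fixed seed so the demo is repeatable
# Hypothetical plug-in estimates of mu and sigma; n=1000 instead of 1_000_000.
low, high = worst_interval(n=1000, mu=4.7, sigma=0.15, repeats=200, rng=rng)
```

Scaling n up to a million is the same code, only slower. A vectorized sampler, or inverting the order-statistic CDF directly, would be the natural speedups.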
If the uncertainty in μ and σ is not small enough to ignore, then the problem gets more complicated—we’ll need to sample from the posterior distribution P(μ,σ|model,data). At that point we’re in the land of Laplace approximation and MCMC and all that jazz; I’m not going to walk through it here, because this comment is already really long.
So one last thing to wrap it up. I wrote all that out because it’s a great example problem of how to Bayes, but there’s still a big problem at the model level: the lifetime of the millionth-worst component is probably driven by qualitatively different processes than the vast majority of other components. If some weird thing happens one time in 10000, and causes problems in the components, then a best-fit model of the whole dataset probably won’t pick it up at all. Nice-looking distributions like Weibull or log-normal just aren’t good at modelling two qualitatively different behaviors mixed into the same dataset. There’s probably some standard formal way of dealing with this kind of thing—I hear that “Rare Event Modelling” is a thing, although I know nothing about it—but the fundamental problem is just getting any relevant data at all. If we only have a hundred thousand data points, and we think that millionth-worst is driven by qualitatively different processes, then we have zero data on the millionth-worst, full stop. On the other hand, if we have a billion data points, then we can just throw out all but the worst few thousand and analyse only those.
A bit more explanation on what the Kelly Criterion is, for those who haven’t seen it before: suppose you’re making a long series of independent bets, one after another. They don’t have to be IID, just independent. The key insight is that the long-run payoff will be the product of the payoffs of the individual bets. So, by the law of large numbers, the logarithm of the long-run payoff will converge to the average expected logarithm of the individual payoffs times the number of bets.
This leads to a simple statement of the Kelly Criterion: to maximize long-run growth, maximize the expected logarithm of the return of each bet. It’s quite general: all we need is multiplicative returns and some version of the law of large numbers.
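To make that concrete, here is a sketch for the simplest case: a single repeated bet that is won with probability p and pays b-to-1, with the stake lost otherwise. The bet itself and the numbers are my own illustrative assumptions. The code maximizes the expected log return by brute force and compares the result with the textbook closed form for this bet.

```python
import math


def expected_log_growth(f, p, b):
    """Expected log return when staking fraction f of the bankroll on a bet
    won with probability p that pays b-to-1 (the stake is lost otherwise)."""
    return p * math.log(1 + b * f) + (1 - p) * math.log(1 - f)


def kelly_fraction_grid(p, b, steps=10000):
    """Maximize expected log growth by brute force over f in [0, 1)."""
    candidates = [i / steps for i in range(steps)]
    return max(candidates, key=lambda f: expected_log_growth(f, p, b))


p, b = 0.6, 1.0                 # hypothetical bet: 60% win chance at even odds
f_grid = kelly_fraction_grid(p, b)
f_closed = p - (1 - p) / b      # closed form for this simple bet
```

Both numbers should land close to 0.2, i.e. stake about a fifth of the bankroll each round.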
I’m not really convinced by this argument. Yes, Newcomen’s specific design needed precise manufacturing capability. But I would expect that, if there had been demand for steam engines earlier, someone would have found a design which could work with lower-precision manufacturing. Newcomen just used what was available.
Also, I intended Newcomen as an example of an early steam engine which failed to catch on, because it wasn’t very profitable yet.
Test is easy: have the inputs become cheaper and/or the outputs become more expensive, compared to alternative technologies? In other words, is it more profitable now?
I’ve been chewing on that one a lot. I don’t have a satisfying answer yet. The sheer size/density of the population is one hypothesis, and crop yields are another (rice vs wheat). But I don’t feel like I understand it yet.
Here’s an alternative hypothesis for why the Chinese didn’t adopt the press, even after the introduction of paper. It also explains why the Chinese didn’t adopt wind/water mills, artillery, the slave trade, and ultimately automation: the cost of capital relative to labor was much higher in China than Europe. Across the board, we see much lower Chinese adoption of capital-intensive technology in favor of labor-intensive alternatives, even when the technical prerequisites were met centuries earlier.
Yes! I was thinking about adding a couple paragraphs about this, but couldn’t figure out how to word it quite right.
When you’re trying to create solid theories de novo, a huge part of it is finding people who’ve done a bunch of experiments with it, looking at the outcomes, and paying really close attention to the places where they don’t match your existing theory. Elinor Ostrom is one of the best examples I know: she won a Nobel in economics for basically saying “ok, how do people actually solve commons problems in practice, and does it make sense from an economic perspective?”
In the case of a wheel with weights on it, that’s been nailed down really well already by generations of physicists, so it’s not a very good example for theory-generation.
But one important aspect does carry over: you have to actually do the math, to see what the theory actually predicts. Otherwise, you won’t notice when the experimental outcomes don’t match, so you won’t know that the theory is incomplete.
Even in the wheel example, I’d bet a lot of physics-savvy people would just start from “oh, all that matters here is moment of inertia”, without realizing that it’s possible to shift the initial gravitational potential. But if you try a few random configurations, and actually calculate how fast you expect them to go, then you’ll notice very quickly that the theory is incomplete.
I think this is related to a general class of mistakes, so I just wrote up a post on it.
This case is a bit different from what that post discusses, in that you’re not focused on a non-critical assumption, but on a non-critical method. We can use VNM rationality for decision-making just fine without computing full utilities for every decision; we just need to compute enough to be confident that we’re making the higher-utility choice. For that purpose we can use tricks like e.g. changing the unit of valuation on the fly, making approximations (as long as we keep track of the error bars), etc.