Rating each decision on a scale of 1 to 10 and then taking a weighted average is a recipe for biasing the result against intervention. The scale puts a hard upper limit on how much an intervention can count as helping: a successful intervention counts as 10, and a successful intervention that does even more good still can’t count as more than 10. (There is a similar problem at the low end of the scale, but that doesn’t affect the final result since you can’t go below zero intervention.)
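To make the ceiling effect concrete, here is a minimal sketch. The benefit figures and the equal weights are made up for illustration, and `to_scale` and `weighted_average` are hypothetical helpers, not part of the method being criticized.

```python
# Hypothetical "true" benefit of three past interventions on an open-ended scale.
true_benefit = [4.0, 6.0, 40.0]
# Hypothetical similarity weights for each past case (equal here for simplicity).
weights = [1.0, 1.0, 1.0]


def weighted_average(values, weights):
    """Weighted mean of values."""
    return sum(v * w for v, w in zip(values, weights)) / sum(weights)


def to_scale(value, low=1.0, high=10.0):
    """Force a value onto the 1-to-10 scale by clamping it at both ends."""
    return max(low, min(high, value))


uncapped = weighted_average(true_benefit, weights)
capped = weighted_average([to_scale(b) for b in true_benefit], weights)

print(f"uncapped average: {uncapped:.2f}")  # 16.67
print(f"capped average:   {capped:.2f}")    # 6.67
```

The one very successful case is squashed from 40 down to 10, so the average drops from about 16.67 to about 6.67 even though nothing about the cases themselves changed.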
This also produces bad results in cases where the intervention failed because it was insufficient. You’d end up concluding that intervention is bad when it may just be that insufficient intervention is bad. This method has clause 2 to cover similarity of case, but not similarity of intervention, and at any rate “similarity” is a fuzzy concept. If bombing half a country is a disaster and bombing the whole country succeeds, is bombing half a country “similar” to bombing a whole country? (In practice, you usually end up compressing the entire dispute over intervention into a dispute over how similar two cases are.)
And it’s generally a bad idea to put things on a numerical scale when you can’t actually measure them numerically. It gives a false appearance of accuracy and precision, like a company executive who wants to see figures for his company improve but doesn’t actually care where the figures come from.
Also, “level of improvement created” is subject to noise. It is possible for an intervention to fail to produce improvement for reasons unrelated to its effectiveness, like if the country gets hit by a meteor the next day (or more realistically, gets invaded or attacked the next day).
Basically one huge problem here is that there isn’t enough data compared to the number of variables involved.
Not to mention that this is a problem in what Taleb would call Extremistan, i.e., the distributions of possible outcomes from intervening, or not intervening, are fat-tailed and include a lot of rare possibilities that haven’t yet shown up in the data at all.
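A rough sketch of why the missing rare outcomes matter, using a toy two-outcome distribution. The payoffs, the probability of the rare outcome, and the number of historical cases are all assumptions chosen for illustration, not estimates of anything real.

```python
# Toy model (all numbers hypothetical): an intervention usually gives a small
# gain, but rarely produces a catastrophe.
p_rare = 0.01           # assumed probability of the rare catastrophic outcome
common_payoff = 1.0     # assumed payoff in the usual case
rare_payoff = -1000.0   # assumed payoff in the rare case
n_cases = 20            # assumed number of historical cases available

# True expected payoff of intervening under this toy model.
true_mean = (1 - p_rare) * common_payoff + p_rare * rare_payoff

# Probability that none of the historical cases contains the rare outcome,
# in which case every observed case looks like a small success.
p_no_rare_in_data = (1 - p_rare) ** n_cases

print(f"true expected payoff: {true_mean:.2f}")
print(f"chance the data shows only successes: {p_no_rare_in_data:.2f}")
```

Under these assumptions the true expected payoff is negative, yet most of the time a record of 20 cases would contain nothing but small successes, so an average over the record would point the wrong way. The same argument runs in reverse for rare large benefits.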