Confidence intervals seem to be rarely useful, in and of themselves

Context: I read Eliezer Yudkowsky’s post “Frequentist statistics are frequently subjective”, and it inspired me to get a better first-principles understanding of some statistical concepts. These are my personal thoughts on the topic. Don’t believe me; if I had attached confidence intervals to the results of my investigation, they would be very broad!

Confidence intervals frequently cause confusion. For example, the U.S. National Institutes of Health (NIH) claim in their course on “Finding and using health statistics”:

Confidence intervals are frequently reported in scientific literature and indicate how close research results are to reality, or how reliable they are, based on statistical theory. The confidence interval uses the sample to estimate the interval of probable values of the population; the parameters of the population.

For example, if a study is 95% reliable, with a confidence interval of 47-53, that means if researchers did the same study over and over and over again with samples of the whole population, they would get results between 47 and 53 exactly 95% of the time. The reliability in this example refers to the consistency of the measurement, or the ability to repeat it. Poor reliability can happen with a small population, or if the health event being studied does not happen often or at regular times.

(emphasis added)

This paragraph left me confused, but I believe the statement is wrong. To see why, let’s first give a definition of a confidence interval.

Let’s assume a situation where we want to determine some value $\theta$ that tells us something important about the world (the global mean temperature, the ratio of defective items in a sample of products, …). So we design some experiment resulting in a vector $X$ of measurable outcomes, which we model as random variables. The distribution of $X$ depends in an assumed-to-be-known way on $\theta$.

Does the outcome of the experiment, i.e. the realization $x$ of $X$ rather than its distribution, tell us something about $\theta$? Not with certainty. However, we can use strategies that will often succeed in giving us true information about $\theta$.

When I say “confidence interval” for $\theta$ inferred from $X$, this isn’t a canonically well-defined concept in and of itself. A method to assign confidence intervals is a function $c$ that maps each realization $x$ of $X$ to an interval of real numbers (e.g., $[0.2, 0.4]$) and that satisfies the following property: for every theoretically possible value of $\theta$ and the corresponding assumed distribution of $X$, we get $P_\theta(\theta \in c(X)) \geq \gamma$, where $\gamma$ is some probability called the confidence level of $c$. (Note that $\theta$ is a fixed number and not a random variable!)
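To make the defining property tangible, here is a minimal sketch in Python (my own illustration, not from the original post). It uses the binomial survey setting of the example below and picks the textbook Clopper–Pearson interval as one possible method $c$; the helper name and the Monte Carlo check are assumptions of the sketch.

```python
import numpy as np
from scipy.stats import beta

def clopper_pearson(x, n, conf=0.95):
    """Exact (Clopper-Pearson) interval for a binomial proportion -- one possible
    method c.  It guarantees coverage >= conf for every value of theta."""
    alpha = 1 - conf
    lo = 0.0 if x == 0 else beta.ppf(alpha / 2, x, n - x + 1)
    hi = 1.0 if x == n else beta.ppf(1 - alpha / 2, x + 1, n - x)
    return lo, hi

# Empirically check the defining property: for each fixed theta, the realized
# interval c(X) contains theta in at least ~95% of repeated experiments.
rng = np.random.default_rng(0)
n, trials = 100, 20_000
intervals = [clopper_pearson(x, n) for x in range(n + 1)]  # c(x) for every possible outcome
for theta in [0.1, 0.3, 0.5, 0.7, 0.9]:
    xs = rng.binomial(n, theta, size=trials)
    covered = np.mean([intervals[x][0] <= theta <= intervals[x][1] for x in xs])
    print(f"theta = {theta}: empirical coverage = {covered:.3f}")
```

Note that the guarantee is about repeated experiments for a fixed $\theta$, not about any single reported interval.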

There can be many different methods to assign confidence intervals, and given a certain realization $x$ of $X$ and an interval $I$, there is almost always a way to construct a method $c$ to assign confidence intervals with $c(x) = I$. (Eliezer Yudkowsky gives a humorous illustration of this fact, pointing out that there are even methods that sometimes return [“Cheesecake”–“Cheddar”] and can still be counted as methods to assign confidence intervals, if I understand him correctly.)

So, assume I want to create a start-up that sells food, but just one kind of food to achieve lower costs. To decide which food to ramp up production of, I run a study assessing what ratio $\theta$ of U.S. citizens believe that sandwiches are tastier than tomato soup. I ask 100 people, selected uniformly at random, for their opinion. Depending on $\theta$, the number $X$ of sandwich fans among those 100 respondents is approximately $\mathrm{Bin}(100, \theta)$-distributed. I use some method $c$ to assign confidence intervals with confidence level 95%, and this method returns an interval $c(x)$.

What does this tell me?

Consider the case that $c(x)$ is something like $[0.9, 1]$. This seems to be strong (95%) confidence that people virtually always prefer sandwiches over tomato soup. I decide to invest all my money into sandwich production pipelines.

Now consider the case that $c(x)$ is something like $[0.2, 0.8]$. Unfortunately, this confidence interval doesn’t seem to tell me enough to justify any investment, so I discard my start-up idea.

If there’s high confidence that people prefer either sandwiches or tomato soup, I go with the start-up; otherwise I do nothing. This algorithm seems fairly reasonable, doesn’t it?
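Spelled out as code, the decision rule might look like the following sketch (the 0.9 / 0.1 cut-offs are my own arbitrary choice for illustration):

```python
def decide(interval):
    """Decision rule based only on the reported 95% confidence interval.
    The 0.9 / 0.1 cut-offs are an arbitrary illustrative choice."""
    lo, hi = interval
    if lo >= 0.9:                       # strong evidence for sandwiches
        return "invest in sandwich production"
    if hi <= 0.1:                       # strong evidence for tomato soup
        return "invest in tomato soup production"
    return "do nothing"

print(decide((0.9, 1.0)))   # -> invest in sandwich production
print(decide((0.2, 0.8)))   # -> do nothing
```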

Now, it can happen that $c$ has a strange property: high effect sizes, as expressed by intervals like $[0.9, 1]$ or $[0, 0.1]$ and so on, are only ever returned when $\theta = 0.5$ (as long as this doesn’t happen too often, it’s perfectly compatible with the definition above). In this case, whenever I decide to invest in the start-up, the American population is in fact evenly divided between sandwiches and tomato soup, and my start-up will fail. So the expected profit is negative!
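Here is a sketch of such a perverse-but-valid method. One caveat: I assume a larger survey of 1,000 respondents instead of 100, because with 100 respondents the single outcome $x = 50$ already has about 8% probability under $\theta = 0.5$, which would eat more than the 5% error budget of a 95% method; with 1,000 respondents the same trick works comfortably. The construction itself (a “trap” at the perfectly even split, and the trivial interval $[0, 1]$ otherwise) is my own illustration.

```python
import numpy as np
from scipy.stats import binom

n = 1000  # assumed survey size for this sketch (see the caveat above)

def perverse_interval(x):
    """A deliberately misleading, yet formally valid, 95% confidence method:
    it reports a dramatic "effect" exactly when the data look like a 50/50 split,
    and an uninformative interval otherwise."""
    if x == n // 2:
        return (0.9, 1.0)   # "people virtually always prefer sandwiches!"
    return (0.0, 1.0)       # trivially true, tells us nothing

# 1) The coverage guarantee holds: the only way to miss theta is to observe
#    x == 500 while theta lies outside [0.9, 1], and P(X == 500) never exceeds
#    ~2.5% (the maximum is attained at theta = 0.5).
thetas = np.linspace(0.001, 0.999, 999)
worst_coverage = min(
    1.0 if 0.9 <= t <= 1.0 else 1.0 - binom.pmf(n // 2, n, t)
    for t in thetas
)
print(f"worst-case coverage on the grid: {worst_coverage:.4f}")  # ~0.975, i.e. >= 0.95

# 2) Yet the "invest" signal [0.9, 1] practically only fires when theta is 0.5,
#    i.e. exactly when the start-up is doomed.
for theta in [0.5, 0.6, 0.9, 0.95]:
    print(f"theta = {theta}: P(interval [0.9, 1] is returned) = {binom.pmf(n // 2, n, theta):.2e}")
```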

What happened? Well, by definition, the method of attaching confidence intervals only guarantees that it fails to assign a correct interval at most one time out of twenty. These 5% of cases cause serious costs. The remaining 95% of cases, where the confidence interval is correct, don’t really help, because in those cases the interval isn’t clearly in favor of either a low or a high value of $\theta$, causing me to refrain from investing and neither lose nor win anything. (This shouldn’t be possible if the NIH’s claim were true!)

Stated differently, how much evidence a given confidence interval carries can depend heavily on the choice of the method of attaching confidence intervals. In other words, a plain confidence interval is no evidence for any value of $\theta$ unless we know the (if you’re a Bayesian: conditional) distribution of $X$ given $\theta$, because this would allow us to compute odds ratios. But then we wouldn’t be using the definition of a confidence interval anymore.
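For a concrete (made-up) instance of the kind of calculation the sampling distribution enables, the likelihood ratio below compares an evenly split population with a sandwich-dominated one, given an exactly even split among the 100 respondents of the running example:

```python
from scipy.stats import binom

# Hypothetical data: an exactly even split among the 100 respondents.
x_observed, n = 50, 100

theta_even, theta_sandwich = 0.5, 0.95
likelihood_ratio = binom.pmf(x_observed, n, theta_even) / binom.pmf(x_observed, n, theta_sandwich)
print(f"P(x | theta=0.5) / P(x | theta=0.95) = {likelihood_ratio:.3g}")
# The ratio is astronomically large: the raw data overwhelmingly favor an evenly
# split population, no matter what interval any particular method happens to report.
```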

What I conclude from this anomaly is (1) that confidence intervals don’t err most of the time (e.g. only in 5% of experiments), and (2) that one has to be horribly cautious when inferring anything practical from them, since doing so means making one’s own behavior dependent on the confidence intervals, which aren’t evidence per se. However, confidence intervals are quite handy (and better than a point estimate!) from a practical perspective, and I haven’t settled on a final conclusion about when they are appropriate.

I’d love to hear your opinions on this anomaly! I’d also appreciate your criticism if I’ve overlooked something or made a mistake.