The Optimizer’s Curse and How to Beat It

lukeprog 16 Sep 2011 2:46 UTC

99 points

The best laid schemes of mice and men
Go often askew,
And leave us nothing but grief and pain,
For promised joy!

Consider the following question:

A team of decision analysts has just presented the results of a complex analysis to the executive responsible for making the decision. The analysts recommend making an innovative investment and claim that, although the investment is not without risks, it has a large positive expected net present value… While the analysis seems fair and unbiased, she can’t help but feel a bit skeptical. Is her skepticism justified?¹

Or, suppose Holden Karnofsky of charity-evaluator GiveWell has been presented with a complex analysis of why an intervention that reduces existential risks from artificial intelligence has astronomical expected value and is therefore the type of intervention that should receive marginal philanthropic dollars. Holden feels skeptical about this ‘explicit estimated expected value’ approach; is his skepticism justified?

Suppose you’re a business executive considering n alternatives whose ‘true’ expected values are μ₁, …, μ_n. By ‘true’ expected value I mean the expected value you would calculate if you could devote unlimited time, money, and computational resources to making the expected value calculation.² But you only have three months and $50,000 with which to produce the estimate, and this limited study produces estimated expected values for the alternatives V₁, …, V_n.

Of course, you choose the alternative i* that has the highest estimated expected value V_i*. You implement the chosen alternative, and get the realized value x_i*.

Let’s call the difference x_i* - V_i* the ‘postdecision surprise’.³ A positive surprise means your option brought about more value than your analysis predicted; a negative surprise means you were disappointed.

Assume, too kindly, that your estimates are unbiased, and suppose you use this decision procedure many times, for many different decisions. It seems reasonable to expect that on average you will receive the estimated expected value of each decision you make in this way. Sometimes you’ll be positively surprised, sometimes negatively surprised, but on average you should get the estimated expected value for each decision.

Alas, this is not so; your outcome will usually be worse than what you predicted, even if your estimate was unbiased!

Why?

...consider a decision problem in which there are k choices, each of which has true estimated [expected value] of 0. Suppose that the error in each [expected value] estimate has zero mean and standard deviation of 1, shown as the bold curve [below]. Now, as we actually start to generate the estimates, some of the errors will be negative (pessimistic) and some will be positive (optimistic). Because we select the action with the highest [expected value] estimate, we are obviously favoring overly optimistic estimates, and that is the source of the bias… The curve in [the figure below] for k = 3 has a mean around 0.85, so the average disappointment will be about 85% of the standard deviation in [expected value] estimates. With more choices, extremely optimistic estimates are more likely to arise: for k = 30, the disappointment will be around twice the standard deviation in the estimates.⁴

This is “the optimizer’s curse.” See Smith & Winkler (2006) for the proof.
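The effect is easy to reproduce numerically. The following is a minimal simulation sketch of the setup in the quote, assuming normally distributed errors (the quote specifies only zero mean and unit standard deviation). The function name `mean_disappointment`, the trial count, and the seed are illustrative choices, not taken from the sources.

```python
import random
import statistics


def mean_disappointment(k, trials=20000, seed=0):
    """Average estimated value of the chosen option when all k options
    truly have value 0 and each estimate carries N(0, 1) error.

    Because every true value is 0, this average is also the average
    amount by which the chosen option's estimate overstates its value.
    """
    rng = random.Random(seed)
    chosen = []
    for _ in range(trials):
        estimates = [rng.gauss(0.0, 1.0) for _ in range(k)]
        chosen.append(max(estimates))  # pick the highest estimate
    return statistics.fmean(chosen)


if __name__ == "__main__":
    for k in (1, 3, 30):
        print(k, round(mean_disappointment(k), 2))
```

With a single option the average stays near 0, and it grows with k, in line with the figures of about 0.85 for k = 3 and about 2 for k = 30 quoted above.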

The Solution

The solution to the optimizer’s curse is rather straightforward.

...[we] model the uncertainty in the value estimates explicitly and use Bayesian methods to interpret these value estimates. Specifically, we assign a prior distribution on the vector of true values μ = (μ₁, …, μ_n) and describe the accuracy of the value estimates V = (V₁, …, V_n) by a conditional distribution V|μ. Then, rather than ranking alternatives based on the value estimates, after we have done the decision analysis and observed the value estimates V, we use Bayes’ rule to determine the posterior distribution for μ|V and rank and choose among alternatives based on the posterior means...

The key to overcoming the optimizer’s curse is conceptually very simple: treat the results of the analysis as uncertain and combine these results with prior estimates of value using Bayes’ rule before choosing an alternative. This process formally recognizes the uncertainty in value estimates and corrects for the bias that is built into the optimization process by adjusting high estimated values downward. To adjust values properly, we need to understand the degree of uncertainty in these estimates and in the true values.⁵

To return to our original question: Yes, some skepticism is justified when considering the option before you with the highest expected value. To minimize your prediction error, treat the results of your decision analysis as uncertain and use Bayes’ Theorem to combine its results with an appropriate prior.
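As an illustration of what combining an estimate with a prior can look like, here is a minimal sketch under an assumption the post does not make explicitly: each true value is drawn from a normal prior and each estimate adds independent normal error. In that special case the posterior mean is a weighted average of the prior mean and the estimate. The function and parameter names are hypothetical.

```python
def posterior_mean(estimate, noise_sd, prior_mean, prior_sd):
    """Posterior mean of the true value under a normal prior and
    normal estimation error (the normal-normal conjugate model).

    The estimate is pulled toward the prior mean; the noisier the
    estimate relative to the prior spread, the stronger the pull.
    """
    prior_var = prior_sd ** 2
    noise_var = noise_sd ** 2
    weight = prior_var / (prior_var + noise_var)  # trust placed in the estimate
    return prior_mean + weight * (estimate - prior_mean)


if __name__ == "__main__":
    # Same estimate, different accuracy: the noisy one is shrunk far more.
    print(posterior_mean(100.0, noise_sd=5.0, prior_mean=20.0, prior_sd=20.0))
    print(posterior_mean(100.0, noise_sd=40.0, prior_mean=20.0, prior_sd=20.0))
```

The hard part in practice is choosing the prior and the noise level; the formula itself only shows how those two inputs determine the size of the downward adjustment.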

Notes

¹ Smith & Winkler (2006).

² Lindley et al. (1979) and Lindley (1986) talk about ‘true’ expected values in this way.

³ Following Harrison & March (1984).

⁴ Quote and (adapted) image from Russell & Norvig (2009), pp. 618-619.

⁵ Smith & Winkler (2006).

References

Harrison & March (1984). Decision making and postdecision surprises. Administrative Science Quarterly, 29: 26–42.

Lindley, Tversky, & Brown (1979). On the reconciliation of probability assessments. Journal of the Royal Statistical Society, Series A, 142: 146–180.

Lindley (1986). The reconciliation of decision analyses. Operations Research, 34: 289–295.

Russell & Norvig (2009). Artificial Intelligence: A Modern Approach, Third Edition. Prentice Hall.

Smith & Winkler (2006). The optimizer’s curse: Skepticism and postdecision surprise in decision analysis. Management Science, 52: 311–322.


Vladimir_Nesov 16 Sep 2011 9:35 UTC
36 points
But all you’ve done after “adjusting” the expected value estimates was to produce a new batch of expected value estimates, which just shows that the original expected value estimates were not done very carefully (if there was an improvement), or that you face the same problem all over again...

Am I missing something?
- orthonormal 17 Sep 2011 13:42 UTC
  2 points
  Parent
  I’m thinking of this as “updating on whether I actually occupy the epistemic state that I think I occupy”, which one hopes would be less of a problem for a superintelligence than for a human.
  
  It reminds me of Yvain’s Confidence Levels Inside and Outside an Argument.
  - NancyLebovitz 17 Sep 2011 15:17 UTC
    3 points
    Parent
    I expect it to be a problem—probably as serious—for superintelligence. The universe will always be bigger and more complex than any model of it, and I’m pretty sure a mind can’t fully model itself.
    
    Superintelligences will presumably have epistemic problems we can’t understand, and probably better tools for working on them, but unless I’m missing something, there’s no way to make the problem go away.
    - orthonormal 17 Sep 2011 15:53 UTC
      2 points
      Parent
      Yeah, but at least it shouldn’t have all the subconscious signaling problems that compromise conscious reasoning in humans. At least I hope nobody would be dumb enough to build a superintelligence that deceives itself on account of social adaptations that don’t update when the context changes...
- EliasHasle 26 Apr 2023 10:44 UTC
  1 point
  Parent
  I must admit that I did not understand everything in the paper, but I think this excerpt summarizes a crucial point:
  “The key issue here is proper conditioning. The unbiasedness of the value estimates V_i discussed in §1 is unbiasedness conditional on μ. In contrast, we might think of the revised estimates v̂_i as being unbiased conditional on V. At the time we optimize and make the decision, we know V but we do not know μ, so proper conditioning dictates that we work with distributions and estimates conditional on V.”
  The proposed “solution” converts n independent evaluations into n evaluations (estimates) that respect the selection process, but, as far as I can tell, they still rest on prior value estimates and prior knowledge about the uncertainty of those estimates… Which means the “solution” at best limits introduction of optimizer bias, and at worst… masks old mistakes?
- CynicalOptimist 17 Nov 2016 18:28 UTC
  0 points
  Parent
  Well in some circumstances, this kind of reasoning would actually change the decision you make. For example, you might have one option with a high estimate and very high confidence, and another option with an even higher estimate, but lower confidence. After applying the approach described in the article, those two options might end up switching position in the rankings.
  
  BUT: Most of the time, I don’t think this approach will make you choose a different option. If all other factors are equal, then you’ll probably still pick the option that has the highest expected value. I think that what we learn from this article is more about something else: It’s about understanding that the final result will probably be lower than your supposedly “unbiased” estimate. And when you understand that, you can budget accordingly.
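A small numerical sketch of the rank switch described in the first paragraph of this comment, assuming a normal prior and normal estimation errors (my assumption, chosen because the posterior mean then has a closed form). All numbers and names are made up for illustration.

```python
def shrink(estimate, noise_sd, prior_mean=5.0, prior_sd=3.0):
    """Posterior mean under a normal prior and normal noise."""
    weight = prior_sd ** 2 / (prior_sd ** 2 + noise_sd ** 2)
    return prior_mean + weight * (estimate - prior_mean)


# Hypothetical options: (raw estimate, standard deviation of its error)
options = {
    "confident": (10.0, 1.0),
    "speculative": (12.0, 5.0),
}

raw_best = max(options, key=lambda name: options[name][0])
adjusted = {name: shrink(est, sd) for name, (est, sd) in options.items()}
adjusted_best = max(adjusted, key=adjusted.get)

if __name__ == "__main__":
    print("best by raw estimate:", raw_best)
    print("best after shrinkage:", adjusted_best)
```

When all options share the same noise level, the shrinkage is the same for each and the ranking is unchanged, which matches the second paragraph of the comment.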
  - EliasHasle 26 Apr 2023 9:25 UTC
    1 point
    Parent
    The big problem arises when the number of choices is huge and sparsely explored, such as when optimizing a neural network.
    But restricting ourselves to n superficially evaluated choices with known estimate variance in each evaluation and with independent errors/noise, then if – as in realistic cases like Monte Carlo Tree Search – we are allowed to perform some additional “measurements” to narrow down the uncertainty, it will be wise to scrutinize the high-expectation choices most – in a way trying to “falsify” their greatness, while increasing the certainty of their greatness if the falsification “fails”. This is the effect of using heuristics like the Upper Confidence Bound for experiment/branch selection.
    UCB is also described as “optimism in the face of uncertainty”, which kind of defeats the point I am making if it is deployed as a decision policy. What I mean is that in research, preparations and planning (with tree search in perfect information games as a formal example where UCB can be applied), one should put a lot of effort into finding out whether the seemingly best choice (of path, policy, etc.) really is that good, and then make a final choice that penalizes remaining uncertainty.
    I would like to throw in a Wikipedia article on a relevant topic, which I came across while reading about the related “Winner’s curse”: https://en.wikipedia.org/wiki/Order_statistic
    The math for order statistics is quite neat as long as the variables are independently sampled from the same distribution. In real life, “sadly”, choice evaluations may not always be from the same distribution… Rather, they are by definition conditional upon the choices. (https://en.wikipedia.org/wiki/Bapat%E2%80%93Beg_theorem provides a kind of solution in the form of an intractable colossus of a calculation.) That is not to say that there can be found no valuable/informative approximations.
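The distinction between optimistic exploration and a cautious final choice can be sketched as follows. This uses the standard UCB1 index (sample mean plus sqrt(2 ln t / n)) to decide what to measure next. The lower-bound rule for the final pick is only one possible way to penalize remaining uncertainty, and it is my assumption rather than something the comment specifies. The option names and numbers are hypothetical.

```python
import math


def ucb1_index(mean, pulls, total_pulls):
    """UCB1 score: optimistic, favors options that are under-sampled."""
    return mean + math.sqrt(2.0 * math.log(total_pulls) / pulls)


def lcb_index(mean, pulls, total_pulls):
    """Pessimistic mirror image, used here for the final decision."""
    return mean - math.sqrt(2.0 * math.log(total_pulls) / pulls)


def pick(stats, index):
    """stats maps option name -> (sample mean, number of measurements)."""
    total = sum(pulls for _, pulls in stats.values())
    return max(stats, key=lambda name: index(*stats[name], total))


if __name__ == "__main__":
    stats = {"well_tested": (0.60, 200), "promising": (0.70, 5)}
    print("measure next:", pick(stats, ucb1_index))
    print("final choice:", pick(stats, lcb_index))
```

The same statistics lead to different picks depending on the purpose: the barely tested option is the one to scrutinize further, while the well-tested one is the safer commitment.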
jsalvatier 16 Sep 2011 22:13 UTC
19 points
In statistics the solution you describe is called Hierarchical or Multilevel Modeling. You assume that your data is drawn from a set of distributions which have their parameters drawn from another distribution. This automatically shrinks your estimates of the distributions towards the mean. I think it’s a pretty useful trick to know and I think it would be good to do a writeup but I think you might need to have a decent grasp of Bayesian statistics first.
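A rough sketch of the shrinkage effect, using a simple empirical-Bayes shortcut rather than a full hierarchical model: the prior mean and spread are estimated from the batch of estimates itself by the method of moments, assuming normal distributions and a known, common noise level. These simplifications and all names are mine.

```python
import statistics


def shrink_toward_group_mean(estimates, noise_sd):
    """Shrink each noisy estimate toward the group mean.

    The variance of the observed estimates is roughly the variance of
    the true values plus the noise variance, so the spread of the true
    values is estimated by subtraction (floored at zero).
    """
    group_mean = statistics.fmean(estimates)
    observed_var = statistics.variance(estimates)
    true_var = max(observed_var - noise_sd ** 2, 0.0)
    weight = true_var / (true_var + noise_sd ** 2) if true_var > 0 else 0.0
    return [group_mean + weight * (v - group_mean) for v in estimates]


if __name__ == "__main__":
    # Hypothetical batch of estimates with an assumed noise level of 8
    batch = [53, 52, 37, 60, 44, 36, 56, 45, 54]
    print([round(v, 1) for v in shrink_toward_group_mean(batch, noise_sd=8.0)])
```

When the observed spread is no larger than the noise alone would produce, the weight drops to zero and every estimate collapses to the group mean.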
- Pagw 14 Feb 2017 10:05 UTC
  4 points
  Parent
  Here’s an example, with code, for anyone interested (it’s not by me, I add): http://sl8r000.github.io/ab_testing_statistics/use_a_hierarchical_model/
JoshuaZ 16 Sep 2011 3:01 UTC
15 points
The central point of the optimizer’s curse is not one I have seen before and is a very interesting point.

The solution however leaves me feeling slightly unhappy. It isn’t obvious to me what prior one should use in this sort of context. I suspect that a rough estimate by simply using the rule of thumb that the more complicated a logical chain the more likely there is a problem in it might do similar work at a weaker level.

Have you tried to apply this sort of reasoning explicitly to various existential risk considerations? If so, what did you get?
- gwern 16 Sep 2011 14:06 UTC
  21 points
  Parent
  
  The central point of the optimizer’s curse is not one I have seen before and is a very interesting point.
  
  Reminds me of the winner’s curse in auctions—the selected bid is the one that is the highest and so most likely to be due to overconfidence/bias.
  - malthrin 16 Sep 2011 15:21 UTC
    7 points
    Parent
    Yes, I recognized that similarity as well. As an aside, Fantasy Football (especially with an auction draft) is a great example to use when explaining these overestimation effects to laypeople.
    - lessdazed 16 Sep 2011 16:40 UTC
      7 points
      Parent
      
      They were talking about the Lottery. Winston looked back when he had gone thirty metres. They were still arguing, with vivid passionate faces. The Lottery, with its weekly payout of enormous prizes, was the one public event to which the proles paid serious attention. It was probable that there were some millions of proles for whom the Lottery was the principal if not the only reason for remaining alive. It was their delight, their folly, their anodyne, their intellectual stimulant. Where the Lottery was concerned, even people who could barely read and write seemed capable of intricate calculations and staggering feats of memory. There was a whole tribe of men who made a living simply by selling systems, forecasts and lucky amulets.
      
      --2001: A Space Odyssey (Homer, translated from ancient Latin)
      - malthrin 16 Sep 2011 19:58 UTC
        8 points
        Parent
        Interesting sourcing on that quote. I’m not sure what you meant to say with it, so I’ll elaborate.
        
        In fantasy sports, you begin by calculating an expected value for each player over the upcoming season. These values are used to construct your team in a draft, which is either turn-based (A picks a player, then B, then C) or auction-based (A, B, and C bid on players from a fixed initial pool of money). As the season goes on, you update your expected values with evidence from the past week’s games in order to decide which players will be active and accrue points for your fantasy team.
        
        The analogy should be obvious for most folks here. You’re combining evidence to form a probability (how good was he last season? Is the new coach’s game plan going to help or hurt his stats? Is he a particularly high injury risk?) and multiplying by utility to form a preference ranking. In an auction draft, the pricing mechanism even requires you to explicitly compute the expected utility values. When games are played, you update on evidence and revise your rankings.
        
        Most people have a hard time relating to decision theory because it doesn’t “feel like” what goes on in their head when they make decisions. Fantasy sports is a useful example because it makes the process explicit. I didn’t fully realize how good a fit it is before this conversation—maybe I should write up an introductory rationality piece on this foundation.
          - lessdazed 16 Sep 2011 20:51 UTC
            3 points
            Parent
            The quote is from Orwell’s 1984. The proles are generally ignorant, but good at tracking lottery numbers because it is a game. That’s right, I just generalized from fictional evidence!
            
            I figured if people are going to complain about the Burns quote, I’d give them something to really complain about. Wrong book with a date as a title, wrong author of an Odyssey, wrong language.
            
            Fantasy sports is a great example of where this would be useful, and I can’t think of a better analogy.
cousin_it 17 Sep 2011 14:10 UTC
13 points
Am I missing something, or does the post just say that we shouldn’t use frequentist “unbiased estimators” as if they were Bayesian posterior expected values?
- jsalvatier 17 Sep 2011 15:48 UTC
  5 points
  Parent
  Not quite. If you were to do individual Bayesian estimates you would have the same problem because there is shared prior information that would remain unmodeled.
  - cousin_it 17 Sep 2011 20:05 UTC
    6 points
    Parent
    Are you pointing out that each individual Bayesian estimate must be conditioned on all the information available, or is it more subtle than that?
    - jsalvatier 17 Sep 2011 23:31 UTC
      3 points
      Parent
      Nope, that’s it.
Mass_Driver 16 Sep 2011 7:07 UTC
11 points

consider a decision problem in which there are k choices, each of which has true estimated [expected value] of 0.

Lukeprog, if I’ve understood you correctly, then this is no good; this is a corner case. The question to be answered here is whether we should expect a “common sense” executive who favors plans with a high prior estimate to do better than a “technical” analyst who favors plans that perform well according to the formal estimation criteria. By assuming that all prior estimates are identical except for bias, this assumption ensures that the technical analyst will win. This, however, begs the question. One could just as easily assume that there is large variation in the true expected values, and that the formal criteria will always produce an estimate of 0, in which case the common sense executive will always win.

Am I missing something? I like the topic; I would enjoy reading about which approach we should expect to perform better in a typical situation.
- Nisan 16 Sep 2011 21:16 UTC
  10 points
  Parent
  I think the case where all the choices have a “true expected value” of 0 is picked out merely to illustrate the problem.
  - lukeprog 16 Sep 2011 23:08 UTC
    2 points
    Parent
    Yes.
    - Mass_Driver 17 Sep 2011 0:06 UTC
      4 points
      Parent
      That’s fine; you’re more than welcome to illustrate the problem, and your analysis does in fact do that. It does it very well; your writing, as always, is very lucid.
      
      However, you finish the article by claiming that Bayesian analysis can correct for the problem, and this is something that I don’t think you even begin to show. Bayesian analysis solves the corner case, but does it bring any traction at all on a typical case?
- RobinZ 16 Sep 2011 15:49 UTC
  5 points
  Parent
  I think it’s worse than that: Karnofsky’s problem is that he has to compare moderate-mean low-variance estimates to large-mean large-variance estimates, but lukeprog’s solution is for comparing the estimate to the result in cases where the variance is equal across the board.
- [deleted] 16 Sep 2011 16:29 UTC
  4 points
  Parent
  Put another way, the higher the variance in the true payoffs, the less relevant the curse. This is the flipside of: the more accurate the estimates, the less relevant the curse.
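This relationship can be checked with a small simulation. A sketch, assuming normally distributed true values and unit-variance normal estimation noise (both my assumptions): it reports the average gap between the chosen option’s estimate and its true value as the spread of true values grows. The function name and defaults are illustrative.

```python
import random
import statistics


def average_overestimate(true_sd, k=10, trials=20000, seed=0):
    """Mean of (estimate - true value) for the option with the highest
    estimate, when true values are N(0, true_sd^2) and noise is N(0, 1)."""
    rng = random.Random(seed)
    gaps = []
    for _ in range(trials):
        true_values = [rng.gauss(0.0, true_sd) for _ in range(k)]
        estimates = [mu + rng.gauss(0.0, 1.0) for mu in true_values]
        best = max(range(k), key=lambda i: estimates[i])
        gaps.append(estimates[best] - true_values[best])
    return statistics.fmean(gaps)


if __name__ == "__main__":
    for true_sd in (0.0, 1.0, 3.0, 10.0):
        print(true_sd, round(average_overestimate(true_sd), 2))
```

The average overestimate is largest when all true values coincide and shrinks as the true values spread out relative to the noise.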
JGWeissman 16 Sep 2011 19:55 UTC
9 points
Is there an example where applying this correction to the expected values changes the decision?
- Manfred 17 Sep 2011 15:44 UTC
  10 points
  Parent
  In any group there’s going to be random noise, and if you choose an extreme value, chances are that value was inflated by noise. In Bayesian terms, given that something has the highest value, it probably had positive noise, not just positive signal. So the correction is to correct out the expected positive noise you get from explicitly choosing the highest value. Naturally, this correction is greater when the noise is bigger.
  
  So imagine choosing between black boxes. Each black box has some number of gold coins in it, and also two numbers written on it. The first number, A, on the box is like the estimated expected value, and the second number, B, is like the variance. What happened is that someone rolled two distinct dice with B sides, subtracted die 1 from die 2, and added that to the number of gold coins in the box to get A.
  
  So if you see a box with 40, 3 written on it, you know that it has an expected value of 40 gold coins, but might have as few as 37 or as many as 43.
  
  Now comes the problem: I put 10 boxes in front of you, and tell you to choose the one with the most gold coins. The first box is 50, 1 - a very low-variance box. But the last 9 boxes are all high-uncertainty, all with B=20. The expected values printed on them are as follows [I generated the boxes honestly] : 53, 52, 37, 60, 44, 36, 56, 45, 54. Ooh, one of those boxes has a 60 on it! Pick that one!
  
  Okay, don’t pick that one. Think about it—there are 9 boxes with high variance, and the one you picked probably has unusually large noise. To be special among 9 proposals with high variance, it probably has noise at the 80th+ percentile. What’s the 80th percentile of noise for 1d20 − 1d20? I bet it’s larger than 10. You’re better off just going with the 50, 1 box.
  
  And it’s a good thing you applied that correction, because I generated the boxes by typing “RandomInteger[20,9] - RandomInteger[20,9] + 45” into Wolfram Alpha; each of them contains 45 coins.
  
  So this illustrates that what beating the optimizer’s curse really is is a sort of “correction for multiple comparisons.” If you have a lot of noisy boxes, some of them will look large even when they’re not, even larger than non-noisy boxes.
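The percentile guess can be checked by brute force. A sketch, assuming the die model stated earlier in this comment (two fair B-sided dice, noise equal to die 2 minus die 1); the function names are illustrative. Under that model the 80th percentile of the noise for B = 20 comes out at 7, somewhat below the guess of 10, though the qualitative argument is unaffected.

```python
from collections import Counter
from itertools import product


def noise_pmf(sides):
    """Exact distribution of (die 2 - die 1) for two fair dice."""
    counts = Counter(b - a for a, b in product(range(1, sides + 1), repeat=2))
    total = sides * sides
    return {diff: n / total for diff, n in sorted(counts.items())}


def percentile(pmf, q):
    """Smallest value whose cumulative probability reaches q."""
    cumulative = 0.0
    for value, p in sorted(pmf.items()):
        cumulative += p
        if cumulative >= q - 1e-12:
            return value
    return max(pmf)


def prob_max_at_least(pmf, threshold, n):
    """Chance that the largest of n independent noise draws is >= threshold."""
    p_below = sum(p for value, p in pmf.items() if value < threshold)
    return 1.0 - p_below ** n


if __name__ == "__main__":
    pmf = noise_pmf(20)
    print("80th percentile of noise:", percentile(pmf, 0.80))
    # Chance that at least one of 9 labels overstates its box by 15 or more
    print(round(prob_max_at_least(pmf, 15, 9), 2))
```

The second figure is the relevant one for the label of 60: it is the chance that, among nine boxes holding 45 coins each, at least one label lands 15 or more above the contents.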
  - JGWeissman 17 Sep 2011 19:06 UTC
    4 points
    Parent
    That is a good example of how the optimizer’s curse causes an overestimate of the maximum expected value, and even reliably causes a wrong choice to be associated with the maximum expected value. But how do I apply the correction mathematically, so I can know for which expected values on the high uncertainty boxes I should expect the best of them to be better or worse than the low uncertainty box? Even better, how can I deal with situations where the uncertainties of the expected values are not so conveniently categorized (and whose actual values aren’t conveniently uniform)?
    - Manfred 19 Sep 2011 10:52 UTC
      2 points
      Parent
      Oh, I learned how, by the way. You start with some prior over how you expect the actual coins to be distributed, and then you convolve in the noise distribution of each box to get the combined distribution for each box. Then, given where the number on the outside of each box falls on the combined distribution, you can assign how much of that you expect to be signal and how much you expect to be noise by distributing improbability equally between signal and noise. Then you subtract out the expected noise.
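One way to make this recipe concrete is a direct application of Bayes’ rule on a grid, which yields the expected contents of a box given its label. This is a sketch under assumptions of my own: a uniform prior on 30 to 60 coins and the die-difference noise from the box example. It computes the posterior mean directly rather than following the “distribute improbability equally” heuristic.

```python
def dice_difference_pmf(sides):
    """P(die 2 - die 1 = d) for two fair dice with the given number of sides."""
    pmf = {}
    for a in range(1, sides + 1):
        for b in range(1, sides + 1):
            pmf[b - a] = pmf.get(b - a, 0.0) + 1.0 / (sides * sides)
    return pmf


def expected_coins(label, sides, prior):
    """Posterior mean of the coin count given the label on the box.

    prior maps coin count -> prior probability; label = coins + noise.
    """
    noise = dice_difference_pmf(sides)
    # Unnormalized posterior: prior(coins) * P(noise = label - coins)
    weights = {c: p * noise.get(label - c, 0.0) for c, p in prior.items()}
    total = sum(weights.values())
    if total == 0.0:
        raise ValueError("label is impossible under this prior and noise model")
    return sum(c * w for c, w in weights.items()) / total


if __name__ == "__main__":
    prior = {c: 1.0 / 31 for c in range(30, 61)}  # hypothetical: uniform on 30..60
    print(round(expected_coins(60, 20, prior), 1))  # noisy box labeled 60
    print(round(expected_coins(50, 1, prior), 1))   # exact box labeled 50
```

The answer depends heavily on the prior: a prior concentrated near 45 would pull the noisy box much further down than the uniform one used here.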
    - Manfred 17 Sep 2011 19:48 UTC
      0 points
      Parent
      I’m not sure. It’s probably in the paper.
  - Brickman 28 Sep 2011 1:58 UTC
    0 points
    Parent
    I’m trying to figure out why, from the rules you gave at the start, we can assume that box 60 has more noise than the other boxes with variance of 20. You didn’t, at the outset of the problem, say anything about what the values in the boxes actually were. I would not, taking this experiment, have been surprised to see a box labeled “200”, with a variance of 20, because the rules didn’t say anything about values being close to 50, just close to A. Well, I would’ve been surprised with you as a test-giver, but it wouldn’t have violated what I understood the rules to be and I wouldn’t have any reason to doubt that box was the right choice.
    
    The box with 60 stands out among the boxes with high variance, but you did not say that those boxes were generated with the same algorithm and thus have the same actual value. In fact you implied the opposite. You just told me that 60 was an estimate of its expected value, and 37 was an estimate of one of the other boxes’ expected values. So I would assign a very high probability to it being worth more than the box labeled 37. I understand that the variance is being effectively applied twice to go from the number on the box to the real number of coins (the real number of 45 could make an estimate anywhere from 25 to 65, but if it hit 25 I’d be assigning the real number a lower bound of 5 and if it hit 65 I’d be assigning the real number an upper bound of 85, which is twice that range). (Actually for that reason I’m not sure your algorithm really means there’s a variance of 20 from what you state the expected value to be, but I don’t feel like doing all the math to verify that since it’s tangential to the message I’m hearing from you or what I’m saying.) But that doesn’t change the average. To the best of my knowledge, the range of values that my box labeled 60 could really contain is still higher than the range the box labeled 37 could really contain, and both are most likely to fall within a couple coins of the center of that range, with the highest probability concentrated on the exact number.
    
    If the boxes really did contain different numbers of coins, or we just didn’t have reason to assume that they don’t contain different numbers, the box labeled 60 is likely to contain more coins than that 50/1 box did. It is also capable of undershooting 50 by ten times as much if unlucky, so if for some reason I absolutely cannot afford to find less than 50 coins in my box the 50/1 box is the safer choice. But if I bet on the 60/20 box 100 times and you bet on the 50/1 box 100 times, given the rules you set out in the beginning, I would walk away with 20% more money.
    
    Or am I missing some key factor here? Did I misinterpret the lesson?
    - Manfred 28 Sep 2011 4:41 UTC
      2 points
      Parent
      
      Or am I missing some key factor here? Did I misinterpret the lesson?
      
      The key factor is that the 60,20 box is not in isolation—it is the top box, and so not only do you expect it to have more “signal” (gold) than average, you also expect it to have more noise than average.
      
      You can think of the numbers on the boxes as drawn from a probability distribution. If there was 0 noise, this probability distribution would just be how the gold in the boxes was distributed. But if you add noise, it’s like adding two probability distributions together. If you’re not familiar with what happens, go look it up on Wikipedia, but the upshot is that the combined distribution is more spread out than the original. This combined distribution isn’t just noise or just signal, it’s the probability of having some number be written on the outside of the box.
      
      And so if something is the top, very highest box, where should it be located on the combined distribution?
      
      Now, if you have something that’s high on the combined distribution, how much of that is due to signal, and how much of it is due to noise? This is a tougher question, but the essential insight is that the noise shouldn’t be more improbable than the signal, or vice versa—that is, they should both be about the same number of standard deviations from their means.
      
      This means that if the standard deviation of the noise is bigger, then the probable contribution of the noise is greater.
      
      Me saying the same thing a different way can be found here.
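For the special case where both the prior and the noise are normal, the split between signal and noise can be written down exactly. A sketch in my own notation (prior mean m and standard deviation τ for the true contents μ, noise standard deviation σ, label V):

```latex
\mu \sim \mathcal{N}(m, \tau^2), \qquad V \mid \mu \sim \mathcal{N}(\mu, \sigma^2)

\mathbb{E}[\mu - m \mid V] = \frac{\tau^2}{\tau^2 + \sigma^2}\,(V - m), \qquad
\mathbb{E}[V - \mu \mid V] = \frac{\sigma^2}{\tau^2 + \sigma^2}\,(V - m)
```

In this case the deviation of the label from the prior mean is divided in proportion to the variances, so signal and noise sit at the same number of standard deviations only when τ = σ. Either way, a larger σ assigns more of the deviation to noise.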
      - Brickman 28 Sep 2011 12:15 UTC
        2 points
        Parent
        Oh, I understand now. Even if we don’t know how it’s distributed, if it’s the top among 9 choices with the same variance that puts it in the 80th percentile for specialness, and signal and noise contribute to that equally. So it’s likely to be in the 80th percentile of noise.
        
        It might have been clearer if you’d instead made the boxes actually contain coins normally distributed about 40 with variance 15 and B=30, and made an alternative of 50/1, since you’d have been holding yourself to more proper unbiased generation of the numbers and still, in all likelihood, come up with a highest-labeled box that contained less than the sure thing. You have to basically divide your distance from the norm by the ratio of specialness you expect to get from signal and noise. The “all 45” thing just makes it feel like a trick.
          - CynicalOptimist 17 Nov 2016 17:27 UTC
            0 points
            Parent
            I think there’s some value in that observation that “the all 45 thing makes it feel like a trick”. I believe that’s a big part of why this feels like a paradox.
            
            If you have a box with the numbers “60” and “20” as described above, then I can see two main ways that you could interpret the numbers:
            
            A: The number of coins in this box was drawn from a probability distribution with a mean of 60, and a range of 20.
            
            B: The number of coins in this box was drawn from an unknown probability distribution. Our best estimate of the number of coins in this box is 60, based on certain information that we have available. We are certain that the actual value is within 20 gold coins of this.
            
            With regards to understanding the example, and understanding how to apply the kind of Bayesian reasoning that the article recommends, it’s important to understand that the example was based on B. And in real life, B describes situations that we’re far more likely to encounter.
            
            With regards to understanding human psychology, human biases, and why this feels like a paradox, it’s important to understand that we instinctively tend towards “A”. I don’t know if all humans would tend to think in terms of A rather than B, but I suspect the bias applies widely amongst people who’ve studied any kind of formal probability. “A” is much closer to the kind of questions that would be set as exercises in a probability class.
          - Manfred 28 Sep 2011 17:37 UTC
            0 points
            Parent
            That’s true. When I wrote the post you replied to I still didn’t really understand the solution, though it did make a good example for JGWeissman’s question. By the time I wrote the post I linked to, I had figured it out and didn’t have to cheat.
  - Oscar_Cunningham 17 Sep 2011 23:25 UTC
    0 points
    Parent
    But if you don’t know that all the high variance boxes have the same mean then 60 is the one to go with. And if you do know they have the same mean, then its expected value is no longer 60.
    - Manfred 18 Sep 2011 8:48 UTC
      1 point
      Parent
      Imagine putting gold coins into a bunch of boxes by having them normally distributed about 50 gold coins with standard deviation 10. Then we’ll add some Gaussian noise to the estimates on the boxes—but we’ll split them into 2 groups. Ten boxes will have noise with standard deviation of 5, while the other ten will have a standard deviation of 25.
      
      But since I’ve still kept the simple situation where we just have 2 groups, you can get the overall biggest by just picking the biggest from each group and comparing them. So we can treat the groups independently for a bit. The biggest one is going to have the biggest positive deviation from 50, combined signal and noise. Because I used normal distributions this time, the combined prior+noise distribution is just a bigger normal distribution. So given that something is big or small by this combined distribution, how do we expect the signal and noise distributions to shift? Well, it would be silly to expect one of them to be more improbable than the other, so we expect their means to shift by about the same number of standard deviations for each distribution. This right there means that the bigger the noise, the more of the variation we should attribute to noise. And also the bigger the element in the combined distribution, the larger we should expect its noise to be.
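The setup in this comment is simple enough to simulate. The sketch below draws the boxes as described (contents normally distributed about 50 with standard deviation 10, ten labels with noise standard deviation 5 and ten with 25), then compares picking the highest label against picking the highest posterior mean. The shrinkage formula is the standard normal-normal posterior mean; the function names, trial count, and seed are illustrative.

```python
import random
import statistics

PRIOR_MEAN, PRIOR_SD = 50.0, 10.0
NOISE_SDS = [5.0] * 10 + [25.0] * 10  # ten precise labels, ten noisy ones


def posterior_mean(label, noise_sd):
    """Expected contents given the label, for normal prior and normal noise."""
    weight = PRIOR_SD ** 2 / (PRIOR_SD ** 2 + noise_sd ** 2)
    return PRIOR_MEAN + weight * (label - PRIOR_MEAN)


def simulate(trials=5000, seed=1):
    """Return average coins obtained by (highest label, highest posterior mean)."""
    rng = random.Random(seed)
    naive, adjusted = [], []
    for _ in range(trials):
        coins = [rng.gauss(PRIOR_MEAN, PRIOR_SD) for _ in NOISE_SDS]
        labels = [c + rng.gauss(0.0, sd) for c, sd in zip(coins, NOISE_SDS)]
        boxes = range(len(coins))
        pick_naive = max(boxes, key=lambda i: labels[i])
        pick_adjusted = max(
            boxes, key=lambda i: posterior_mean(labels[i], NOISE_SDS[i])
        )
        naive.append(coins[pick_naive])
        adjusted.append(coins[pick_adjusted])
    return statistics.fmean(naive), statistics.fmean(adjusted)


if __name__ == "__main__":
    naive_avg, adjusted_avg = simulate()
    print("highest label:         ", round(naive_avg, 1))
    print("highest posterior mean:", round(adjusted_avg, 1))
```

Here the correction changes the decision and not just the forecast: the highest label usually belongs to a noisy box, while the highest posterior mean usually belongs to a precise one.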
      - Oscar_Cunningham 18 Sep 2011 9:45 UTC
        0 points
        Parent
        But if you know the boxes were originally drawn from N(50,100) then the number on the box is no longer the correct Bayesian mean. All I’m arguing is that once you have your Bayesian expected value you don’t need to update it any further.
        Manfred 18 Sep 2011 10:13 UTC
        3 points
        Parent
        
        All I’m arguing is that once you have your Bayesian expected value you don’t need to update it any further.
        
        That’s pretty uncontroversial, but in practice it means that you end up penalizing high-noise boxes with high values (and boosting high-noise boxes with low values), which I think is a nontrivial result.
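To make the penalty and boost concrete, here is a small sketch of the standard normal-prior, normal-noise posterior mean, using the N(50, 100) prior and the two noise levels from the box example. The function name and the specific labels (70, 90, 10) are made up for illustration.

```python
def posterior_mean(label, noise_sd, prior_mean=50.0, prior_sd=10.0):
    """Posterior mean of a box's true value under a normal prior and normal noise."""
    prior_var = prior_sd ** 2
    noise_var = noise_sd ** 2
    # The noisier the label, the less weight it gets relative to the prior.
    weight_on_label = prior_var / (prior_var + noise_var)
    return prior_mean + weight_on_label * (label - prior_mean)


low_noise_70 = posterior_mean(70.0, noise_sd=5.0)    # mild shrinkage toward 50
high_noise_90 = posterior_mean(90.0, noise_sd=25.0)  # heavy shrinkage toward 50
high_noise_10 = posterior_mean(10.0, noise_sd=25.0)  # low label pulled up toward 50

print(low_noise_70, high_noise_90, high_noise_10)
```

With these numbers, a low-noise box labeled 70 ends up with a higher posterior mean than a high-noise box labeled 90, while a high-noise box labeled 10 is revised upward.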
- Johnicholas 17 Sep 2011 13:51 UTC
  1 point
  Parent
  I’m trying to imagine a scenario.
  
  Possibly the decider knows that people sometimes make multiplicative errors, transposing numbers or misplacing decimals, and is confronted with a set of estimates hovering around, say, 0.05 (and that is plausible according to the decider’s prior) and a few estimates at around 0.5 and 5.0. Would the correction effectively trim the outliers back to almost exactly 0.05 (because we can’t learn much information from an estimate that probably had at least one mistake in it), so that the decider should go with the highest of the “plausible” numbers?
  
  It seems to me like the conditional distributions that would lead to actually changing your decision are nearly as likely to be a source of error as a correction.
DSimon 16 Sep 2011 7:30 UTC
7 points
Would this issue also apply to picking a contractor for a project based on the lowest bid?
- Solvent 16 Sep 2011 21:07 UTC
  3 points
  Parent
  No, because the lowest bid is a commitment from the contractor, not an estimate. This particular problem arises from trying to pick the best option from several estimates.
  - CronoDAS 18 Sep 2011 22:43 UTC
    13 points
    Parent
    Sometimes contractors run out of money before finishing and you have to pay more or they leave you with a half-finished project :(
- PhilGoetz 21 Sep 2011 0:05 UTC
  2 points
  Parent
  It would probably lead to contractors selected that way often going over budget.
handoflixue 16 Sep 2011 19:25 UTC
4 points
I’m not sure how exactly this differs from the GiveWell blog post along the same lines? You both seem to be dealing with roughly the same problem (decision making under uncertainty), and reach the same conclusion (pay attention to the standard deviation, use Bayesian updates).

I did find your graph in the middle a rather useful illustration, but otherwise don’t feel like I’ve come away with anything really new...
- Solvent 16 Sep 2011 21:05 UTC
  10 points
  Parent
  Well, to start with, Luke has provided an actual mechanism for this mistake to occur by.
PhilGoetz 21 Sep 2011 0:01 UTC
3 points
This is interesting, but I don’t see how to apply the solution. Presumably either I have no priors, or the priors are going to be generated by the same process I use to generate the values I am combining them with.

The resulting bias should be smaller if you choose the top 2 or 3 alternatives. E.g., give to 3 charities, not to 1.

How do market traders deal with this problem?
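The suggestion above, that spreading a choice across the top 2 or 3 alternatives shrinks the bias, can be checked with a quick simulation. This is a sketch under made-up assumptions: ten options, true values and estimation noise both standard normal, and hypothetical function and parameter names.

```python
import random
import statistics


def selection_bias(k, n_options=10, noise_sd=1.0, trials=5000, seed=1):
    """Average (estimate - true value) over the k options with the highest estimates."""
    rng = random.Random(seed)
    gaps = []
    for _ in range(trials):
        options = []
        for _ in range(n_options):
            true_value = rng.gauss(0.0, 1.0)
            estimate = true_value + rng.gauss(0.0, noise_sd)
            options.append((estimate, true_value))
        top_k = sorted(options, reverse=True)[:k]
        gaps.append(statistics.fmean(est - tv for est, tv in top_k))
    return statistics.fmean(gaps)


bias_top1 = selection_bias(k=1)
bias_top3 = selection_bias(k=3)
print(f"average overestimate, top 1: {bias_top1:.2f}")
print(f"average overestimate, top 3: {bias_top3:.2f}")
```

In this toy setting the average overestimate is smaller for the top-3 portfolio than for the single top pick, though it does not vanish.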
NancyLebovitz 16 Sep 2011 16:35 UTC
2 points
If I understand this correctly, there’s an empirical problem.

How optimistic your most optimistic estimate turns out to be will depend on temperament and knowledge for individuals, and on group culture for groups. It seems to me that the correction would need to be determined by experience. Or is this the “appropriate prior” problem?

When I’d only seen the title for this article, I thought it was going to be about the question of how much effort you should put into optimizing.
[deleted] 16 Sep 2011 4:06 UTC
2 points
This is nit-picky, but I don’t think you should attribute to Robert Burns anything other than the words he actually wrote. Meanings change a lot in translation, and it’s not quite fair to do that through invisible sleight of hand. “Robert Burns (standard English translation)” would serve to CYA.
- wnoise 16 Sep 2011 6:21 UTC
  10 points
  Parent
  The original lines:
  
  The best laid schemes o’ Mice an’ Men,
  Gang aft agley,
  An’ lea’e us nought but grief an’ pain,
  For promis’d joy!
  
  are little different than the version Luke quoted, and are mostly understandable (with the exception of “gang aft agley”) to a sophisticated English reader with no special knowledge. I am somewhat inclined to call that version a rewrite rather than a translation, just as I would consider some modernized versions of Shakespeare to not be translations, but rewrites.
  
  The standard problem of drawing lines in a continuum rears its head again. There are some reasonable arguments for calling Scots from this time a dialect of English, and many others for calling it a separate language. This is complicated by people’s personal and national identities being involved. Questions like these generally end up being settled more by politics than by details of the different linguistic varieties involved.
- lukeprog 16 Sep 2011 5:17 UTC
  6 points
  Parent
  Okay, I added ‘(translated)’.
- komponisto 16 Sep 2011 5:19 UTC
  5 points
  Parent
  Would you say the same thing if a translation had been quoted of a poem originally in Latin or French?
  
  (My guess: probably not. No one talks about a “standard English translation” of Catullus or Baudelaire. Instead, they credit the translator by name, or simply take the liberty of using the translation as if it were the original author’s words.)
  - [deleted] 16 Sep 2011 5:24 UTC
    0 points
    Parent
    The translator should absolutely be credited by name if he or she is known. Burns has passed kind of into folk status, and is a special case.
    
    I would never quote Catullus or Baudelaire in English as if it were the original author’s words. No. It’s wrong (deprives the translator of rightful credit) -- and, FWIW, it’s also low-status.
    - komponisto 16 Sep 2011 6:02 UTC
      8 points
      Parent
      
      Burns has passed kind of into folk status, and is a special case.
      
      What matters, obviously, is not whether Burns has passed into folk status, but whether the particular translation has. The latter seems an implausible claim (since printed translations can presumably be traced and attributed), but if it were true, then there would be no need for acknowledgement (almost by definition of “folk status”).
      
      My comment arose from the suspicion that you reacted as if Burns had been paraphrased, as opposed to translated, because the original language looks similar enough to English that a translation will tend to look like a paraphrase. I find it unlikely that you would actually have made this comment if lukeprog had quoted Catullus without mentioning the translator; and on the other hand I suspect you would have commented if he had taken the liberty of paraphrasing (or “translating”) a passage from Shakespeare into contemporary English without acknowledging he had done so. My point being that the case of Burns should be treated like the former scenario, rather than the latter, whereas I suspect you intuitively perceived the opposite.
      
      All translation is paraphrase, of course—but there is a difference of connotation that corresponds to a difference in etiquette. When one is dealing with an author writing in the same language as oneself, there is a certain obligation to the original words that does not (cannot) exist in the case of an author writing in a different language. So basically, I saw your comment as not-acknowledging that Burns was writing in a different language.
      
      I would never quote Catullus or Baudelaire in English as if it were the original author’s words. No. It’s wrong (deprives the translator of rightful credit) -- and, FWIW, it’s also low-status.
      
      I don’t see it as lowering the status of the quoter; the status dynamic that I perceive is that it grants very high status to the original author, status so high that we’re willing to overlook the original author’s handicap of speaking a different language. In effect, it grants them honorary in-group status.
      
      For example: Descartes has high enough status that the content of his saying “I think therefore I am” is more important to us than the fact that his actual words would have sounded like gibberish (unless we know French); people who speak gibberish normally have low status. Or, as Arnold Schoenberg once remarked (probably in German), “What the Chinese philosopher says is more important than that he speaks Chinese”. Only high-status people like philosophers get this kind of treatment!
      - Bill_McGrath 16 Sep 2011 10:47 UTC
        6 points
        Parent
        
        Or, as Arnold Schoenberg once remarked (probably in German), “What the Chinese philosopher says is more important than that he speaks Chinese”. Only high-status people like philosophers get this kind of treatment!
        
        Google has let me down in finding this quote, both in English and in roughly-translated German. Where is this from?
        komponisto 18 Sep 2011 1:29 UTC
        0 points
        Parent
        A statement like this is attributed to Schoenberg by a number of people, but I can’t find a specific reference either. Perhaps it was just something he said orally, without ever writing it anywhere.
        garethrees 12 Jan 2012 17:46 UTC
        2 points
        Parent
        The earliest reference I can track down is from 1952. In Roger Sessions: a biography (2008), Andrea Olmstead writes:
        
        [In 1952] Sessions published “Some notes on Schoenberg and the ‘method of composing with twelve tones’.” At the head of the article he quoted from one of Schoenberg’s letters to him: “A Chinese philosopher speaks, of course, Chinese; the question is, what does he say?” Sessions [had performed] the role of a Chinese philosopher in Cleveland.
        
        (The work that Sessions had performed this role in appears to have been The Man Who Ate the Popomack in the mid-1920s.)
        
        Sessions’ essay (originally published in The Score and then collected in Roger Sessions on Music) begins:
        
        Arnold Schönberg sometimes said ‘A Chinese philosopher speaks, of course, Chinese; the question is, what does he say?’ The application of this to Schönberg’s music is quite clear. The notoriety which has, for decades, surrounded what he persisted in calling his ‘method of composing with twelve tones’, has not only obscured his real significance, but, by focusing attention on the means rather than on the music itself, has often seemed a barrier impeding a direct approach to the latter.
        
        An entertaining later reference to this quotation appears in Dialogues and a diary by Igor Stravinsky and Robert Craft (1963), where Stravinsky tabulates the differences between himself and Schoenberg, culminating in this comparison:
        
        Stravinsky: ‘What the Chinese philosopher says cannot be separated from the fact that he says it in Chinese.’ (Preoccupation with manner and style.)
        
        Schoenberg: ‘A Chinese philosopher speaks Chinese, but what does he say?’ (‘What is style?’)
        
        garethrees 12 Jan 2012 16:01 UTC
        0 points
        Parent
        This seems to have been Stravinsky’s playful characterization of Schoenberg. See Dialogues by Igor Stravinsky and Robert Craft, p. 108, where Stravinsky tabulates the differences between himself and Schoenberg, culminating in:
        
        Stravinsky: ‘What the Chinese philosopher says cannot be separated from the fact that he says it in Chinese.’ (Preoccupation with manner and style.)
        
        Schoenberg: ‘A Chinese philosopher speaks Chinese, but what does he say?’ (‘What is style?’)
        
        I guess it’s possible that Stravinsky is quoting Schoenberg here, but the parallelism suggests not, and when he does quote Schoenberg (as in row 1 in the table), he gives a citation.
      - wnoise 16 Sep 2011 7:25 UTC
        5 points
        Parent
        
        All translation is paraphrase, of course—but there is a difference of connotation that corresponds to a difference in etiquette. When one is dealing with an author writing in the same language as oneself, there is a certain obligation to the original words that does not (cannot) exist in the case of an author writing in a different language.
        
        Right. But there are no hard-and-fast lines for “same language as oneself”.
        
        So basically, I saw your comment as not-acknowledging that Burns was writing in a different language.
        
        You and I both brought up comparisons with Shakespeare. Both can be difficult to read for a struggling reader. For a sophisticated reader, the gist of both can be gotten with a modicum of effort. Full understanding of either requires a specialized dictionary, as vocabulary is different. So was Shakespeare writing in a different language? Was Burns? What’s the purpose of this distinction? If it’s weighing understanding vs adherence to the original wording, the trade-off is fairly close to the same place for the two. On the other hand, if it’s to acknowledge the politic linguistic classification that Scots is a separate language from Modern English, there is a distinction, as no one cares whether Early Modern English is treated as a separate language from Modern English. (EDIT: I should say that I do think it’s often more useful to consider Scots a separate language. Just because Burns was mostly intelligible to the English does not mean that other authors or speakers generally were.)
        
        French
        
        Meditations was first published in Latin.
      - [deleted] 16 Sep 2011 6:28 UTC
        1 point
        Parent
        My comment arose from the suspicion that you reacted as if Burns had been paraphrased, as opposed to translated
        
        I don’t know what to tell you except that you’re wrong. I know the original poem pretty well (“Gang aft agley” is a famous phrase in some circles). Burns isn’t my specific field, but my impression, backed by a cursory Wikipedia search, is that the name of the original translator has been lost to the mists of history. If anyone can correct me and supply the original translator’s name, I’ll be truly grateful.
        
        I don’t see it as lowering the status of the quoter
        
        Yes, you wouldn’t, and I can’t prove it to you except by assembling a conclave of Ivy League-educated snooty New York poets who happen to not be here right now. I will tell you—and you can update scantily, since you don’t trust the source—that the high-status thing to do is to provide quotes in the original language without translation. You are thereby signalling that not only do YOU read Scots Gaelic (fluently, of course), but you expect everyone you come into contact with socially to ALSO be fluent in Scots Gaelic.
        
        The medium-status thing to do is at least to credit or somehow mark the translator, so that people think you are following standard academic rules for citation.
        
        The reason that quoting translations without crediting them as such is low-status is that it leaves you open to charges of not understanding the original source material.
        wnoise 16 Sep 2011 7:39 UTC
        14 points
        Parent
        
        You are thereby signalling that not only do YOU read Scots Gaelic (fluently, of course), but you expect everyone you come into contact with socially to ALSO be fluent in Scots Gaelic.
        
        Scots Gaelic is not Scots (is not Scottish English, though modern speakers of Scots do generally code switch into it with ease, sometimes in a continuous way). Scots Gaelic is a Gaelic, Celtic language. Scots is Germanic. Burns wrote in Scots.
        [deleted] 21 Sep 2011 0:33 UTC
        4 points
        Parent
        You’re right, and thanks for the clarification. As I said, Burns isn’t really my field.
        [deleted] 16 Sep 2011 13:28 UTC
        11 points
        Parent
        Scots Gaelic is a thing, but it is not the language in which Burns wrote. That’s just called Scots. I wouldn’t ordinarily have mentioned it, but… you’re coming off as a bit snobby here. (O wad some Power the giftie gie us, am I right?)
        JoshuaZ 16 Sep 2011 12:42 UTC
        10 points
        Parent
        
        that the high-status thing to do is to provide quotes in the original language without translation
        
        This may be high status in certain social circles (having interacted with the snooty Ivy League educated New York poets also, they certainly think so) but to a lot of people doing so comes across as obnoxious and pretentious, that is an attempt to blatantly signal high status in a way that signals low status.
        
        The highest status thing to do (and just optimal as far as I can tell for actually conveying information) is to include the original and the translation also.
        [deleted] 21 Sep 2011 0:36 UTC
        3 points
        Parent
        I agree that this is probably optimal. My own class background is academics and published writers (both my parents are tenured professors). It’s actually hard trying to explain in a codified way what one knows at a gut level: I know that translations need to be credited, and for status reasons, but press me on the reasons and I’m probably not terribly reliable.
        gwern 21 Sep 2011 1:15 UTC
        11 points
        Parent
        I find it interesting that everyone here is focusing on status; couldn’t it just be that crediting translations is absolutely necessary for the basic scholarly purpose of judging the authority and trustworthiness of the translation and even the original text? And that failing to provide attribution demonstrates a lack of academic expertise, general ignorance of the slipperiness of translation (‘hey, how important could it be?’), and other such problems.
        
        I know I find such information indispensable for my anime Evangelion research (I treat translations coming from ADV very differently from translations by Olivier Hague and that different from translations by Bochan_bird, and so on, to give a few examples), so how much more so for real scholarship?
        [deleted] 21 Sep 2011 1:53 UTC
        10 points
        Parent
        Well, what I originally [see edit] wrote was “It’s wrong (deprives the translator of rightful credit) -- and, FWIW, it’s also low-status.” I think people found the “low-status” part of my claim more interesting, but it wasn’t the primary reason I reacted badly to seeing a translation uncredited as such.
        
        Edit: on reflection, this wasn’t my original justification. I simply reacted with gut-level intuition, knowing it was wrong. Every other explanation is after-the-fact, and therefore suspect.
        JoshuaZ 21 Sep 2011 1:59 UTC
        3 points
        Parent
        Upvoting for realizing that a rationalization wasn’t your actual reason.
        JoshuaZ 21 Sep 2011 1:26 UTC
        1 point
        Parent
        Yes, agreed. I did note above that including the translation details with the original was optimal for conveying information but I didn’t emphasize it. I think that part of why people have been emphasizing status issues over serious research in this context is that the start of the discussion was about what to do with epigraphs. Since they really are just for rhetorical impact, the status issue matters more for them.
        A1987dM 22 Apr 2012 0:25 UTC
        2 points
        Parent
        
        [if you] provide quotes in the original language without translation [you are signalling that] you expect everyone you come into contact with socially to ALSO be fluent in [the language].
        
        This was the case until about a decade ago, but nowadays it merely signals that you expect the audience to know how to use Google (and be willing to). (The favourite quotations section in my Facebook profile contains quotations in maths, Italian, English, Irish and German and none of them is translated into any other language.)
        ArisKatsaris 16 Sep 2011 13:14 UTC
        0 points
        Parent
        Status is in the map, not in the territory, siduri. The map of “snooty New-York poets” needn’t be our own map.
        JoshuaZ 16 Sep 2011 16:19 UTC
        2 points
        Parent
        
        Status is in the map, not in the territory, siduri. The map of “snooty New-York poets” needn’t be our own map.
        
        Yes but being aware of what signals one is sending out is helpful. Given that humans play status games it is helpful to be aware of how those games function so one doesn’t send signals out that cause people to pay less attention or create other barriers to communication.
        Hey 16 Sep 2011 16:31 UTC
        7 points
        Parent
        Agreed, but it takes a high degree of luminosity to distinguish between tactical use of status to attain a specific objective, and getting emotionally involved and reactive to the signals of others (inducing this state of confusion is pretty much the function of status-signals for most humans, though).
        
        Tactical = dress up, display “irrational confidence”, and play up your achievements to maximize attraction in potential romantic partners, or do well at a job interview.
        
        Emotional-reactive = seeking, and worrying about, the approval of perceived social betters even though there is no logical reason.
      - prase 19 Sep 2011 19:51 UTC
        0 points
        Parent
        
        Only high-status people like philosophers get this kind of treatment!
        
        Are you saying that always when a sentence is translated, its author must have high status or gains high status at the moment of translation, because the default attitude is to ignore anything originally uttered in foreign language?
        
        If this is what you mean, I find it surprising. I have probably never been in a situation when someone was ignored because he spoke incomprehensible gibberish and that fact was more important than the content of his words. Of course, translation may be costly and people generally pay only for things they deem valuable, which is where the status comes into play. But it doesn’t mean that with low-status people it is more important that they speak gibberish than what they say.
        
        (A thought experiment: A Gujarati speaking beggar approaches a rich English gentleman, says something and goes away. The Englishman’s wife, who is accompanying him at the moment, accidentally understands Gujarati. The man can recognise the language but doesn’t understand a word. What is the probability that he asks his wife “what did he say”? As a control group, imagine the same with an English beggar, this time the gentleman didn’t understand because when the beggar had spoken, a large truck had passed by. Is the probability of asking “what did he say” any different from the first group?)
        komponisto 19 Sep 2011 21:45 UTC
        2 points
        Parent
        
        Are you saying that always when a sentence is translated, its author must have high status or gains high status at the moment of translation, because the default attitude is to ignore anything originally uttered in foreign language?
        
        Yes. More generally, the default attitude is to ignore anything uttered by a member of an outgroup. By calling attention to the fact that a sentence has been translated, one is calling attention to the fact that the author speaks a foreign language and thus to the author’s outgroup status. Omitting mention of a person’s outgroup status is a courtesy extended to those we wish to privilege above typical outgroup members.
        
        (A thought experiment: A Gujarati speaking beggar approaches a rich English gentleman, says something and goes away. The Englishman’s wife, who is accompanying him at the moment, accidentally understands Gujarati. The man can recognise the language but doesn’t understand a word. What is the probability that he asks his wife “what did he say”? As a control group, imagine the same with an English beggar, this time the gentleman didn’t understand because when the beggar had spoken, a large truck had passed by. Is the probability of asking “what did he say” any different from the first group?)
        
        Curiosity about what a low-status person says does not imply that one thinks the content of their words is a more important fact about them than their low status. With high probability, the most salient aspect of the beggar from the perspective of the Englishman is that he is a beggar (and, in the first case, a foreign beggar at that). Whatever the beggar said, if the Englishman finds out and deems it worthy of recounting later, I would be willing to bet that he will not omit mention of the fact that he heard it from a beggar.
carey 14 Oct 2012 10:47 UTC
1 point
Note Carl Shulman’s counterargument to the assumption of a normal prior here and the comments traded between Holden and Carl.

“If your prior was that charity cost-effectiveness levels were normally distributed, then no conceivable evidence could convince you that a charity could be 100x as good as the 90th percentile charity. The probability of systematic error or hoax would always be ludicrously larger than the chance of such an effective charity. One could not believe, even in hindsight, that paying for Norman Borlaug’s team to work on the Green Revolution, or administering smallpox vaccines (with all the knowledge of hindsight) actually did much more good than typical. The gains from resources like GiveWell would be small compared to acting like an index fund and distributing charitable dollars widely.”
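A rough numerical illustration of how thin a normal prior’s tail is, contrasted with a heavier-tailed alternative. Both priors here are assumptions chosen only for illustration (normal with mean 1 and standard deviation 1 in arbitrary units; log-normal with log-mean 0 and log-standard-deviation 1); neither is taken from the quoted argument.

```python
import math
from statistics import NormalDist

Z_90 = NormalDist().inv_cdf(0.90)  # 90th percentile of a standard normal


def upper_tail(z):
    """P(Z > z) for a standard normal, via erfc so that tiny tails are not rounded away."""
    return 0.5 * math.erfc(z / math.sqrt(2.0))


# Hypothetical normal prior on cost-effectiveness: mean 1, sd 1 (arbitrary units).
normal_mean, normal_sd = 1.0, 1.0
p90_normal = normal_mean + Z_90 * normal_sd
z_normal = (100.0 * p90_normal - normal_mean) / normal_sd
tail_normal = upper_tail(z_normal)

# Hypothetical log-normal prior: log(cost-effectiveness) ~ N(0, 1).
log_mu, log_sigma = 0.0, 1.0
log_p90 = log_mu + Z_90 * log_sigma
z_lognormal = (log_p90 + math.log(100.0) - log_mu) / log_sigma
tail_lognormal = upper_tail(z_lognormal)

print(f"normal prior:     P(100x the 90th percentile or better) = {tail_normal:.3g}")
print(f"log-normal prior: P(100x the 90th percentile or better) = {tail_lognormal:.3g}")
```

Under the assumed normal prior the target sits hundreds of standard deviations out, so its prior probability underflows to zero in double precision; under the assumed log-normal prior it is small but not negligible.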
- Mass_Driver 16 Oct 2012 7:59 UTC
  1 point
  Parent
  The problem with this analysis is that it assumes that the prior should be given the same weight both ex ante and ex post. I might well decide to evenly weight my prior (intuitive) distribution showing a normal curve and my posterior (informed) distribution showing a huge peak for the Green Revolution, in which case I’d only think the Green Revolution was one of the best charitable options, and would accordingly give it moderate funding, rather than all available funding for all foreign aid. But, then, ten years later, with the benefit of hindsight, I now factor in a third distribution, showing the same huge peak for the Green Revolution. And because the third distribution is based not on intuition or abstract predictive analysis but on actual past results, it’s entitled to much more weight. I might calculate a Bayesian update based on observing my intuition once, my analysis once, and the historical track record ten or twenty times. At that point, I would have no trouble believing that a charity was 100x as good as the 90th percentile. That’s an extraordinary claim, but the extraordinary evidence to support it is well at hand. By contrast, no amount of ex ante analysis would persuade me that your proposed favorite charity is 100x better than the current 90th percentile, and I have no problem with that level of cynicism. If your charity’s so damn good, run a pilot study and show me. Then I’ll believe you.
tetsuo55 16 Sep 2011 20:43 UTC
1 point
quick feedback or question.

In this part: Assume, too kindly, that your estimates are unbiased. And suppose you use this decision procedure many times, for many different decisions, and your estimates are unbiased.

The second mention of “unbiased” makes no sense to me and looks like a typo.
The_Jaded_One 8 Jan 2017 11:04 UTC
0 points
If X = Skill + Luck, with Skill and Luck both random variables, then selecting max(X) will get you something that has high Skill and high Luck.

If Estimate = TrueVal + Error, then max(Estimate) will have both high TrueVal and high Error.

This obvious insight has many applications, especially when the selection is done over a very large number of entities, e.g. trying to emulate the habits of billionaires in order to become rich.
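A minimal simulation of the Skill + Luck point, assuming both components are independent standard normals and that the winner is picked from 1,000 entities. These distributions, the sizes, and the function name are illustrative assumptions.

```python
import random
import statistics


def winner_profile(n_entities=1000, trials=200, seed=2):
    """Average Skill and Luck of the entity with the highest X = Skill + Luck."""
    rng = random.Random(seed)
    skills, lucks = [], []
    for _ in range(trials):
        best = max(
            ((rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0)) for _ in range(n_entities)),
            key=lambda pair: pair[0] + pair[1],
        )
        skills.append(best[0])
        lucks.append(best[1])
    return statistics.fmean(skills), statistics.fmean(lucks)


mean_skill, mean_luck = winner_profile()
print(f"winner's average Skill: {mean_skill:.2f}")
print(f"winner's average Luck:  {mean_luck:.2f}")
```

With equal variances, the winner’s advantage is split roughly evenly between Skill and Luck, so copying the winner’s habits would reproduce at most the Skill half.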
CynicalOptimist 17 Nov 2016 19:26 UTC
0 points
Very interesting. I’m going to try my hand at a short summary:

Assume that you have a number of different options you can choose, that you want to estimate the value of each option and you have to make your best guess as to which option is most valuable. In step one, you generate individual estimates using whatever procedure you think is best. In step 2 you make the final decision, by choosing the option that had the highest estimate in step one.

The point is: even if you have unbiased procedures for creating the individual estimates in step one (ie procedures that are equally likely to overestimate as to underestimate) biases will still be introduced in step 2, when you’re looking at the list of all the different estimates. Specifically, the biases are that the highest estimate(s) are more likely to be overestimates, and the lowest estimate(s) are more likely to be underestimates.
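The summary above can be checked numerically. In this sketch the step-one errors are symmetric around zero (so each individual estimate is unbiased), and we then look at the errors of the highest and lowest estimates. The distributions, sizes, and names are assumptions for illustration.

```python
import random
import statistics


def extreme_errors(n_options=10, trials=5000, seed=3):
    """Mean error (estimate - true value) overall, for the top estimate, and for the bottom one."""
    rng = random.Random(seed)
    all_errors, top_errors, bottom_errors = [], [], []
    for _ in range(trials):
        scored = []
        for _ in range(n_options):
            true_value = rng.gauss(0.0, 1.0)
            error = rng.gauss(0.0, 1.0)  # symmetric, so step one is unbiased
            scored.append((true_value + error, error))
        scored.sort()  # ascending by estimate
        all_errors.extend(err for _, err in scored)
        bottom_errors.append(scored[0][1])
        top_errors.append(scored[-1][1])
    return (
        statistics.fmean(all_errors),
        statistics.fmean(top_errors),
        statistics.fmean(bottom_errors),
    )


overall_error, top_error, bottom_error = extreme_errors()
print(f"mean error, all estimates:    {overall_error:.2f}")
print(f"mean error, highest estimate: {top_error:.2f}")
print(f"mean error, lowest estimate:  {bottom_error:.2f}")
```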
DanielLC 16 Sep 2011 4:16 UTC
−5 points
Am I the only one that thinks that this is a silly definition of bias?

The technical definition of bias, the one you’re using, is that given a true value, the expected value of the estimate is equal to the true value. The one that I’d use is that given an estimate, the expected value of the true value is equal to the estimate. The latter is what you should be minimizing.

You should be using Bayesian methods to find these expected values, and they generally are biased, at least in the technical sense. You shouldn’t come up with an unbiased estimator and correct for it using Bayesian methods. You should use a biased estimator in the first place.
- Matt_Simpson 16 Sep 2011 5:51 UTC
  2 points
  Parent
  
  The technical definition of bias, the one you’re using, is that given a true value, the expected value of the estimate is equal to the true value. The one that I’d use is that given an estimate, the expected value of the true value is equal to the estimate. The latter is what you should be minimizing.
  
  The technical definition is E[estimate - true value], where the true value is typically taken as a number and not a variable we have uncertainty about, but there’s nothing in this definition preventing the true value from being a random variable.
  - wnoise 16 Sep 2011 8:21 UTC
    0 points
    Parent
    Yes, the technical definition is E[estimate - parameter], but “unbiased” has an implicit “for all parameter values”. You really can’t stick a random variable there and have the same meaning that statisticians use. (That said, I don’t see how DanielLC’s reformulation makes sense.)
    - Matt_Simpson 16 Sep 2011 16:01 UTC
      0 points
      Parent
      It won’t have the same meaning, but nothing in the math prevents you from doing it and it might be more informative since it allows you to look at a single bias number instead of an uncountable set of biases (and Bayesian decision theory essentially does this). To be a little more explicit, the technical definition of bias is:
      
      E[estimator|true value] - true value
      
      And if we want to minimize bias, we try to do so over all possible values of the true values. But we can easily integrate over the space of the true value (assuming some prior over the true value) to achieve
      
      E[ E[estimator|true value] - true value ] = E[ estimator - true value ]
      
      This is similar to the Bayes risk of the estimator with respect to some prior distribution (the difference is that we don’t have a loss function here). By analogy, I might call this “Bayes bias.”
      
      The only issue is that your estimator may be right on average without being anywhere close to the true value in any given case. Usually bias is used along with the variance of the estimator (since MSE(estimator) = Variance(estimator) + [Bias(estimator)]^2), but we could instead modify our definition of Bayes bias to take the absolute value of the difference, so that we only have to look at one number, with values closer to zero meaning better estimators. Then we’re just calculating Bayes risk with respect to some prior and absolute error loss, i.e.
      
      E[ | estimator - true value | ]
      
      (Which is NOT in general equal to | E[estimator - true value] |; by Jensen’s inequality it is at least as large.)
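A small Monte Carlo sketch of these quantities, under assumptions chosen only for illustration: a standard normal prior over the true value and an estimator equal to the true value plus standard normal noise. Because everything is averaged over the prior here, the variance term is the variance of the error rather than of the estimator at a fixed true value.

```python
import random
import statistics

rng = random.Random(4)
errors = []
for _ in range(20000):
    true_value = rng.gauss(0.0, 1.0)              # draw from the assumed prior
    estimator = true_value + rng.gauss(0.0, 1.0)  # unbiased given the true value
    errors.append(estimator - true_value)

bayes_bias = statistics.fmean(errors)                      # E[estimator - true value]
mean_abs_error = statistics.fmean(abs(e) for e in errors)  # E[|estimator - true value|]
mse = statistics.fmean(e * e for e in errors)
variance_plus_bias_sq = statistics.pvariance(errors) + bayes_bias ** 2

print(f"Bayes bias:          {bayes_bias:.3f}")
print(f"mean absolute error: {mean_abs_error:.3f}")
print(f"MSE:                 {mse:.3f}")
print(f"variance + bias^2:   {variance_plus_bias_sq:.3f}")
```

The Bayes bias comes out near zero while the mean absolute error does not, which is the gap the Jensen’s inequality remark points at; the last two printed numbers agree because the MSE decomposition holds exactly for sample moments.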