I think you’ve got a lot of the core idea. But it’s not important that we know that the data point has some ranking within a distribution. Let me try and explain the ideas as I understand them.
The unbiased estimator is unbiased in the sense that, for any actual value of the thing being estimated, the expected value of the estimate across the possible data is the true value.
To be concrete, suppose I tell you that I will generate a true value, and then add either +1 or −1 to it with equal probability. An unbiased estimator is just to report back the value you get:
E[estimate(data) | true value x] = estimate(x + 1)/2 + estimate(x − 1)/2
If the estimate function is the identity, we have ((x + 1) + (x − 1))/2 = x. So it's unbiased.
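A quick way to check this is to compute the expectation directly. This is a minimal sketch in Python; the names `expected_estimate`, `bias`, and `identity` are mine, chosen for illustration, and it assumes exactly the ±1 noise model described above.

```python
def expected_estimate(estimate, true_value):
    """Expected output of `estimate` when the data is true_value + 1 or
    true_value - 1, each with probability 1/2 (the noise model above)."""
    return 0.5 * estimate(true_value + 1) + 0.5 * estimate(true_value - 1)


def bias(estimate, true_value):
    """Bias at one particular true value: expected estimate minus truth."""
    return expected_estimate(estimate, true_value) - true_value


def identity(data):
    """Report back the value you were given."""
    return data


# The bias is zero whatever true value we pick, including extreme ones.
for true_value in (-5.0, 0.0, 2.5, 23000.0):
    print(true_value, bias(identity, true_value))
```

Swapping in some other function for `identity` (for example one that halves the data) gives a bias that depends on the true value, which is what "unbiased for every value of the parameter" rules out.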
Now suppose I tell you that I will generate the true value by drawing from a normal distribution with mean 0 and variance 1, add +1 or −1 as before, and then tell you 23,000 as the reported value. Via Bayes, you can see that it is more likely that the true value is 22,999 than 23,001. But the unbiased estimator blithely reports 23,000.
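To put a number on "more likely", here is a rough sketch of the Bayes calculation in Python. It assumes the N(0, 1) prior and the equiprobable ±1 noise from above; the function names are hypothetical. It works in log odds because the normal density at 23,000 underflows to zero in floating point.

```python
import math


def log_odds_lower(reported):
    """Log posterior odds that the true value is reported - 1 rather than
    reported + 1, assuming a N(0, 1) prior and equiprobable +1/-1 noise."""
    lower, upper = reported - 1.0, reported + 1.0
    # log N(t; 0, 1) = -t**2 / 2 + constant. The constant and the 1/2
    # likelihood of each noise value cancel in the ratio.
    return (-lower ** 2 / 2.0) - (-upper ** 2 / 2.0)


def posterior_mean(reported):
    """Posterior mean of the true value under the same assumptions.

    The log odds simplify to 2 * reported, so the posterior probability of
    the lower candidate is sigmoid(2 * reported) and the mean works out to
    reported - tanh(reported).
    """
    return reported - math.tanh(reported)


print(log_odds_lower(23000.0))
print(posterior_mean(23000.0))
```

For a reported value of 23,000 the log odds come out to 46,000 in favour of 22,999, so the posterior mean is essentially 22,999, whereas the unbiased estimator still says 23,000.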
So, though the asymmetry is doing some work here (the further we move above 0, the more likely that +1 rather than −1 is doing some of the work), it could still be that 23,000 is the smallest of the values I sampled.
“So, though the asymmetry is doing some work here (the further we move above 0, the more likely that +1 rather than −1 is doing some of the work), it could still be that 23,000 is the smallest of the values I sampled.” That’s very interesting.
So I looked at the definition on Wikipedia and it says: “An estimator is said to be unbiased if its bias is equal to zero for all values of parameter θ.”
This greatly clarifies the situation for me, as I had thought that the bias was a global aggregate, rather than a value calculated separately for each value of the parameter being estimated (say basketball ability). Bayesian estimates are unbiased only in the former, weaker sense: their errors average to zero across the prior, not at each individual parameter value. For normal distributions, the Bayesian estimate is happy to underestimate the extremeness of extreme values in order to tighten its predictions for less extreme values. In other words, it accepts some bias in exchange for a smaller spread of errors.
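To make the two senses of bias concrete, here is a small simulation sketch in Python. The setup is my own assumption for illustration, not something from the thread: the parameter is drawn from N(0, 1), the data is the parameter plus N(0, 1) noise, and the Bayesian estimate is the posterior mean, which in this setup is half the data.

```python
import random

PRIOR_VAR = 1.0  # assumed prior: theta ~ N(0, 1)
NOISE_VAR = 1.0  # assumed noise: x = theta + N(0, 1)
SHRINK = PRIOR_VAR / (PRIOR_VAR + NOISE_VAR)  # posterior-mean weight on x


def bayes_estimate(x):
    """Posterior mean of theta given x in the normal-normal setup."""
    return SHRINK * x


def unbiased_estimate(x):
    """Report the data as-is."""
    return x


def bias_at(estimate, theta, rng, n=50_000):
    """Monte Carlo bias for one fixed parameter value theta."""
    total = 0.0
    for _ in range(n):
        x = theta + rng.gauss(0.0, NOISE_VAR ** 0.5)
        total += estimate(x) - theta
    return total / n


def aggregate(estimate, rng, n=50_000):
    """Mean error and mean squared error with theta drawn from the prior."""
    err_sum = 0.0
    sq_sum = 0.0
    for _ in range(n):
        theta = rng.gauss(0.0, PRIOR_VAR ** 0.5)
        x = theta + rng.gauss(0.0, NOISE_VAR ** 0.5)
        err = estimate(x) - theta
        err_sum += err
        sq_sum += err * err
    return err_sum / n, sq_sum / n


rng = random.Random(0)
print("bias at theta=3:", bias_at(bayes_estimate, 3.0, rng))
print("aggregate (bias, MSE), Bayes:", aggregate(bayes_estimate, rng))
print("aggregate (bias, MSE), unbiased:", aggregate(unbiased_estimate, rng))
```

Under these assumptions the Bayesian estimate should show a bias of about −θ/2 at any fixed θ, but roughly zero bias and about half the mean squared error of the unbiased estimate once θ is averaged over the prior.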