I have a maths question. Suppose that we are scoring n individuals on their performance in an area where there is significant uncertainty. We are categorizing them into a small number of categories, say 4. Effectively we’re thereby saying that, for the purposes of our scoring, everyone with the same score performs equally well. Suppose that we say this means that all individuals with a given score get assigned the mean actual performance of the individuals with that score. For instance, if there were three people who got the highest score, and their performances equal 8, 12 and 13 units, the assigned performance is 11 units.
Now suppose that we want our scoring system to minimise information loss, so that the assigned performance is on average as close as possible to the actual performance. The question is: how do we achieve this? Specifically, how large a proportion of all individuals should fall into each category, and how does that depend on the performance distribution?
It would seem that if performance increases linearly as we go from low to high performers, then all categories should have the same number of individuals, whereas if the increase is exponential, then the higher categories should contain fewer individuals. Is there a theorem that proves this and specifies exactly how large the categories should be for a given shape of the curve? Thanks.
Your problem is called a clustering problem. First of all, you need to decide how you measure your error (information loss, as you call it). Typical error norms are l1 (the sum of absolute errors), l2 (the sum of squared errors, which penalizes larger errors more) and l-infinity (the maximum error).
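To make those norms concrete, here is a tiny worked computation using the numbers from the question (actual performances 8, 12 and 13, all assigned the category mean of 11). It is only an illustrative sketch; the variable names are mine, not anything from the thread.

```python
# Illustrative only: the three error measures mentioned above, applied to the
# example from the question (actual performances 8, 12, 13; assigned mean 11).
actual = [8, 12, 13]
assigned = sum(actual) / len(actual)          # 11.0, the category mean

errors = [abs(x - assigned) for x in actual]  # per-individual deviations

l1 = sum(errors)                  # 3 + 1 + 2 = 6
l2 = sum(e ** 2 for e in errors)  # 9 + 1 + 4 = 14
linf = max(errors)                # 3

print(l1, l2, linf)
```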
Once you select a norm, there always exists a partition that minimizes your error, and to find it there are a bunch of heuristic algorithms, e.g. k-means clustering. Luckily, since your data is one-dimensional and you have very few categories, you can just brute-force it: for 4 categories you need to place 3 boundaries, and naively trying all possible positions means only on the order of n^3 combinations to check.
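For anyone who wants to see what that brute force looks like, here is a minimal sketch assuming the l2 (squared-error) loss and 4 categories. The data values and function names are made up for illustration; they are not from the thread.

```python
# Minimal brute-force sketch: try every placement of k-1 boundaries in the
# sorted data and keep the partition with the smallest total squared error.
from itertools import combinations

def sse(group):
    """Sum of squared deviations of a group from its own mean."""
    m = sum(group) / len(group)
    return sum((x - m) ** 2 for x in group)

def best_partition(values, k=4):
    xs = sorted(values)
    n = len(xs)
    best_groups, best_err = None, float("inf")
    # A boundary is a cut position between consecutive sorted values.
    for cuts in combinations(range(1, n), k - 1):
        bounds = (0,) + cuts + (n,)
        groups = [xs[a:b] for a, b in zip(bounds, bounds[1:])]
        err = sum(sse(g) for g in groups)
        if err < best_err:
            best_groups, best_err = groups, err
    return best_groups, best_err

# Example run with made-up performances, including the 8, 12, 13 from the question.
groups, err = best_partition([1, 2, 3, 5, 8, 12, 13], k=4)
print(groups, err)   # each individual would be assigned the mean of their group
```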
Hope this helps.
Thanks a lot! Yes, super-useful.
A possibly relevant paper for anyone wanting to do this in one dimension to a dataset large enough that they care about efficiency.
If I’m understanding this correctly, it sounds like you’re performing k-means clustering.
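If you would rather reach for an off-the-shelf tool than brute-force it, 1-D k-means does the same job (up to local optima). Here is a sketch using scikit-learn, which is my own choice of library rather than anything mentioned in the thread, with made-up values:

```python
import numpy as np
from sklearn.cluster import KMeans

# Made-up performance values; k-means expects a 2-D array, hence the reshape.
values = np.array([1, 2, 3, 5, 8, 12, 13], dtype=float).reshape(-1, 1)
km = KMeans(n_clusters=4, n_init=10, random_state=0).fit(values)

# Each individual's assigned performance is the mean of their cluster,
# which is exactly the scheme described in the question.
assigned = km.cluster_centers_[km.labels_].ravel()
print(assigned)
```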
You’d minimize information loss by giving the actual scores.
The argument is ‘grading on the curve’ vs ‘ABCDEF’. The first way is fair, but it promotes extreme competition to be in the top 1% (or ‘The Senior Wrangler’, as we used to call it), which may not be desirable. The second way hands out random bonuses and penalties to individuals near the arbitrary boundaries.
I was in the top 25% of my year in terms of marks, I believe. I was a ‘Senior Optime’, or ‘got a second’, a class that stretched from roughly the 25th to the 75th percentile.
Not bitter, or anything.