It seems like we suck at using scales “from one to ten”. Video game reviews nearly always give a 7-10 rating. Competitions with scores from judges seem to always give numbers between eight and ten, unless you crash or fall, in which case you get a five or six. If I tell someone my mood is a 5/10, they seem to think I’m having a bad day. That is, we seem to compress things into the last few numbers of the scale. Does anybody know why this happens? Possible explanations that come to mind include:
People are scoring with reference to the high end, where “nothing is wrong”, and they do not want to label things as more than two or three points worse than perfect
People are thinking in terms of grades, where 75% is a C. People think most things are not worse than a C grade (or maybe this is just another example of the pattern I’m seeing)
I’m succumbing to confirmation bias and this isn’t a real pattern
No, this is definitely a real pattern. YouTube switched from a five-star rating system to a like/dislike system when they noticed this exact pattern, and video games are notorious for ratings inflation.
Partial explanation: we interpret these scales as going from worst possible to best possible, and
games that get as far as being on sale and getting reviews are usually at least pretty good because otherwise there’d be no point selling them and no point reviewing them
people entering competitions are usually at least pretty good because otherwise they wouldn’t be there
a typical day is actually quite a bit closer to best possible than worst possible, because there are so many at-least-kinda-plausible ways for it to go badly
One reason why this is only a partial explanation is that “possible” obviously really means something like “at least semi-plausible” and what’s at least semi-plausible depends on context and whim. But, e.g., suppose we take it to mean something like: take past history, discard outliers at both ends, and expand the range slightly. Then I bet what you find is that
most games that go on sale and attract enough attention to get reviewed are broadly of comparable quality
but a non-negligible fraction are quite a lot worse because of some serious failing in design or management or something
most performances in competitions at a given level are broadly of comparable quality
but a non-negligible fraction are quite a lot worse because the competitor made a mistake of some kind
most of a given person’s days are roughly equally satisfactory
but a non-negligible fraction are quite a lot worse because of illness, work stress, argument with a family member, etc.
so that in order for a scale to be able to cover (say) 99% of cases it needs to extend quite a bit further downward than upward relative to the median case.
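To make the asymmetry concrete, here’s a minimal simulation sketch in Python (the reversed-lognormal quality distribution and all of its parameters are made-up assumptions, purely for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical quality scores: most cases are broadly comparable, with a
# long left tail of cases that went badly wrong (a reversed lognormal).
quality = 100 - rng.lognormal(mean=2.0, sigma=1.0, size=100_000)

# Stretch a 1-10 scale to span the central 99% of cases.
lo, hi = np.percentile(quality, [0.5, 99.5])
rating = 1 + 9 * (np.clip(quality, lo, hi) - lo) / (hi - lo)

print(f"median rating: {np.median(rating):.1f}")  # lands around 9 out of 10
```

The median case scores around 9 not because it is near-perfect, but because the scale has to reserve most of its range for the rare disasters below it.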
Think about it in terms of probability space. If something is basically functional, then there are a near-infinite number of ways for it to be worse, but a finite number of ways for it to get better.
http://xkcd.com/883/
RottenTomatoes has much broader ratings. The current box office hits range from 7% to 94%. This is because they aggregate binary “positive” and “negative” reviews. As jaime2000 notes, YouTube switched to a similar rating system, and it seems to keep ratings sensitive across the whole range.
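For concreteness, here’s a minimal sketch of that kind of binary aggregation (the review verdicts are made up):

```python
# Rotten Tomatoes-style score: each review is collapsed to a binary
# positive/negative verdict, and the score is the share of positives.
reviews = [True, True, False, True, True, False, True, True]  # hypothetical

score = 100 * sum(reviews) / len(reviews)
print(f"{score:.0f}% positive")  # 75% positive
```

Because each reviewer contributes only a sign, not a magnitude, any compression in individual reviewers’ scales washes out of the aggregate.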
I don’t think it’s the grades thing. Belgium doesn’t use letter-grading and still succumbs to the problem you mentioned in areas outside the classroom.
What do they use instead?
Points out of a maximum. The teacher is supposed to decide in advance how many points a test will be worth (5, 10, 20, and 25 being common options, but I’ve also had tests where I scored 17.26 out of 27) and then how many points each question will be worth. You need to get half of the maximum or more for a passing grade.
That’s in high school. In university everything is scored out of a maximum of 20 points.
You may find the work of the authors of http://papers.ssrn.com/sol3/papers.cfm?abstract_id=2369332 interesting.
That’s not an explanation, just a symptom of the problem. People of mediocre talent and people of high talent both get an A; that’s part of the reason why we have to use standardized tests with a higher ceiling.
My intuition is that the top few notches are satisficing, whereas all lower ratings are varying degrees of non-satisficing. The degree to which everything tends to cluster at the top represents the degree to which everything is satisfactory for practical purposes. In situations where the majority of the rated things are not satisfactory (like the Putnam—nothing less than a correct proof is truly satisfactory), the ratings will cluster near the bottom.
For example, compare motels to hotels. Motels always have fewer stars, because motels in general are worse. Whereas, say, video games will tend to cluster at the top because video games in general are satisfactorily fun.
Or, think Humanities vs. Engineering grades. Humanities students in general satisfy the requirements to be historians, writers, or liberal-arts-educated white-collar workers more than Engineering students satisfy the requirements to be engineers.
This is what I was trying to convey when I said it might be another example of the problem.
I think it’s reasonable, in many contexts, to say that achieving 75% of the highest possible score on an exam should earn you what most people think of as a C grade (that is, good enough to proceed with the next part of your education, but not good enough to be competitive).
I would say that games are different. There is not, as far as I know, a quantitative rubric for scoring a game. A 6/10 rating on a game does not indicate that the game meets 60% of the requirements for a perfect game. It really just means that it’s similar in quality to other games that have received the same score, and usually a 6/10 game is pretty lousy. I found a histogram of scores on metacritic:
http://www.giantbomb.com/profile/dry_carton/blog/metacritic-score-distribution-graphs/82409/
The peak of the distribution seems to be around 80%, while I’d eyeball the median to be around 70-75%. There is a long tail of bad games. You may be right that this distribution does, in some sense, reflect the actual distribution of game quality. My complaint is that this scoring system is good at resolving bad games from truly awful games from comically terrible games, but it is bad at resolving a good game from a mediocre game.
What I think it should be is a percentile-based score, like the decile system Lumifer describes in a comment further down the thread.
Then again, maybe it’s difficult to discern a difference in quality between a 60th percentile game and an 80th percentile game.
Oh right, I didn’t read carefully. Sorry.
I’ve noticed the same thing. Part of it might be that reviewers are reluctant to alienate fans of [thing being reviewed]. Another explanation is that they are intuitively norming against a wider range of things than they actually review. For example, I was buying a smartphone recently, and a lot of lower-end devices I was considering had few reviews, but famous high-end brands (like the iPhone, Galaxy S, etc.) are reviewed by pretty much everyone.
Playing devil’s advocate, it might be that there are more perceivable degrees of badness/more ways to fail than there are of goodness, so we need a wider range of numbers to describe and fairly rank the failures.
Well, here is an article by Megan McArdle talking about how insider-outsider dynamics can lead to this kind of rank inflation.
Math competitions often have the opposite problem. The Putnam competition, for example, often has a median score of 0 or 1 out of 120.
I’m not sure this is a good thing. Participating in a math competition and getting 0 points is pretty discouraging, in a field where self-esteem is already an issue.
Interestingly enough, the scores on individual questions are extremely bimodal. They’re theoretically out of 10 but the numbers between 3 and 7 are never used.
In medicine we try to make people rate their symptoms, like pain, from one to ten. It’s pretty much never under 5. Of course there’s a selection effect, and people don’t like to look like whiners, but I’m not convinced these fully explain the situation.
In Finland the lowest grade you can get from primary education through high school is 4 (on a 4-to-10 scale), so that probably affects the situation too.
How do you then interpret patients’ responses? Do you compare only the responses of the same person at different times, or between persons (or to guide initial treatment)? Do you have a reference scale that translates self-reported pain to something with an objective referent?
Yes, we compare mainly within the same person; there’s too much variation between persons. I also think there’s variation between types of pain and variation depending on whether there are other symptoms. There are no objective specific referents, but people who are in actual serious pain usually look like it: they are tachycardic, hypertensive, aggressive, sweating, writhing or very still, depending on what type of pain we’re talking about. Real pain is also aggravated by relevant manual examinations.
This is actually what initially got me thinking about this. I read a half-satire thing about people misusing pain scales. Since my only source for the claim that people do this was a somewhat satirical article, I didn’t bring it up initially.
I was surprised when I heard that people do this, because I figured most people getting asked that question aren’t in anywhere near as much pain as they could be, and they don’t have much to gain by inflating their answer. When I’ve been asked to give an answer on the pain scale, I’ve almost always felt like I’m much closer to no pain than to “the worst pain I can imagine” (which is what I was told a ten is), and I can imagine being in such awful pain that I couldn’t answer the question. I think I answered seven one time when I had a bone sticking through my skin (which actually hurt less than I might have thought).
Maybe they think that by inflating their answer they gain, on the margin, better / more intensive / more prompt medical service. Especially in an ER setting where they may intuit themselves to be competing against other patients being triaged and asked the same question, they might perceive themselves (consciously or not) to be in an arms race where the person who claims to be experiencing the most pain gets treated first.
This is exactly why in my family we use a scale from -2 to +2. A 0 really does feel like average in a way that a 5-6/10 or a 3/5 doesn’t.
I tried to swap out the 1-10 rating for a z-score rating in my own conversations. It failed because my social circles weren’t familiar with the normal bell curve.
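For anyone unfamiliar: a z-score expresses how far an item sits from the group average, in units of standard deviation. A minimal sketch, with made-up scores:

```python
import statistics

# Hypothetical history of raw scores for items in some group.
history = [6.0, 7.5, 7.0, 8.0, 5.5, 7.0, 6.5, 7.5, 8.5, 7.0]

def z_score_rating(x, history):
    """How many standard deviations x sits above the group mean."""
    mu = statistics.mean(history)
    sigma = statistics.stdev(history)
    return (x - mu) / sigma

print(f"{z_score_rating(9.0, history):+.1f}")  # +2.2: unusually good
```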
If you wanted to maximize the informational content of your ratings, wouldn’t you try to mimic a uniform distribution?
The intent was to communicate one piece of information without confusion: where on the measurement spectrum the item fits relative to others in its group. As opposed to delivering as much information as possible, for which there are more nuanced systems.
Most things I am rating do not have a uniform distribution; I tried to follow a normal distribution because it would fit the great majority of cases. We lose information and make assumptions when we measure data against the wrong distribution. (And did you mean fit to uniform by volume or by value? That was another source of confusion.)
As mentioned, this method did fail. I changed my methods to saying ‘better than 90% of the items in its grouping’ and had moderate success. While that solves the uniform/normal/chi-squared distribution problem, it is still too long-winded for my tastes.
The distribution of your ratings does not need to follow the distribution of what you are rating. For maximum information your (integer) rating should point to a quantile—e.g. if you’re rating on a 1-10 scale your rating should match the decile into which the thing being rated falls. And if your ratings correspond to quantiles, the ratings themselves are uniformly distributed.
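A minimal sketch of that decile scheme (the history and scores are made up): the 1-10 rating is just the decile of the item’s rank among everything previously rated in its group.

```python
from bisect import bisect_right

def decile_rating(x, history):
    """Map a raw score x to 1-10 by its decile among past raw scores."""
    ranked = sorted(history)
    below = bisect_right(ranked, x)  # how many past items score <= x
    return min(10, 1 + 10 * below // len(ranked))

history = [55, 60, 62, 64, 66, 68, 70, 72, 75, 80]  # hypothetical raw scores
print(decile_rating(69, history))  # 7: beats six of the ten past items
```

If every rating is produced this way, the ratings themselves come out approximately uniformly distributed, which is the point about maximizing information.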
We have different goals. I want my rating to reflect the item’s relative position in its group; you want a rating to reflect the item’s value independent of the group.
Is this accurate?
Doesn’t seem so. If you rate by quintiles your rating effectively indicates the rank of the bucket to which the thing-being-rated belongs. This reflects “the item’s relative position in its group”.
If you want your rating to reflect not a rank but something external, you can set up a variety of systems, but I would expect that for max information your rating would have to point to a quintile of that external measure of the “value independent of the group”.
Trying to stab at the heart of the issue: I want the distribution of the ratings to follow the distribution of the rated because when looking at the group this provides an additional piece of information.
Well, at this point the issue becomes who’s looking at your rating. This “additional piece of information” exists only for people who have a sufficiently large sample of your previous ratings so they understand where the latest rating fits in the overall shape of all your ratings.
Consider this example: I come up to you and ask “So, how was the movie?”. You answer “I give it a 6 out of 10”. Fine. I have some vague idea of what you mean. Now we wave a magic wand and bifurcate reality.
In branch 1 you then add “The distribution of my ratings follows the distribution of movie quality, savvy?” and let’s say I’m sufficiently statistically savvy to understand that. But… does it help me? I don’t know the distribution of movie quality. It’s probably bell-shaped, maybe, but not quite normal if only because it has to be bounded; I have no idea if it’s skewed; etc.
In branch 2 you then add “The rating of 6 means I rate the movie to be in the sixth decile”. Ah, that’s much better. I now know that out of 10 movies that you’ve seen, about five were probably worse and four were probably better. That, to me, is a more useful piece of information.
I understand and concede to the better logic. This provides greater insight into why my original attempt to use these ratings failed.
Quite often the spread within the top 10 percent is larger than the spread among the people between the 45th and 55th percentiles.
IQ scales have more people in the middle than on the edges.
As far as I remember, IQ scores are normalized ranks, so to answer the question of which 10% is “wider” you need to specify by which measure.
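To make that concrete, assuming the usual mean-100, SD-15 normalization:

```python
from scipy.stats import norm

# IQ scores are conventionally normalized to mean 100, SD 15.
lo, hi = norm.ppf([0.45, 0.55], loc=100, scale=15)
top = norm.ppf(0.90, loc=100, scale=15)

print(f"middle decile (45th-55th percentile): IQ {lo:.0f}-{hi:.0f}")  # about 98-102
print(f"top decile: IQ {top:.0f} and up")  # about 119, unbounded above
```

Measured in IQ points, the top decile is far “wider” than the middle one; measured in head-count, the two are exactly the same size by construction.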
I think it’s the C thing. I have no evidence for this.