To anyone who agrees that an upvote doesn’t signal much: please reconsider to what extent the value of upvoting is to communicate to the author vs. to other readers. One would assume that, eventually, we’d like LW to have a healthy population of people who haven’t necessarily read OB for years and may not be familiar with Eliezer’s previous work, and so won’t realize that a higher standard is being applied.
Furthermore, I’m pretty sure that Eliezer is capable of reading the comments and comparing scores between articles, so holding him to a higher relative standard isn’t actually providing substantial additional feedback to him. Based on this, it seems to me that rating Eliezer differently than you would other authors is strictly suboptimal.
I can imagine an argument that holding him to a higher standard does provide more information, because he gets more information the closer the probability of an upvote is to 50%. However, in practice I suspect this is an argument in favour of more upvotes, and in truth I’d be surprised if there isn’t a name for the cognitive bias of judging a thing against a narrower category even when you’re asked to judge it against a wider one.
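To make the information-theoretic part of that argument concrete, here is an illustrative sketch (the function name is mine, not anything from the thread): a binary vote carries the most information, in the Shannon sense, exactly when an upvote is a coin flip.

```python
import math

def vote_information_bits(p):
    """Expected information (Shannon entropy, in bits) carried by a
    yes/no vote whose probability of being an upvote is p."""
    if p <= 0.0 or p >= 1.0:
        return 0.0  # a certain outcome carries no information
    return -(p * math.log2(p) + (1 - p) * math.log2(1 - p))
```

The entropy peaks at exactly 1 bit when p = 0.5 and falls off symmetrically toward 0 as votes become predictable in either direction.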
Your idea about getting closer to a 50% probability of an upvote in order to get more information identifies a weakness in the voting system. That weakness doesn’t matter as much for comments, but I think the current system is inadequate for articles.
Much better than having to put every article into one of three categories—up, down, or neither—would be to have a slider that starts at 0 and can take values between −100 and +100. What we have now is equivalent to something like having −100 to −33.3 all mapped to ‘down’, −33.3 to +33.3 all mapped to neither, and +33.3 to +100 all mapped to ‘up’. Obviously, lots of information is being discarded by design.
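The bucketing described above can be sketched directly (the function name is mine; this is just an illustration of the mapping, not an actual LW implementation):

```python
def quantize_vote(score):
    """Collapse a -100..+100 preference into the current three-way vote,
    using the bucket boundaries described above (thirds of the range)."""
    if score < -100 / 3:
        return "down"
    if score > 100 / 3:
        return "up"
    return "neither"
```

Every preference within a bucket is indistinguishable after quantization, which is the information loss being complained about.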
Another problem is that votes aren’t normalized with respect to the user who cast them. An upvote from a user who rarely votes up should be worth more than one from someone who upvotes everything.
Also, there could be distorting effects due to different subsets of readers preferentially reading different subsets of articles. If readers coming to LW without having read OB tend to vote differently (which is plausible, since OB had no voting for years, so its readers may think of not voting as the default, with a vote reserved for special emphasis), and they tend to read different sorts of articles (simpler articles on easier topics), the articles they read will appear to be wildly more popular.
The slider is an interesting notion. It adds user-interface complexity, and may have incentive problems for users who desire to exert control, but potentially garners a substantially more useful form of information.
At the moment the current score is a strong influence on how I vote on comments: I vote to move the score to the value I’d like it to have. This is somewhat unstable; directly specifying a personal score and taking a median would be less problematic.
The problem of the desire to exert control makes me think that a better option is giving a limited number of double/super/special votes that users can ration out as they see fit. Extra votes that actually mean something.
That’s a good idea. Though I didn’t say it originally, when I mentioned normalizing a vote with respect to the user who cast it, I meant not only that it should be normalized against that user’s average rating but also against how much the user votes in general: users who rate everything would then have less influence per vote than users who vote less frequently. If that were the case, then people who prefer to ration their votes, using them only for things they feel very strongly about (or have thought carefully about), would no longer have much less influence than frequent voters on what is popular and on the direction of the site, as they currently do.
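One hypothetical way to combine both normalizations (entirely my own illustration; neither the formula nor the smoothing constants come from the thread): weight each upvote by how surprising it is coming from that user, and discount users who vote on nearly everything they view.

```python
import math

def upvote_weight(upvotes_cast, votes_cast, items_viewed):
    """Hypothetical per-upvote weight.

    -log2(p_up): an upvote from a user who rarely upvotes carries more
    information. (1 - p_vote): a user who votes on almost everything
    they view is discounted. Both rates are Laplace-smoothed so brand-new
    users don't cause division by zero.
    """
    p_up = (upvotes_cast + 1) / (votes_cast + 2)      # chance a cast vote is an upvote
    p_vote = (votes_cast + 1) / (items_viewed + 2)    # chance the user votes at all
    return -math.log2(p_up) * (1 - p_vote)
```

Under this scheme a selective upvoter’s vote outweighs that of someone who upvotes everything, and someone who votes rarely outweighs someone who votes on almost everything they see.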
Having a slider requires more sophisticated data analysis, because different people use different rating scales. Typically psychologists use a multi-point scale, then run a Rasch analysis (a one-parameter form of item response theory) on the data.
I would say from my experience that a 5-point scale is not big enough; almost everything gets 3 or 4 points, except from the people (about 2% of raters) who binarize the scale by giving everything either a 1 or a 5. Also, people will not use negative ratings, so don’t try to center them on zero. People (or at least Americans) just can’t say “zero is average”.
My instinct would be to have the numbers not be visible to the user. You just have a rectangle with two colors, initially red on the right side and green on the left side. Clicking anywhere inside the rectangle changes the dividing line to be at that location. So clicking 90% of the way towards the right would make the left 90% be green and the right 10% be red. The backend would know that it corresponds to whatever number it corresponds to (+80 according to the scheme I gave earlier), but the user just has a qualitative feel for how much of the mass they’ve allocated to the good (green) color and how much to the bad (red) color.
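The click-position-to-score mapping described above is linear; a minimal sketch (the function name is mine):

```python
def click_to_score(fraction):
    """Map a click at `fraction` of the way across the rating rectangle
    (0.0 = left edge, 1.0 = right edge) onto the -100..+100 scale."""
    return round(200 * fraction - 100)
```

A click 90% of the way to the right gives +80, matching the scheme described earlier, while a click at the midpoint gives 0.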
Two things you could do about that:

1. As you hover over the rating button, the text below changes to indicate what that rating would mean: zero stars means “don’t bother”, one star means “good enough to stay visible”, two stars means “above average”, and so on.
2. Allow half stars for more information.
We would use percentile scores to make the best use of the votes of binarizing voters without giving them more influence than high-information voters.
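A sketch of what within-user percentile scoring might look like (my own illustration, using midranks for ties): each rating is ranked only against that same user’s other ratings, so a binarizer’s 1s and 5s map to the same percentiles as a cautious rater’s 3s and 4s.

```python
def within_user_percentiles(ratings):
    """Convert one user's raw ratings to percentile ranks within that
    user's own rating history, using midranks for tied values."""
    n = len(ratings)
    return [
        (sum(o < r for o in ratings) + 0.5 * sum(o == r for o in ratings)) / n
        for r in ratings
    ]
```

A binarizer rating [1, 5, 1, 5] and a cautious rater rating [3, 4, 3, 4] both come out as [0.25, 0.75, 0.25, 0.75]: the same per-vote influence, regardless of how much of the raw scale each one uses.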
Amazon ranks stuff between ★☆☆☆☆ and ★★★★★ with a simple JavaScript mouse hover / mouse click to set the value. LW could copy that pretty easily. I suggest that 5 categories would be enough.

See PhilGoetz’s point above: “almost everything gets 3 or 4 points”.
> I can imagine an argument that holding him to a higher standard does provide more information, because he gets more information the closer the probability of an upvote is to 50%.
Well, let’s look.
The top-scoring articles seem to be rated in the 50-60 range, indicating that at least that many users have voted. Eliezer’s articles seem to tend to be rated around 10-20, so that’s probably closer to a 30% chance of upvoting. As far as I could tell, none of the top three rated posts are by Eliezer. Yvain seems to be the most consistently highly rated poster overall, with typical scores seemingly ranging from 20-40. Since Yvain roughly mimics Eliezer’s writing style and content, we could probably expect an unbiased rating of Eliezer’s posts to be similar. All around, as a very rough approximation, we can say that Eliezer’s posts are getting an upvote penalty of 50%.
Take all that as you will.
> I’d be surprised if there isn’t a name for the cognitive bias of judging a thing against a narrower category even when you’re asked to judge it against a wider one.
I’d imagine there is a name. Whatever it is, I consistently fall prey to it with most intuitive self-evaluations (comparing myself mostly to groups of which I am not a representative member).
One thing you don’t mention is that Yvain’s posts and writing style are simpler and easier to comprehend than Eliezer’s. Yvain has also presented some posts on fairly basic topics that are probably familiar to most longtime OB readers but are new to readers just joining LW. [EDIT: I retract the last point. I was thinking of the ‘priming’ post and that there were others like this on basic heuristics and biases topics, but that seems like the only one.]
That is not to say that there’s not also some bias. I think many of us probably consciously or unconsciously hold Eliezer to much higher standards than anybody else.
All the recent talk about cults and cult-like behavior has probably made some people more hesitant to vote up anything by Eliezer as well.
> One thing you don’t mention is that Yvain’s posts and writing style are simpler and easier to comprehend than Eliezer’s.
Not to be contrary, but I actually find Eliezer’s posts easier to comprehend, partly due to better structure and pacing, partly due to typically slightly higher informational content holding my attention better. I suspect this is mostly a function of Eliezer having more practice, and of my own short attention span, heh.
I was going to say that I expect the cultishness discussion to be more directly relevant to the upvoting penalty, but looking quickly at post scores doesn’t seem to support that theory.
> I’d be surprised if there isn’t a name for the cognitive bias of judging a thing against a narrower category even when you’re asked to judge it against a wider one.
It sounds like a form of availability bias, but I agree it needs a more precise term.