I can test my hypothesis, you know. With a script that would be the first to randomly upvote or downvote comments. Willing to make any bet that the final score (say, five days later) won't be affected by more than 1 vote point?
[I probably shouldn't discuss the experiment here, but I kind of doubt you guys can precisely neutralize that kind of bias other than by hiding the vote from yourself before voting. You'll either strongly under-compensate or over-compensate.]
Yeah, I know. No idea what percentage of the population has this enabled though; I'd guess it's not large. Ratings are good for discouraging trolling, but people end up caring too much about them.
I can test my hypothesis, you know. With a script that would be the first to randomly upvote or downvote comments. Willing to make any bet that the final score (say, five days later) won't be affected by more than 1 vote point?
Yes. (I’d also support banning you for botting. “Experiments” are an insufficient excuse.)
I'm pretty sure I can get that experiment approved. Let's form opinions about something testable, to calibrate, then test. Write down how likely you think it is that the effect is greater than 1 point (i.e. whether, after I undo the vote in 5 days, the score remains correlated with the bot's action). Edit: btw, we need this for a proper prior for Bayesian reasoning anyway.
Edit: experiment specification: a randomly chosen recently posted comment or post is upvoted or downvoted by 1 voting point. In 5 days, the vote is removed. The average final score of items that were upvoted is compared to the average final score of items that were downvoted: if the scoring is self-reinforcing, the correlation should be positive; if the community instead tries to give a 'fair' score by steering the rating toward the value it deems fair, the correlation should be negative. The priors for positive, neutral, and negative correlation are written down before the test. The test is conducted over a random week (edit: not sure how many data points a week's worth of comments would yield, though; it may require a longer or shorter period).
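To make that concrete, here's a minimal sketch of the bookkeeping I have in mind. It is not site-specific code: `vote()` and `final_score()` are hypothetical stand-ins for whatever API access an approved run would actually use; here they act on an in-memory dict so the sketch runs end to end.

```python
import random
import statistics

scores = {}  # item id -> current score (stand-in for the real site state)

def vote(item, direction):
    """Hypothetical stand-in for casting a +1/-1 vote via the site."""
    scores[item] = scores.get(item, 0) + direction

def final_score(item):
    """Hypothetical stand-in for reading an item's current score."""
    return scores.get(item, 0)

def run_experiment(items):
    """Randomly split items into +1 and -1 treatment arms and apply the vote."""
    random.shuffle(items)
    half = len(items) // 2
    arms = [(i, +1) for i in items[:half]] + [(i, -1) for i in items[half:]]
    for item, direction in arms:
        vote(item, direction)
    return arms

def analyze(arms):
    """Five days later: remove the treatment vote, then compare arm means."""
    for item, direction in arms:
        vote(item, -direction)  # un-vote, leaving only the organic score
    up = [final_score(i) for i, d in arms if d == +1]
    down = [final_score(i) for i, d in arms if d == -1]
    # Positive difference: self-reinforcing scores; negative difference:
    # the community steered ratings back toward whatever it deemed fair.
    return statistics.mean(up) - statistics.mean(down)

arms = run_experiment(list(range(100)))
print(analyze(arms))  # 0.0 here, since no organic votes arrive in the sketch
```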
I'm quite curious myself as to what the outcome would be. I do expect a positive correlation, based on the well-known, well-studied phenomenon of 'priming', but my confidence is not very high. And of course I am not going to do it without approval, as that would be unethical.
You would need a great many results to get this accurate to within one karma point, I would think. And since your hypothesis is about posts, you shouldn't mix comments with posts; the voting patterns on the two are far different. So this would take a while. Not that that's a huge problem. I support the idea.
I think both comments and posts should be evaluated (separately), but I agree that the voting patterns are very different.
Regarding how long it'd take, that depends on the strength of the effect. What I think is the strongest mechanism is that anything negative gets read much more critically: an upvoted post's assertions will be read, and seen in a positive light if at all plausible, while downvoted assertions are likely to be immediately challenged ('does this compel me to believe?' style). That should be the general reflex, since it's just being a good Bayesian reasoner, but it leads to circular reasoning problems when everyone reasons this way together.
If I were to posit that you guys tend to apply Bayesian reasoning in practice when reading posts, the vote spiral would follow as a testable hypothesis. That's just how such things work in networks: Bayesian reasoning requires tracking where the data originates.
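As a toy illustration of that spiral (a model of my own, not measured data): suppose each reader's vote is nudged by the sign of the running score, so a positive score primes a charitable read and a negative score a critical one. Even a single initial vote can then leave a lasting mark. The `quality` and `bias` parameters below are made-up knobs, not estimates.

```python
import random

def average_final_score(initial_nudge, quality=0.0, bias=0.1,
                        readers=200, trials=2000):
    """Mean final score after removing the initial nudge, as in the experiment."""
    total = 0.0
    for _ in range(trials):
        score = initial_nudge
        for _ in range(readers):
            sign = (score > 0) - (score < 0)    # -1, 0, or +1
            p_up = 0.5 + quality + bias * sign  # score-primed upvote chance
            score += 1 if random.random() < p_up else -1
        total += score - initial_nudge          # undo the nudge at the end
    return total / trials

# A persistent gap between the two arms indicates self-reinforcement.
print(average_final_score(+1), average_final_score(-1))
```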
I think halo effects are really to blame here—if I see something downvoted, I’m far more likely to read it, because it’s more of an exception to the norm. If it’s bad, I may downvote it further. I’m sure this is the case for many.
This is the primary reason I read this post. But I did not downvote this.
I’m actually finding this hypothesis more interesting than the one in your OP (partly because it looks more testable, funnily enough). Bash out a script to watch LW and vote on things as they appear, leave it to generate data as long as one likes, then hey presto. Tiny bit tempted to do it myself, approval or not.
The test is conducted over a random week (edit: not sure how many data points a week's worth of comments would yield, though; it may require a longer or shorter period).
The sample size you need to detect an effect depends on that effect's size. So far, so obvious, so I did a quick and dirty power analysis to get some numbers, although for posts in the discussion section rather than comments. (Posts on Main are too infrequent, and I'd expect a smaller effect for comments, so comments would need a bigger sample.) If anyone cares, I can throw up my code.
If my numbers are right and you took a sample of 100 upvoted and 100 downvoted discussion posts, the bootstrap confidence interval for the effect size would be 3.7-6.8 points wide. Even with a sample of 400 upvoted and 400 downvoted posts (and that's 3-4 months' worth of discussion posts), it'd be 2.2-3.0 points wide. So unless the priming effect is strong (at least 2-4 points), a week of data wouldn't be conclusive, at least not for posts. Comments might be more doable, though.
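For concreteness, here is a sketch of the kind of quick and dirty check I mean. The score distribution is an assumed lognormal stand-in rather than real LW data, so the widths it prints won't match my numbers; a real analysis would resample actual discussion-post scores.

```python
import random
import statistics

def bootstrap_ci_width(n_per_arm, true_effect=0.0, reps=1000):
    """Width of a 95% bootstrap CI for the difference in mean scores.

    Scores are drawn from an assumed heavy-tailed stand-in distribution;
    swap in real post scores to reproduce the actual analysis.
    """
    up = [random.lognormvariate(1.5, 1.0) + true_effect
          for _ in range(n_per_arm)]
    down = [random.lognormvariate(1.5, 1.0) for _ in range(n_per_arm)]
    diffs = []
    for _ in range(reps):
        bu = [random.choice(up) for _ in range(n_per_arm)]    # resample arm 1
        bd = [random.choice(down) for _ in range(n_per_arm)]  # resample arm 2
        diffs.append(statistics.mean(bu) - statistics.mean(bd))
    diffs.sort()
    lo, hi = diffs[int(0.025 * reps)], diffs[int(0.975 * reps)]
    return hi - lo

for n in (100, 400):
    print(n, round(bootstrap_ci_width(n), 1))
```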
Yeah, that'll take a while. We'll see about testing. The proposed effect could be strong if each subsequent voter is affected by the previous votes, so that the initial disturbance does not 'dissolve' into the larger number. But I kind of doubt it. I don't care a whole ton about votes; I generally take them as a measure of how clearly a point was made, but any priming would definitely make them less useful as a gauge of clarity. Also, there's apparently voting via the recent comments thread; tbh I nearly forgot you can read comments expanded, as it doesn't seem very interesting when the majority of comments are brief and meaningless outside their context.
Bash out a script to watch LW and vote on things as they appear, leave it to generate data as long as one likes, then hey presto. Tiny bit tempted to do it myself, approval or not.
I'm becoming increasingly tempted to submit automation-detection scripts to the LessWrong codebase.