I would expect the survey fillers to just turk out random checkmarks as fast as they can without reading the questions.
I’d expect this too. But my political science professor says that, surprisingly, this is not the case to the extent you would think (I know this is an argument from authority, but I haven’t bothered to ask him for citations yet, and I do trust him).
Moreover, this is something that can be controlled for.
How would you practically go about controlling for it?
A fourth way: include a reading passage and then, on a separate page, a question to test whether they read the passage.
Another thing you can do is put a timer in the survey that keeps track of how much time they spend on each question.
Here’s one example:
Q12: Thinking about the candidate that you read about, how relevant do you think the following considerations are to their judgment of right and wrong? (Pick a number on the 1-7 scale.)
(a) Whether or not someone suffered emotionally. Not At All Relevant 1 2 3 4 5 6 7 Extremely Relevant
(b) Whether or not someone acted unfairly. Not At All Relevant 1 2 3 4 5 6 7 Extremely Relevant
(c) Whether or not someone’s action showed love for his or her country. Not At All Relevant 1 2 3 4 5 6 7 Extremely Relevant
(d) Whether or not someone did something disgusting. Not At All Relevant 1 2 3 4 5 6 7 Extremely Relevant
(e) Whether or not someone enjoyed apple juice. Not At All Relevant 1 2 3 4 5 6 7 Extremely Relevant
(f) Whether or not someone showed a lack of respect for authority. Not At All Relevant 1 2 3 4 5 6 7 Extremely Relevant
Looking at this specific example and imagining myself doing this for $1.50/hour or so (with the implication that my IQ isn’t anywhere close to three digits), I can’t possibly give true answers because the question is far too complicated and I can’t afford to spend ten minutes figuring it out, even if I honestly don’t want to “cheat”.
Well, there are two reasons why that would be the case:
1.) This question refers to a specific story that you would have read previously in the study.
2.) The formatting here is jumbled text. The format of the actual survey includes radio buttons and is much nicer.
Ah, no, let me clarify. It requires intellectual effort to untangle Q12 and understand what it is actually asking you. This is a function of the way it is formulated and has nothing to do with knowing the context or the lack of radio buttons.
It is easy for high-IQ people to untangle such questions in their heads, so they don’t pay much attention to this; it’s “easy”. It is hard for low-IQ people to do this, so unless there is an incentive for them to actually take the time, spend the effort, and understand the question, they are not going to do it.
It’s definitely a good idea to keep the questions simple and I’d plan on paying attention to that. But this question actually was used in an MTurk sample and it went ok.
Regardless, even if the question itself is bad, the general point is that this is one way you can control for whether people are clicking randomly. Another way is to have an item and its inverse (“I consider myself an optimistic person” and later “I consider myself a pessimistic person”), and a third way is to run a timer in the questionnaire.
What does “went ok” mean and how do you know it?
Let’s be more precise: this is one way you can estimate whether people (or scripts) are clicking randomly. This estimate should come with its own uncertainty (error bars, more or less), which should be folded into the overall uncertainty of the survey results.
Well, the results were consistent with the hypothesis, the distribution of responses didn’t look random, not too many people failed the “apple juice” question, and the timer data looked reasonable.
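To make the “estimate with its own error bars” idea concrete, here is a minimal sketch in Python of putting a confidence interval on the attention-check failure rate. The counts are invented for illustration, and the Wilson score interval is just one reasonable choice.

    from math import sqrt

    def wilson_interval(failures, n, z=1.96):
        """Wilson score interval for a binomial proportion (default z gives ~95%)."""
        p = failures / n
        denom = 1 + z**2 / n
        center = (p + z**2 / (2 * n)) / denom
        half = (z / denom) * sqrt(p * (1 - p) / n + z**2 / (4 * n**2))
        return center - half, center + half

    # Invented numbers: 14 of 300 respondents rated the "apple juice" item above 2.
    failures, n = 14, 300
    lo, hi = wilson_interval(failures, n)
    print(f"Estimated random-clicking rate: {failures / n:.1%} (95% CI {lo:.1%} to {hi:.1%})")

That interval, or something like it, is the extra uncertainty that would get folded into the survey’s overall error.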
~
That’s generally what I meant by “control”. But at that point, we might just be nitpicking about words.
Possibly, though I have in mind a difference in meaning or, perhaps, attitude. “Can control” implies to me that you think you can reduce this issue to irrelevance, so that it will not affect the results. “Will estimate” implies that this is another source of uncertainty: you’ll try to get a handle on it, but it will still add to the total uncertainty of the final outcome.
Well, the most obvious misinterpretations of the question will also result in people not failing the “apple juice” question.
What cut-off criteria would you use with those questions to avoid cherry-picking the data?
You check that “Whether or not someone enjoyed apple juice” is marked 1 or 2; if it isn’t, you throw out the participant. Otherwise, you keep the response.
There are a few other tactics. Another one is to have a question like “I consider myself optimistic” and then later have a question “I consider myself pessimistic” and you check to see if the answers are in an inverse relationship.
And if they are, you mark the person as bipolar :-D
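For what it’s worth, that cut-off and the timer check can be applied mechanically. Below is a minimal pandas sketch; the column names (apple_juice, total_seconds), the values, and the minimum completion time are all made up for illustration.

    import pandas as pd

    # Hypothetical responses; column names and values are invented.
    responses = pd.DataFrame({
        "worker_id":     ["A1", "A2", "A3", "A4"],
        "apple_juice":   [1, 5, 2, 1],        # 1-7 relevance rating of the filler item
        "total_seconds": [410, 95, 380, 45],  # time spent on the whole questionnaire
    })

    MIN_SECONDS = 120  # assumed floor for a plausible completion time

    keep = (responses["apple_juice"] <= 2) & (responses["total_seconds"] >= MIN_SECONDS)
    clean = responses[keep]
    print(f"Kept {len(clean)} of {len(responses)} responses.")

The thresholds (1 or 2 on the filler item, the minimum time) have to be fixed before looking at the data, otherwise the screening itself becomes a form of cherry-picking.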
Even if the controls are effective, having to discard some of the answers still drives up the cost.
Yes. But by enough to make it no longer worth doing? I don’t know.
That.
Plus, of course, there is huge selection bias. How many people with regular jobs, for example, do you think spend their evenings doing MTurk jobs?
But yes, the real issue is that you’ll have a great deal of noise in your responses and you will have serious issues trying to filter it out.
I discuss this in the “Diversity of the Sample” subsection of the “Is Mechanical Turk a Reliable Source of Data?” section.
The question is not “Is MTurk representative?” but rather “Is MTurk representative enough to be useful in answering the kinds of questions we want to answer, and quicker/cheaper than our alternative sample sources?”
The first question is “Can you trust the data coming out of MTurk surveys?”
The paper your link references is behind a paywall, but it seems likely to me that they gathered the data on the representativeness of MTurk workers through a survey of MTurk workers. Is there a reason to trust these numbers?
Which one? I can make it publicly available.
You can compare the answers to other samples.
Unless, of course, your concern is that the subjects are lying about their demographics, which is certainly possible. But then, it would be pretty amazing if this mix of lies and truths created a believable sample. And what would be the motivation to lie about demographics? Would this motivation be any higher than in other surveys? Do you doubt the demographics in non-MTurk samples?
I actually do agree this is a risk, so we’d have to (1) maybe run a study first to gauge how often MTurkers lie, perhaps using the Marlowe-Crowne Social Desirability Inventory, and/or (2) look through MTurker forums to see if people talk about lying on demographics. (One demographic that is known to be fabricated fairly often is nationality, because many MTurk tasks are restricted to Americans.)
Instead of dismissing MTurk based on expectations that it would be useless for research, I think it would be important to test it. After all, published social science has made use of MTurk samples, so we have some basis for expecting it to be at least worth testing to see if it’s legitimate.
The paper I mean is this one: http://cpx.sagepub.com/content/early/2013/01/31/2167702612469015.abstract
Yes. Or, rather, the subjects submit noise as data.
Consider, e.g. a Vietnamese teenager who knows some English and has declared himself as an American to MTurk. He’ll fill out a survey because he’ll get paid for it, but there is zero incentive for him to give true answers (and some questions like “Did you vote for Obama?” are meaningless for him). The rational thing for him to do is to put checkmarks into boxes as quickly as he can without being obvious about his answers being random.
I’ll rephrase this as “it would be useful and necessary to test it before we use MTurk samples for research”.
Here you go.
~
This is a good point. However, you would still be able to match the resulting demographics to known trends and see how reliable your sample is. Random answers should show up, either overtly in the checks or subtly through aggregate statistics.
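One concrete way to do that matching is a goodness-of-fit test of the sample’s demographic counts against benchmark shares (census figures, say). A rough Python sketch with invented numbers:

    from scipy.stats import chisquare

    # Invented sample counts by age bracket, and invented benchmark shares.
    observed = [180, 240, 150, 30]              # 18-29, 30-44, 45-64, 65+
    benchmark_shares = [0.22, 0.26, 0.34, 0.18]

    total = sum(observed)
    expected = [share * total for share in benchmark_shares]

    stat, p = chisquare(f_obs=observed, f_exp=expected)
    print(f"chi-square = {stat:.1f}, p = {p:.3g}")

A small p-value only says the sample’s mix differs from the benchmark; whether it is still “representative enough” for the question at hand is a separate judgment.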
~
Definitely.
A survey designer could ask the same question in different ways, or ask questions with mutually exclusive answers, and then throw away responses with contradictory answers. (This isn’t a perfect cure but it can give an idea of which survey responses are just random marks and which aren’t.)
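A minimal sketch of that kind of screen for one reverse-keyed pair, assuming both items use the same 1-7 agreement scale (the column names and the tolerance are made up): the two answers should roughly mirror each other, i.e. sum to about 8, so respondents whose pair is far from that get flagged and dropped.

    import pandas as pd

    # Hypothetical data: a reverse-keyed item pair on a 1-7 agreement scale.
    df = pd.DataFrame({
        "worker_id":   ["A1", "A2", "A3"],
        "optimistic":  [6, 7, 2],
        "pessimistic": [2, 7, 6],
    })

    SLACK = 2  # assumed tolerance before a pair counts as contradictory

    contradictory = (df["optimistic"] + df["pessimistic"] - 8).abs() > SLACK
    print(df.loc[contradictory, "worker_id"].tolist())  # ['A2'] in this made-up example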