I would expect the survey fillers to just turk out random checkmarks as fast as they can without reading the questions.
I’d expect this too. But my political science professor says that, surprisingly, this is not the case to the extent you would think (I know this is an argument from authority, but I haven’t bothered to ask him for citations yet, and I do trust him).
Moreover, this is something that can be controlled for.
How would you practically go about controlling for it?
A fourth way: include a reading passage and then, on a separate page, a question to test whether they read the passage.
Another thing you can do is put a timer in the survey that keeps track of how much time they spend on each question.
Here’s one example:
Q12: Thinking about the candidate that you read about, how relevant do you think the following considerations are to their judgment of right and wrong? (Pick a number on the 1-7 scale.)
(a) Whether or not someone suffered emotionally. Not At All Relevant 1 2 3 4 5 6 7 Extremely Relevant
(b) Whether or not someone acted unfairly. Not At All Relevant 1 2 3 4 5 6 7 Extremely Relevant
(c) Whether or not someone’s action showed love for his or her country. Not At All Relevant 1 2 3 4 5 6 7 Extremely Relevant
(d) Whether or not someone did something disgusting. Not At All Relevant 1 2 3 4 5 6 7 Extremely Relevant
(e) Whether or not someone enjoyed apple juice. Not At All Relevant 1 2 3 4 5 6 7 Extremely Relevant
(f) Whether or not someone showed a lack of respect for authority. Not At All Relevant 1 2 3 4 5 6 7 Extremely Relevant
Looking at this specific example and imagining myself doing this for $1.50/hour or so (with the implication that my IQ isn’t anywhere close to three digits), I can’t possibly give true answers because the question is far too complicated and I can’t afford to spend ten minutes figuring it out, even if I honestly don’t want to “cheat”.
Well, there are two reasons why that would be the case:
1.) This question refers to a specific story that you would have read previously in the study.
2.) The formatting here is jumbled text. The format of the actual survey includes radio buttons and is much nicer.
Ah, no, let me clarify. It requires intellectual effort to untangle Q12 and understand what it is actually asking you. This is a function of the way it is formulated and has nothing to do with knowing the context or the lack of radio buttons.
It is easy for high-IQ people to untangle such questions in their heads, so they don’t pay much attention to this; it’s “easy”. It is hard for low-IQ people to do this, so unless there is an incentive for them to actually take the time, spend the effort, and understand the question, they are not going to do it.
It’s definitely a good idea to keep the questions simple and I’d plan on paying attention to that. But this question actually was used in an MTurk sample and it went ok.
Regardless, even if the question itself is bad, the general point is that this is one way you can control for whether people are clicking randomly. Another way is to have an item and its inverse (“I consider myself an optimistic person” and later “I consider myself a pessimistic person”), and a third way is to run a timer in the questionnaire.
What does “went ok” mean and how do you know it?
Let’s be more precise: this is one way you can estimate whether people (or scripts) are clicking randomly. This estimate should come with its own uncertainty (error bars, more or less), which should be folded into the overall uncertainty of the survey results.
Well, the results were consistent with the hypothesis, the distribution of responses didn’t look random, not too many people failed the “apple juice” question, and the timer data looked reasonable.
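To make the “estimate with its own error bars” idea concrete, here is a minimal sketch in Python of putting a confidence interval on the attention-check failure rate. The counts are invented for illustration, and the Wilson score interval is just one reasonable choice.

    from math import sqrt

    def wilson_interval(failures, n, z=1.96):
        """Wilson score interval for a binomial proportion (default z gives ~95%)."""
        p = failures / n
        denom = 1 + z**2 / n
        center = (p + z**2 / (2 * n)) / denom
        half = (z / denom) * sqrt(p * (1 - p) / n + z**2 / (4 * n**2))
        return center - half, center + half

    # Invented numbers: 14 of 300 respondents rated the "apple juice" item above 2.
    failures, n = 14, 300
    lo, hi = wilson_interval(failures, n)
    print(f"Estimated random-clicking rate: {failures / n:.1%} (95% CI {lo:.1%} to {hi:.1%})")

That interval, or something like it, is the extra uncertainty that would get folded into the survey’s overall error.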
~
That’s generally what I meant by “control”. But at that point, we might just be nitpicking about words.
Possibly, though I have in mind a difference in meaning or, perhaps, attitude. “Can control” implies to me that you think you can reduce this issue to irrelevance, so that it will not affect the results. “Will estimate” implies that this is another source of uncertainty: you’ll try to get a handle on it, but it will still add to the total uncertainty of the final outcome.
Well, the most obvious misinterpretations of the question will also result in people not failing the “apple juice” question.
What cut-off criteria would you use with those questions to avoid cherry-picking the data?
You check that “Whether or not someone enjoyed apple juice” is marked 1 or 2; if it isn’t, you throw out the participant. Otherwise, you keep the response.
There are a few other tactics. Another one is to have a question like “I consider myself optimistic” and then later have a question “I consider myself pessimistic” and you check to see if the answers are in an inverse relationship.
And if they are, you mark the person as bipolar :-D
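For what it’s worth, that cut-off and the timer check can be applied mechanically. Below is a minimal pandas sketch; the column names (apple_juice, total_seconds), the values, and the minimum completion time are all made up for illustration.

    import pandas as pd

    # Hypothetical responses; column names and values are invented.
    responses = pd.DataFrame({
        "worker_id":     ["A1", "A2", "A3", "A4"],
        "apple_juice":   [1, 5, 2, 1],        # 1-7 relevance rating of the filler item
        "total_seconds": [410, 95, 380, 45],  # time spent on the whole questionnaire
    })

    MIN_SECONDS = 120  # assumed floor for a plausible completion time

    keep = (responses["apple_juice"] <= 2) & (responses["total_seconds"] >= MIN_SECONDS)
    clean = responses[keep]
    print(f"Kept {len(clean)} of {len(responses)} responses.")

The thresholds (1 or 2 on the filler item, the minimum time) have to be fixed before looking at the data, otherwise the screening itself becomes a form of cherry-picking.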
Even if the controls are effective, having to discard some of the answers still drives up the cost.
Yes. But by enough to make it no longer worth doing? I don’t know.
That.
Plus, of course, there is huge selection bias. How many people with regular jobs, for example, do you think spend their evenings doing MTurk jobs?
But yes, the real issue is that you’ll have a great deal of noise in your responses and you will have serious issues trying to filter it out.
I discuss this in the “Diversity of the Sample” subsection of the “Is Mechanical Turk a Reliable Source of Data?” section.
The question is not “Is MTurk representative?” but rather “Is MTurk representative enough to be useful in answering the kinds of questions we want to answer, and quicker/cheaper than our alternative sample sources?”
The first question is “Can you trust the data coming out of MTurk surveys?”
The paper your link references is behind a paywall, but it seems likely to me that they gathered the data on the representativeness of MTurk workers through a survey of MTurk workers. Is there a reason to trust these numbers?
Which one? I can make it publicly available.
You can compare the answers to other samples.
Unless, of course, your concern is that the subjects are lying about their demographics, which is certainly possible. But then, it would be pretty amazing if this mix of lies and truths created a believable sample. And what would be the motivation to lie about demographics? Would this motivation be any higher than in other surveys? Do you doubt the demographics in non-MTurk samples?
I actually do agree this is a risk, so we’d have to (1) maybe run a study first to gauge how often MTurkers lie, perhaps using the Marlowe-Crowne Social Desirability Inventory, and/or (2) look through MTurker forums to see if people talk about lying on demographics. (One demographic that is known to be fabricated fairly often is nationality, because many MTurk tasks are restricted to Americans.)
Instead of dismissing MTurk based on expectations that it would be useless for research, I think it would be important to test it. After all, published social science has made use of MTurk samples, so we have some basis for expecting it to be at least worth testing to see if it’s legitimate.
The paper I mean is this one: http://cpx.sagepub.com/content/early/2013/01/31/2167702612469015.abstract
Yes. Or, rather, the subjects submit noise as data.
Consider, e.g. a Vietnamese teenager who knows some English and has declared himself as an American to MTurk. He’ll fill out a survey because he’ll get paid for it, but there is zero incentive for him to give true answers (and some questions like “Did you vote for Obama?” are meaningless for him). The rational thing for him to do is to put checkmarks into boxes as quickly as he can without being obvious about his answers being random.
I’ll rephrase this as “it would be useful and necessary to test it before we use MTurk samples for research”.
Here you go.
~
This is a good point. However, you would still be able to match the resulting demographics to known trends and see how reliable your sample is. Random answers should show up, either overtly in the checks or subtly through aggregate statistics.
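One concrete way to do that matching is a goodness-of-fit test of the sample’s demographic counts against benchmark shares (census figures, say). A rough Python sketch with invented numbers:

    from scipy.stats import chisquare

    # Invented sample counts by age bracket, and invented benchmark shares.
    observed = [180, 240, 150, 30]              # 18-29, 30-44, 45-64, 65+
    benchmark_shares = [0.22, 0.26, 0.34, 0.18]

    total = sum(observed)
    expected = [share * total for share in benchmark_shares]

    stat, p = chisquare(f_obs=observed, f_exp=expected)
    print(f"chi-square = {stat:.1f}, p = {p:.3g}")

A small p-value only says the sample’s mix differs from the benchmark; whether it is still “representative enough” for the question at hand is a separate judgment.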
~
Definitely.
A survey designer could ask the same question in different ways, or ask questions with mutually exclusive answers, and then throw away responses with contradictory answers. (This isn’t a perfect cure but it can give an idea of which survey responses are just random marks and which aren’t.)
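A minimal sketch of that kind of screen for one reverse-keyed pair, assuming both items use the same 1-7 agreement scale (the column names and the tolerance are made up): the two answers should roughly mirror each other, i.e. sum to about 8, so respondents whose pair is far from that get flagged and dropped.

    import pandas as pd

    # Hypothetical data: a reverse-keyed item pair on a 1-7 agreement scale.
    df = pd.DataFrame({
        "worker_id":   ["A1", "A2", "A3"],
        "optimistic":  [6, 7, 2],
        "pessimistic": [2, 7, 6],
    })

    SLACK = 2  # assumed tolerance before a pair counts as contradictory

    contradictory = (df["optimistic"] + df["pessimistic"] - 8).abs() > SLACK
    print(df.loc[contradictory, "worker_id"].tolist())  # ['A2'] in this made-up example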