Collecting Quality MTurk Data

Jordan Ottesen


While collecting data from Amazon Mechanical Turk (MTurk) for my research, I discovered some serious quality issues despite closely following recommendations from various scholarly articles (Casey et al., 2017; Fleischer et al., 2015; Kennedy et al., 2020; Paolacci & Chandler, 2014). Approximately 80-90% of my responses were inattentive. More troublingly, many of these inattentive responses would not have been caught had I not manually combed through the 800+ rows of data with a rather rigorous set of criteria. Thankfully, I was able to refine my process and reverse the rate of inattentiveness: now only about 20% of my responses are inattentive.

In this article, I’ll walk through what I did to achieve this better rate of attentiveness in hopes that you’ll find it useful for your own research. For now, this article is password protected because I intend to use these findings for publication in the future. In the meantime, though, I hope anyone who reads this can use what I’ve learned to improve the quality of their MTurk data.

1) Run a Prequalifying HIT

Before you even send your study out to MTurk workers, you first need to “qualify” workers for your specific study. This means they’ve passed some sort of short qualifying task that proves they’ll be attentive when they get the real study.

Using my study as an example, I needed workers who would be attentive to instructional videos. So to prequalify them before sending the study out, I set up an MTurk HIT that required them to watch a 60-second video testing their attentiveness to video content. In the video, I provided a set of steps they had to follow to find the correct survey code based on their worker ID.

I ran three batches, resulting in over 400 HITs with this method. I paid 15 cents for each correct response. The video made it very easy to identify, and withhold payment from, workers who responded with the wrong survey code (e.g., those who just entered their worker ID or the ID shown on screen instead of watching the video to discover the actual code, like “Green Fox”, that corresponded to their worker ID).
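If you prefer to set up the qualifier HIT programmatically rather than through the web interface, a minimal boto3 sketch looks something like this. The HTML content, reward, and other parameter values here are placeholders rather than my exact setup:

```python
import boto3

# Connect to the MTurk requester API (point endpoint_url at the sandbox while testing).
mturk = boto3.client("mturk", region_name="us-east-1")

# The question body is plain HTML wrapped in MTurk's HTMLQuestion schema.
# In my case this is where the 60-second video and a survey-code text box would go.
question_xml = """
<HTMLQuestion xmlns="http://mechanicalturk.amazonaws.com/AWSMechanicalTurkDataSchemas/2011-11-11/HTMLQuestion.xsd">
  <HTMLContent><![CDATA[
    <html><body>
      <!-- embed the video and a survey-code input here -->
    </body></html>
  ]]></HTMLContent>
  <FrameHeight>600</FrameHeight>
</HTMLQuestion>
"""

prequal_hit = mturk.create_hit(
    Title="Watch a 60-second video and enter the survey code",
    Description="Short prequalification task: watch a video and follow its instructions.",
    Keywords="video, survey, prequalification",
    Reward="0.15",                      # 15 cents per correct response
    MaxAssignments=9,                   # one batch; run several batches to build a pool
    AssignmentDurationInSeconds=600,
    LifetimeInSeconds=86400,
    Question=question_xml,
)
print(prequal_hit["HIT"]["HITId"])
```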

2) Analyze Worker Responses to the Prequalifying HIT

Once I had my responses, I imported my list of workers into a Google Sheet and wrote a formula to identify which survey codes correctly corresponded to each worker ID. Using my formula, I then flagged those workers in MTurk with a custom qualification type I created called “Attentive to Video Content” (circled in red in the screenshot below).

Those who passed were assigned a 1 and those who failed were assigned a 0. A little over half passed the task, leaving me with 289 workers identified as attentive to video content. I rejected the rest of the submissions, so I only had to pay the workers who actually paid attention to the video.

Now, I don’t think you necessarily need to go through the same rigor of hiding a code in a video the way I did, but you do need to run a HIT that lets workers prove to you that they’ll be attentive in your study. Determine a short task that would work well for your study, then flag the workers who pass.
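If you want to script the flagging step instead of clicking through the worker management page, here is a minimal boto3 sketch. The worker IDs and the graded dictionary are placeholders for whatever grading you do in your own spreadsheet:

```python
import boto3

mturk = boto3.client("mturk", region_name="us-east-1")

# Create the custom qualification type once.
qual = mturk.create_qualification_type(
    Name="Attentive to Video Content",
    Description="Passed the video-attentiveness prequalification HIT.",
    QualificationTypeStatus="Active",
)
qual_id = qual["QualificationType"]["QualificationTypeId"]

# graded maps each worker ID to whether their submitted survey code matched
# the expected code for that worker (placeholder data shown here).
graded = {"A1EXAMPLEWORKERID": True, "A2EXAMPLEWORKERID": False}

for worker_id, passed in graded.items():
    mturk.associate_qualification_with_worker(
        QualificationTypeId=qual_id,
        WorkerId=worker_id,
        IntegerValue=1 if passed else 0,    # 1 = attentive, 0 = failed the task
        SendNotification=False,
    )
```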

3) Conduct Your Study

Now that you know which workers will be attentive, use your custom qualification type in MTurk. When editing the HIT, you’ll see the custom qualifier as an option at the bottom of the drop-down. You should be able to set up qualifications that look something like this (note my custom qualifier is circled below):
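If you create your HITs through the API instead of the web editor, the same requirements can be expressed in code. This continues the boto3 sketch above (reusing mturk and qual_id); the title, reward, and study_question_xml are placeholders for your own study, and the two long IDs are MTurk’s built-in locale and approval-rate qualifications:

```python
# Continues the sketch above: mturk and qual_id are already defined, and
# study_question_xml holds the HTMLQuestion/ExternalQuestion for the study itself.
study_hit = mturk.create_hit(
    Title="15-minute research study on instructional videos",
    Description="Watch instructional videos and answer questions about them.",
    Reward="2.50",
    MaxAssignments=9,
    AssignmentDurationInSeconds=3600,
    LifetimeInSeconds=259200,
    Question=study_question_xml,
    QualificationRequirements=[
        {   # only workers flagged as attentive in the prequalifying HIT
            "QualificationTypeId": qual_id,
            "Comparator": "EqualTo",
            "IntegerValues": [1],
            "ActionsGuarded": "DiscoverPreviewAndAccept",
        },
        {   # built-in locale qualification: US workers only
            "QualificationTypeId": "00000000000000000071",
            "Comparator": "EqualTo",
            "LocaleValues": [{"Country": "US"}],
        },
        {   # built-in approval-rate qualification: more than 95% approved
            "QualificationTypeId": "000000000000000000L0",
            "Comparator": "GreaterThan",
            "IntegerValues": [95],
        },
    ],
)
```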

I also found that response quality does indeed go up when you pay more per HIT. My study took participants an average of 15 minutes to complete, so I originally offered $1.81 per HIT (putting my effective rate at roughly the federal minimum wage of $7.25/hour). I thought minimum wage would be sufficient, especially compared to the $1 commonly used in academia, but I found that it led workers to flag me as a low-paying requester on TurkerView. I even got a few emails from workers complaining that the payment was too low.

https://turkerview.com/requesters/A2DDB025KWS4HK

I bumped my payment per HIT up to $2.50, making it worth $10/hour. This placed me in the “green” range on TurkerView, and I believe it also contributed to higher-quality, more attentive responses.
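The arithmetic behind those numbers is simple; here is a quick way to back a reward out of a target hourly rate, using the values from my study:

```python
# Work backward from a target hourly rate and the median completion time.
minutes_per_response = 15
target_hourly_rate = 10.00          # the rate that landed me in TurkerView's green range
reward_per_hit = round(target_hourly_rate * minutes_per_response / 60, 2)
print(reward_per_hit)               # 2.5 -> posted the HIT at $2.50
```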

https://turkerview.com/

One last thing to note: I ran my batches in sets of nine to avoid paying the higher MTurk fee (MTurk charges an additional fee on HITs with 10 or more assignments). This was nice, but it did mean I was vulnerable to duplicate entries, especially since my qualified pool was fewer than 300 workers. To mitigate this, I made a second custom qualifier that flagged workers who had already completed my study and excluded them from later batches.
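A sketch of that second qualifier, again assuming boto3; workers_in_latest_batch is a placeholder for however you pull worker IDs out of each batch’s results:

```python
import boto3

mturk = boto3.client("mturk", region_name="us-east-1")

# A second qualification marks workers who have already taken the study.
completed = mturk.create_qualification_type(
    Name="Already Completed Study",
    Description="Worker has already submitted a response to the main study.",
    QualificationTypeStatus="Active",
)
completed_id = completed["QualificationType"]["QualificationTypeId"]

# After each batch, tag everyone who submitted.
workers_in_latest_batch = ["A1EXAMPLEWORKERID"]    # placeholder: pulled from the batch results
for worker_id in workers_in_latest_batch:
    mturk.associate_qualification_with_worker(
        QualificationTypeId=completed_id,
        WorkerId=worker_id,
        IntegerValue=1,
        SendNotification=False,
    )

# Add this to the next batch's QualificationRequirements so repeat takers never see the HIT.
exclude_repeat_takers = {
    "QualificationTypeId": completed_id,
    "Comparator": "DoesNotExist",
    "ActionsGuarded": "DiscoverPreviewAndAccept",
}
```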

4) Review the Data

As a last step, you’ll still need to look at each response to ensure it’s attentive. I built “attentiveness catchers” into my study to make it easier to find inattentive responses. For example, at the beginning of the study I asked respondents for their age, then at the end I asked for their birth year. If the age and year didn’t align, the response was flagged as potentially inattentive, and I looked more closely at the open-ended responses to see whether the worker was actually paying attention or had just copied answers from an internet search. I also asked respondents for their state and city of residence in two separate questions; if the city didn’t align with the state, the response was also flagged for review.
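Checks like these are easy to automate once the data is exported. Here is a rough pandas sketch; the file names and column names (age, birth_year, city, state) are assumptions about how your export is laid out, not my exact survey fields:

```python
import datetime as dt
import pandas as pd

df = pd.read_csv("study_responses.csv")        # hypothetical export of the survey data

# Flag responses whose reported age doesn't line up with their birth year,
# allowing a year of slack for birthdays that haven't happened yet.
current_year = dt.date.today().year
implied_age = current_year - df["birth_year"]
df["age_mismatch"] = (implied_age - df["age"]).abs() > 1

# Flag city/state pairs that don't appear in a lookup of valid combinations
# (any public gazetteer of US cities works here).
valid_pairs = pd.read_csv("us_cities.csv")     # columns: city, state
df["location_mismatch"] = ~df.set_index(["city", "state"]).index.isin(
    valid_pairs.set_index(["city", "state"]).index
)

flagged = df[df["age_mismatch"] | df["location_mismatch"]]
print(f"{len(flagged)} responses flagged for manual review")
```

Anything flagged still gets a human look; the point of the checks is just to narrow down which rows need it.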

Any response found to be inattentive was labeled as such in the dataset, and I rejected the corresponding work in MTurk. Again, I only paid respondents who were genuinely attentive to the content.

Last Thoughts

It was honestly becoming a nightmare to get quality data before I found this process. I was originally using only the three commonly recommended qualifiers (HIT approval rate above 95%, located in the US, more than 50 HITs approved), and the longer my study ran, the less reliable the data became. About a month into data collection, I had several batches of nine in which none of the responses were attentive. Luckily, one of the workers who complained to me via email was willing to chat. I explained my study to him, and he suggested I look into the prequalifying HIT option. I’m glad he did!

References

Casey, L. S., Chandler, J., Levine, A. S., Proctor, A., & Strolovitch, D. Z. (2017). Intertemporal differences among MTurk workers: Time-based sample variations and implications for online data collection. SAGE Open, 7. https://doi.org/10.1177/2158244017712774

Fleischer, A., Mead, A. D., & Huang, J. (2015). Inattentive responding in MTurk and other online samples. Industrial and Organizational Psychology, 8(2), 196–202. https://doi.org/10.1017/iop.2015.25

Kennedy, R., Clifford, S., Burleigh, T., Waggoner, P. D., Jewell, R., & Winter, N. J. G. (2020). The shape of and solutions to the MTurk quality crisis. Political Science Research and Methods, 8(4), 614–629. https://doi.org/10.1017/psrm.2020.6

Paolacci, G., & Chandler, J. (2014). Inside the Turk: Understanding Mechanical Turk as a participant pool. Current Directions in Psychological Science, 23(3), 184–188.