Thursday, May 8, 2014

Effective Use of Amazon Mechanical Turk - Neeraj Kumar

Neeraj Kumar published "Effective Use of Amazon Mechanical Turk" in May 2013 and updated it in May 2014.

He posted a link to this blog in his paper and that is how I discovered it. It is an excellent paper that discusses how to be a requester on Mechanical Turk and uncovers some of the pitfalls that many requesters face when trying to use Mturk.

This is a requester who "gets it" for the most part, and there is a very helpful Q&A section at the bottom of the paper. Yet the HITs he published and the Q&A both focus on academic work. The data was not needed by a for-profit corporation, where accuracy, speed and competency are requirements of the data rather than subjects of study. Both settings deserve and require accurate results, but in an academic setting, having multiple people complete the same work is often part of the research itself. Non-academic requesters do not need it.

I have written to Neeraj with a link to this post, and he might consider revising some of the issues I noticed in the Q&A section. There is a lot of excellent advice in this paper, but as he states, some of it may no longer be relevant.
I will quote the sections and respond.

Question 3

Question: I pick 1 cent per 5-second task which turns out to give a worker $7.20 per hour. Do you think it's appropriate?
Answer: If the task really takes 5 seconds on average, then 1 cent is slightly on the higher side of what people are usually paid [as of May 2013]. The usual solution is to add more than one 'job' per HIT, so that it takes a bit longer. Remember that you are being charged cost + 10%, or cost + 0.5 cents (whichever is higher) until you get to 5 cents per hit, you're paying proportionally more per HIT to amazon in fees. If your total volume is not very much (~$100), this doesn't matter, but at larger volumes, these costs can add up.
But in general, the thing to understand about costs on mturk is that they don't determine IF your job gets done, but rather WHEN your job gets done. More money == faster completion. It's hard to judge how much money is enough, so usually I start with the bare minimum, submit a small job (~20 HITs or so), and see how long it takes workers to complete that job. If it's too slow, I make the next batch a bit more expensive and repeat this process until I'm happy with the speed.
BTW, this kind of iteration is almost always needed, not just for pricing, but also to evaluate how well workers are completing the task (in terms of accuracy) and also for debugging.
Small problem with the question - A 5 second task for a standard worker is not 5 seconds. Although the completion time may state 5 seconds, the worker also spends time submitting the hit and accepting the next one. So 5 seconds is really more like 8 seconds, or $4.50 an hour, not $7.20. Also, standard Mturk workers have to solve a captcha every 25 hits to prove they are not automated bots, which further reduces their earnings.
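The arithmetic behind those two figures can be sketched in a few lines of Python. The function name is made up for illustration, and the 3 seconds of overhead is simply the estimate above (5 seconds of work becoming roughly 8 seconds per hit). Captcha time is not modeled.

```python
def effective_hourly_wage(reward_dollars, seconds_per_hit):
    """Hourly earnings if a worker completes hits back to back."""
    hits_per_hour = 3600 / seconds_per_hit
    return reward_dollars * hits_per_hour

# Wage as stated in the question: 1 cent for 5 seconds of work.
nominal = effective_hourly_wage(0.01, 5)

# Wage once roughly 3 seconds of submit/accept overhead is added (assumed).
realistic = effective_hourly_wage(0.01, 5 + 3)

print(f"nominal:   ${nominal:.2f}/hour")
print(f"realistic: ${realistic:.2f}/hour")
```

Changing the overhead value shows how quickly a short task's real hourly rate falls as the time between hits grows.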
Small problem with the answer - It is not hard to judge how much money is enough. This is a workplace and if you have real work that needs to be done, you should pay a fair wage to get it completed. If you are looking for simple transcription of business cards and only want Indian workers to complete the task, pay them a fair wage for India. If you want American workers to complete a task, you should start at the minimum wage in the United States and work up from there.
Requesters who view Mturk as a cheap labor force end up with poor results. There are hundreds of requesters on Mturk who are paying workers $2 an hour and less, but they are submitting the same hit 3, 4 or even 10 times in order to achieve a desired result. Why not pay one worker a fair wage to do the job right the first time?
The reason I point out these two countries is that many requesters do not know that India and America are the only two countries where workers are paid in cash. The rest of the world is paid in Amazon currency, which cannot be converted to cash. These are the countries where people earn a living off of Mturk, and they should be paid a fair wage. There is a huge workforce of excellent workers on Mturk who will not even consider working for less than $10 an hour. When you know how to access these workers, you are guaranteed excellent results.
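To put rough numbers on the "same hit 3, 4 or even 10 times" point, here is a small sketch. It assumes the fee rule quoted in the answer above (10% of the reward, with a 0.5 cent minimum), which is the figure from the paper and not necessarily current pricing. The rewards and assignment counts are hypothetical.

```python
def fee_cents(reward_cents):
    """Amazon's commission per assignment under the fee rule quoted above."""
    return max(0.10 * reward_cents, 0.5)

def cost_per_item_cents(reward_cents, assignments):
    """Total requester cost to have one item done by `assignments` workers."""
    return assignments * (reward_cents + fee_cents(reward_cents))

# Hypothetical comparison: 10 workers at 1 cent each vs. 1 worker at 5 cents.
redundant = cost_per_item_cents(1, 10)  # 10 * (1 + 0.5) = 15.0 cents
single = cost_per_item_cents(5, 1)      # 1 * (5 + 0.5) = 5.5 cents

print(f"10 assignments at 1 cent: {redundant} cents per item")
print(f"1 assignment at 5 cents:  {single} cents per item")
```

Under these assumed numbers, the redundant approach costs the requester nearly three times as much per item while each worker earns a fifth of the reward.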

Question 5 
In general, if you design your tasks such that it is not trivially easy to cheat, then spammers are usually not a huge issue. It also helps if your jobs are very small and cheap, as then there is often less incentive for spammers to devote time and resources to figure out how to cheat. ....
A simpler alternative is to just ratchet up the number of workers doing the same task. For simple things like attribute labeling, we required 3 responses per image, but for the face verification task, we had 10 responses each. I think we might also have thrown out outliers from this 10, but I'm not sure about that.
As stated in the article, there are worker forums where workers communicate with each other. Most forums are honest and professional, but some are not. It is against Amazon's Terms of Service to automate any part of a HIT, but people write scripts that can auto-complete many aspects of a simple HIT design and then share the scripts on unprofessional worker forums. These scammers then work together, set their scripts to mark certain buttons, and go full force on batches of HITs that are not designed properly. This defeats the purpose of having multiple workers complete the same HIT.
Using plurality to grade workers is wasting money for the requester and reducing pay for workers. It is unnecessary when you have qualified workers and proper hit design to weed out spammers and scammers.
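A minimal sketch of why coordinated scripts defeat plurality grading. The worker names and answers are invented for illustration, and the tie handling is one arbitrary choice among several.

```python
from collections import Counter

def plurality_answer(responses):
    """Return the most common response, or None when the top answers tie."""
    counts = Counter(responses).most_common()
    if len(counts) > 1 and counts[0][1] == counts[1][1]:
        return None
    return counts[0][0]

# Hypothetical HIT with three assignments: one careful worker and two
# scripts that always press the same button.
responses = {"careful_worker": "no", "script_a": "yes", "script_b": "yes"}

winner = plurality_answer(responses.values())
flagged = [worker for worker, answer in responses.items() if answer != winner]

print("accepted answer:", winner)
print("flagged as wrong:", flagged)
```

Here the two scripts outvote the one person who actually did the work, so the careful worker is the one who ends up flagged.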

What I have seen on scammer forums is that these workers feel justified in cheating because the pay is so low. Why should they have to click every button, or even do good work, for a requester who does not value their time? On professional worker forums like Turkernation and Cloudmebaby, this type of behavior is frowned upon, and users will be banned for even the slightest discussion of any type of cheating.

Question 6
Question: Can we have like a qualification test ourselves where each worker has to answer a couple of questions to make sure they don't just make a random guess. Do you think this is needed or is the standard 95% qualification option enough. Also if needed, does there exist such feature to do so on Amazon?

Answer: I would first try the simple 95% qualification before you move on to more sophisticated things. Run some smallish batches and see if the results look reasonable. It is possible to add custom qualification tasks, but I've never done them, so I don't know how they work. I think they also drastically cut down on the number of workers who are willing to do them, so only do it if it's absolutely necessary.
In academic settings 95% is an "A" grade, but in the Amazon workforce it should be regarded as an "F". There are only three reasons a worker has less than a 99% approval rating:
1. They are new workers and do not have a lot of hits completed. A few rejections can change their approval percentage drastically.
2. They are foreign workers. These workers do not have as many "good" hits available to them and are forced to work for less-than-scrupulous requesters who use plurality to grade or just do not care about their workforce. Many of the best HITs are U.S. only.
3. They are cheaters, poor workers or outright scammers.

If using approval percentage alone, it should be set to >98% to weed out the wrong workers. I understand that this question and the following questions came from a specific person asking about specific hits, but best practices should always be followed.
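The first reason in the list is easy to see with numbers. This sketch assumes the approval rating is a plain ratio of approved to reviewed hits, and the worker counts are hypothetical.

```python
def approval_rate(approved, rejected):
    """Approval percentage over all reviewed hits."""
    return 100 * approved / (approved + rejected)

# Two hypothetical workers who each received the same 3 rejections.
new_worker = approval_rate(approved=47, rejected=3)    # 50 hits reviewed
veteran = approval_rate(approved=9997, rejected=3)     # 10,000 hits reviewed

print(f"new worker: {new_worker:.2f}%")
print(f"veteran:    {veteran:.2f}%")
```

The same three rejections push the new worker below a >98% cutoff while barely moving the veteran's rating.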

Question 8

In particular, the default worker qualifications (under 'Advanced' when creating a new job), now includes 'worker must be Master'. These are workers who have gone through a more stringent review process. While their work quality might be better (I don't have a good sense if this is actually true), this pool of workers is even smaller. So for very simple jobs (where there's little chance of screwing up), it's usually better to uncheck this option. Good replacement criteria are 'worker must have completed at least 1000 jobs' and 'worker approval rate >= 95%' (or thereabouts).
There are quite a few problems with using "master" workers:
1. You are limiting your pool of workers to a smaller workforce.
2. You are paying Amazon a premium of 30% per hit as opposed to the standard 10%.
3. Amazon has no transparency on how this "masters qualification" is granted. Since the ban on international workers in 2013, no new international workers have been granted master status. If an international worker was not granted "master" status prior to 2013, they do not have a chance of receiving it now, so the international pool of master workers is severely limited.

Using total approved HITs is irrelevant for the most part. Workers who are primarily writers and survey takers may take a couple of years to reach 1000 approved hits, while other workers will routinely complete over 1000 hits a day.

Custom qualifications are the best way to get the best workers. If you are looking for results from the masses and do not care too much about quality then setting the standard qualifications to over 98% approved and 10,000 completed hits will get pretty good workers, but not the best. If you want the BEST workers and ACCURATE results there is really only one way to go...

1. Join a worker forum.
2. Discuss your HITs with workers prior to publishing anything. This ensures that your instructions can be understood, your workers know what to expect and that you can communicate any changes in instructions directly to your workforce.
3. Qualify workers, publish small batches and make sure that your workers understand the HITs and are completing them properly before releasing a large batch.
4. Pay them a fair wage. There are lots of requesters who pay over $20 per hour for the best workers, and you are in competition with them. If you are publishing hits that need to be completed and are paying $8 an hour, and one of your workers' other requesters has hits up for $15 an hour, who do you think they will work for?
5. Try to set a schedule of when you will be releasing work so that your workforce will know to be available when the work is released.
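For the fallback mentioned before the list (over 98% approved and 10,000 completed hits), the requirement might be written roughly as follows. This is a sketch that assumes the request shape accepted by `create_hit` in boto3's MTurk client, an interface that postdates this post. The two IDs are assumed to match Amazon's published system qualification types, and "completed" is approximated by the approved-hits count. Nothing here calls the API, and custom qualifications are not shown.

```python
# System qualification type IDs (assumed to match Amazon's published values).
PERCENT_ASSIGNMENTS_APPROVED = "000000000000000000L0"
NUMBER_HITS_APPROVED = "00000000000000000040"

def standard_requirements(min_percent=98, min_hits=10000):
    """Build a QualificationRequirements list for a HIT (data only)."""
    return [
        {
            "QualificationTypeId": PERCENT_ASSIGNMENTS_APPROVED,
            "Comparator": "GreaterThan",
            "IntegerValues": [min_percent],
        },
        {
            "QualificationTypeId": NUMBER_HITS_APPROVED,
            "Comparator": "GreaterThanOrEqualTo",
            "IntegerValues": [min_hits],
        },
    ]

requirements = standard_requirements()
print(requirements)
```

The resulting list would be passed as the `QualificationRequirements` argument when creating a hit. The forum-based approach in the steps above happens outside the API and has no equivalent setting.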

New requesters thinking about using Amazon Mechanical Turk and looking in from the outside only see the negative part of Mturk. Requesters who have been around the block have learned some hard lessons with Amazon and know how to get the best workers and best results. A new requester sees the hits published by unethical requesters like infoscout and LinkedIn paying less than $2 an hour and thinks this is the norm, but it is not. The HITs they cannot see are the ones that are being worked on day in and day out by the best that Mturk has to offer. These unethical pay practices have even led to lawsuits that could change the face of this industry and potentially bankrupt the companies that are leading the way in these practices.

 Edit: Neeraj has made some modifications to his paper to reflect some points in this post. Thanks!

1 comment:

  1. Wow. Here's a response from a worker:

    Question 4) You get what you pay for. 7.20/hr is only 5 cents below (federal) minimum wage, so that's not bad. I might actually try on your tasks. I've seen tasks that average less than a dollar per hour. You think I even try on those? NO.

    Question 5) Incorrect. I made thousands of dollars off a requester who posted 3 cent/1 second HITS. They were posting thousands of HITS a day, and didn't check work. I made thousands. I didn't try on a single one. I suspect a lot of other people were doing the same, because they no longer post to MTurk.

    Question 6) I actually like this idea. Some requesters are assholes and will reject hundreds/thousands of correctly completed HITS just so they don't have to pay. As someone with 200K+ in their belt, this doesn't hurt me *too* much, but for a new worker, this can be devastating. I'd also do away with the 1000 completed HITS. The >98% DOES hurt me, because I've gotten bitten by a couple of requesters (funnily enough, not the one I scammed). To get my approval rating up .1%, I have to do over 2000 HITS without a single rejection. I'm sitting at 98.6%, and it's probably going to stay there. Amazon interprets anything under 99.0% to be "under 98%", so I get screwed on these.

    Question 8) The master workers qualification is crap, as are most qualifications. It is granted randomly. With over 200K HITS, I have no master's qualifications.