Tuesday, October 21, 2014

Hadoop_Troubleshooting: Job hangs at "map 0% reduce 0%" with logs "Reduce slow start threshold not met"

When I submitted an example Hadoop job as below:
hadoop jar $HADOOP_HOME/share/hadoop/mapreduce/hadoop-mapreduce-examples-2.2.0.jar pi -Dmapred.job.queue.name=root.supertool 10 10000
The progress gets stuck at "map 0% reduce 0%", with job logs:
2014-10-22 13:10:46,703 INFO [Thread-48] org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator: mapResourceReqt:1536
2014-10-22 13:10:46,770 INFO [Thread-48] org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator: reduceResourceReqt:3072
2014-10-22 13:10:46,794 INFO [eventHandlingThread] org.apache.hadoop.conf.Configuration.deprecation: user.name is deprecated. Instead, use mapreduce.job.user.name
2014-10-22 13:10:47,368 INFO [RMCommunicator Allocator] org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator: Before Scheduling: PendingReds:1 ScheduledMaps:10 ScheduledReds:0 AssignedMaps:0 AssignedReds:0 CompletedMaps:0 CompletedReds:0 ContAlloc:0 ContRel:0 HostLocal:0 RackLocal:0
2014-10-22 13:10:47,602 INFO [RMCommunicator Allocator] org.apache.hadoop.mapreduce.v2.app.rm.RMContainerRequestor: getResources() for application_1413952943653_0002: ask=8 release= 0 newContainers=0 finishedContainers=0 resourcelimit= knownNMs=6
2014-10-22 13:10:47,605 INFO [RMCommunicator Allocator] org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator: Ramping down all scheduled reduces:0
2014-10-22 13:10:47,605 INFO [RMCommunicator Allocator] org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator: Going to preempt 0
2014-10-22 13:10:47,607 INFO [RMCommunicator Allocator] org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator: Recalculating schedule, headroom=0
2014-10-22 13:10:47,607 INFO [RMCommunicator Allocator] org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator: Reduce slow start threshold not met. completedMapsForReduceSlowstart 1

After some googling and experimenting, it appears that a Hadoop job most likely hangs at "Reduce slow start threshold not met" when there are not enough resources available, such as memory or vcores.

In my case, I rechecked $HADOOP_HOME/etc/hadoop/fair-scheduler.xml and found that the vcores for the root.supertool queue had accidentally been set to zero:
<queue name="supertool">
  <minResources>10000 mb, 0 vcores</minResources>
  <maxResources>90000 mb, 0 vcores</maxResources>
  <maxRunningApps>50</maxRunningApps>
  <weight>1.0</weight>
  <schedulingPolicy>fair</schedulingPolicy>
  <minSharePreemptionTimeout>300</minSharePreemptionTimeout>
</queue>

Once I set the vcores back to a reasonable value, the stuck job resumed and ran to completion.
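For reference, a fixed queue definition could look roughly like the following. This is only a sketch: the vcore numbers below are illustrative assumptions, not the values from my cluster, so pick numbers that match your own node capacity.

<queue name="supertool">
  <!-- vcores must be non-zero, otherwise no container can ever be scheduled in this queue -->
  <!-- the 10 / 50 vcore figures below are placeholders for illustration -->
  <minResources>10000 mb, 10 vcores</minResources>
  <maxResources>90000 mb, 50 vcores</maxResources>
  <maxRunningApps>50</maxRunningApps>
  <weight>1.0</weight>
  <schedulingPolicy>fair</schedulingPolicy>
  <minSharePreemptionTimeout>300</minSharePreemptionTimeout>
</queue>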

P.S. Besides the condition above, an unreasonable memory or vcore configuration can also lead to this scenario. Please see Memory Configuration in Hadoop and VCore Configuration in Hadoop for more details.

P.S. again: After I installed Hadoop on my Mac and ran "hadoop jar share/hadoop/mapreduce/hadoop-mapreduce-examples-2.2.0.jar pi -Dmapred.job.queue.name=root.test 3 100000000", it hung at "Reduce slow start threshold not met. completedMapsForReduceSlowstart 1" again, even though I had configured memory and vcores as instructed in the links above. It turned out that maxResources for the root.test queue in fair-scheduler.xml was set to 500 MB, while 'yarn.scheduler.minimum-allocation-mb', 'mapreduce.map.memory.mb' and 'mapreduce.reduce.memory.mb' were all above 500 MB. In other words, not even a single map or reduce container could be allocated in that queue. So be aware that the maxResources of a queue should be no smaller than each of those three parameters; a sketch of a consistent configuration is shown below.
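The snippets below sketch one consistent combination of these settings. All numbers are illustrative assumptions chosen only to show the relationship, not recommended values.

<!-- yarn-site.xml: smallest container the scheduler will hand out -->
<property>
  <name>yarn.scheduler.minimum-allocation-mb</name>
  <value>1024</value>
</property>

<!-- mapred-site.xml: memory requested per map and reduce task -->
<property>
  <name>mapreduce.map.memory.mb</name>
  <value>1024</value>
</property>
<property>
  <name>mapreduce.reduce.memory.mb</name>
  <value>1024</value>
</property>

<!-- fair-scheduler.xml: maxResources must be at least as large as each value above,
     otherwise no single container can fit in the queue and the job hangs -->
<queue name="test">
  <maxResources>4096 mb, 4 vcores</maxResources>
</queue>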


© 2014-2017 jason4zhu.blogspot.com All Rights Reserved
If reposting, please credit the origin: Jason4Zhu
