I am very new to Hadoop and am trying to run a MapReduce job on my university's cluster. I have tested my mapper and reducer locally and they seem to work fine, but when I run the job with Hadoop Streaming I get the error below over and over again.

The goal is to find the difference between the maximum and minimum wind speed for each day.
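For example (numbers made up): if the wind speeds recorded on 20230615 are 5, 12 and 7 kt, the output line for that day should be the date and 7.0, i.e.:

# Hypothetical illustration of the expected result (numbers made up)
speeds = [5.0, 12.0, 7.0]  # wind speeds (kt) recorded on 20230615
print('%s\t%s' % ('20230615', max(speeds) - min(speeds)))  # prints: 20230615\t7.0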

My mapper:

#!/usr/bin/env python

import sys

# Read the header line and strip white space
header = [x.strip() for x in sys.stdin.readline().split(',')]

# Find the positions of the columns
date_index = header.index('YearMonthDay')
wind_speed_index = header.index('Wind Speed (kt)')

expected_fields = 21  # some lines have an extra field - define here how many fields there should be

for line in sys.stdin:
    fields = line.strip().split(',')  # remove any leading or trailing whitespace, then split on commas
    if len(fields) != expected_fields:  # if there are more or fewer fields, ignore the row
        continue  # skip this line
    date = fields[date_index]
    wind_speed = fields[wind_speed_index]
    if wind_speed == '-':  # if wind_speed is '-', ignore this line
        continue
    print('%s\t%s' % (date, wind_speed))

My reducer:

#!/usr/bin/env python

import sys

current_date = None
date = None  # Initialize date
max_wind_speed = 0
min_wind_speed = float('inf')
for line in sys.stdin:
    line = line.strip()
    date, wind_speed = line.split('\t', 1)
    try:
        wind_speed = float(wind_speed)
    except ValueError:
        continue
    if current_date == date:
        max_wind_speed = max(max_wind_speed, wind_speed)
        min_wind_speed = min(min_wind_speed, wind_speed)
    else:
        if current_date:
            print('%s\t%s' % (current_date, max_wind_speed - min_wind_speed))
        max_wind_speed = wind_speed
        min_wind_speed = wind_speed
        current_date = date
if current_date == date:
    print('%s\t%s' % (current_date, max_wind_speed - min_wind_speed))

The command I'm using in the terminal:

hadoop jar /opt/hadoop/current/share/hadoop/tools/lib/hadoop-streaming-3.3.0.jar \
-file mapper.py -mapper mapper.py \
-file reducer.py -reducer reducer.py \
-input filename.txt \
-output output1

The full error text:

2023-06-15 18:38:18,871 WARN streaming.StreamJob: -file option is deprecated, please use generic option -files instead.
packageJobJar: [mapper.py, reducer.py, /tmp/hadoop-unjar16415805188223852506/] [] /tmp/streamjob4666494938393678435.jar tmpDir=null
2023-06-15 18:38:19,468 INFO client.DefaultNoHARMFailoverProxyProvider: Connecting to ResourceManager at lena-master/128.86.245.64:8032
2023-06-15 18:38:19,565 INFO client.DefaultNoHARMFailoverProxyProvider: Connecting to ResourceManager at lena-master/128.86.245.64:8032
2023-06-15 18:38:19,751 INFO mapreduce.JobResourceUploader: Disabling Erasure Coding for path: /tmp/hadoop-yarn/staging/achap001/.staging/job_1684218082735_7133
2023-06-15 18:38:20,026 INFO mapred.FileInputFormat: Total input files to process: 1
2023-06-15 18:38:20,094 INFO mapreduce.JobSubmitter: number of splits:2
2023-06-15 18:38:20,236 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1684218082735_7133
2023-06-15 18:38:20,236 INFO mapreduce.JobSubmitter: Executing with tokens: []
2023-06-15 18:38:20,337 INFO conf.Configuration: resource-types.xml not found
2023-06-15 18:38:20,337 INFO resource.ResourceUtils: Unable to find 'resource-types.xml'.
2023-06-15 18:38:20,575 INFO impl.YarnClientImpl: Submitted application application_1684218082735_7133
2023-06-15 18:38:20,610 INFO mapreduce.Job: The url to track the job: http://lena-master:8088/proxy/application_1684218082735_7133/
2023-06-15 18:38:20,611 INFO mapreduce.Job: Running job: job_1684218082735_7133
2023-06-15 18:38:25,692 INFO mapreduce.Job: Job job_1684218082735_7133 running in uber mode : false
2023-06-15 18:38:25,695 INFO mapreduce.Job:  map 0% reduce 0%
2023-06-15 18:38:27,756 INFO mapreduce.Job: Task Id : attempt_1684218082735_7133_m_000000_0, Status : FAILED
Error: java.lang.RuntimeException: PipeMapRed.waitOutputThreads(): subprocess failed with code 1
        at org.apache.hadoop.streaming.PipeMapRed.waitOutputThreads(PipeMapRed.java:326)
        at org.apache.hadoop.streaming.PipeMapRed.mapRedFinished(PipeMapRed.java:539)
        at org.apache.hadoop.streaming.PipeMapper.close(PipeMapper.java:130)
        at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:61)
        at org.apache.hadoop.streaming.PipeMapRunner.run(PipeMapRunner.java:34)
        at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:465)
        at org.apache.hadoop.mapred.MapTask.run(MapTask.java:349)
        at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:178)
        at java.base/java.security.AccessController.doPrivileged(Native Method)
        at java.base/javax.security.auth.Subject.doAs(Subject.java:423)
        at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1845)
        at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:172)

2023-06-15 18:38:28,780 INFO mapreduce.Job:  map 50% reduce 0%
2023-06-15 18:38:30,797 INFO mapreduce.Job: Task Id : attempt_1684218082735_7133_m_000000_1, Status : FAILED
Error: java.lang.RuntimeException: PipeMapRed.waitOutputThreads(): subprocess failed with code 1
        at org.apache.hadoop.streaming.PipeMapRed.waitOutputThreads(PipeMapRed.java:326)
        at org.apache.hadoop.streaming.PipeMapRed.mapRedFinished(PipeMapRed.java:539)
        at org.apache.hadoop.streaming.PipeMapper.close(PipeMapper.java:130)
        at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:61)
        at org.apache.hadoop.streaming.PipeMapRunner.run(PipeMapRunner.java:34)
        at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:465)
        at org.apache.hadoop.mapred.MapTask.run(MapTask.java:349)
        at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:178)
        at java.base/java.security.AccessController.doPrivileged(Native Method)
        at java.base/javax.security.auth.Subject.doAs(Subject.java:423)
        at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1845)
        at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:172)

2023-06-15 18:38:33,821 INFO mapreduce.Job: Task Id : attempt_1684218082735_7133_m_000000_2, Status : FAILED
Error: java.lang.RuntimeException: PipeMapRed.waitOutputThreads(): subprocess failed with code 1
        at org.apache.hadoop.streaming.PipeMapRed.waitOutputThreads(PipeMapRed.java:326)
        at org.apache.hadoop.streaming.PipeMapRed.mapRedFinished(PipeMapRed.java:539)
        at org.apache.hadoop.streaming.PipeMapper.close(PipeMapper.java:130)
        at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:61)
        at org.apache.hadoop.streaming.PipeMapRunner.run(PipeMapRunner.java:34)
        at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:465)
        at org.apache.hadoop.mapred.MapTask.run(MapTask.java:349)
        at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:178)
        at java.base/java.security.AccessController.doPrivileged(Native Method)
        at java.base/javax.security.auth.Subject.doAs(Subject.java:423)
        at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1845)
        at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:172)

2023-06-15 18:38:37,846 INFO mapreduce.Job:  map 100% reduce 100%
2023-06-15 18:38:37,858 INFO mapreduce.Job: Job job_1684218082735_7133 failed with state FAILED due to: Task failed task_1684218082735_7133_m_000000
Job failed as tasks failed. failedMaps:1 failedReduces:0 killedMaps:0 killedReduces: 0

2023-06-15 18:38:37,959 INFO mapreduce.Job: Counters: 42
        File System Counters
                FILE: Number of bytes read=0
                FILE: Number of bytes written=352981
                FILE: Number of read operations=0
                FILE: Number of large read operations=0
                FILE: Number of write operations=0
                HDFS: Number of bytes read=645602
                HDFS: Number of bytes written=0
                HDFS: Number of read operations=3
                HDFS: Number of large read operations=0
                HDFS: Number of write operations=0
                HDFS: Number of bytes read erasure-coded=0
        Job Counters
                Failed map tasks=4
                Killed reduce tasks=1
                Launched map tasks=5
                Launched reduce tasks=1
                Other local map tasks=3
                Rack-local map tasks=2
                Total time spent by all maps in occupied slots (ms)=43550
                Total time spent by all reduces in occupied slots (ms)=34160
                Total time spent by all map tasks (ms)=8710
                Total time spent by all reduce tasks (ms)=6832
                Total vcore-milliseconds taken by all map tasks=8710
                Total vcore-milliseconds taken by all reduce tasks=6832
                Total megabyte-milliseconds taken by all map tasks=44595200
                Total megabyte-milliseconds taken by all reduce tasks=34979840
        Map-Reduce Framework
                Map input records=5004
                Map output records=4960
                Map output bytes=74400
                Map output materialized bytes=84326
                Input split bytes=99
                Combine input records=0
                Spilled Records=4960
                Failed Shuffles=0
                Merged Map outputs=0
                GC time elapsed (ms)=18
                CPU time spent (ms)=810
                Physical memory (bytes) snapshot=355909632
                Virtual memory (bytes) snapshot=6371753984
                Total committed heap usage (bytes)=1052770304
                Peak Map Physical memory (bytes)=355909632
                Peak Map Virtual memory (bytes)=6371753984
        File Input Format Counters
                Bytes Read=645503
2023-06-15 18:38:37,960 ERROR streaming.StreamJob: Job not successful!
Streaming Command Failed!

Does anyone know what may be causing this error?

I have tested it many times, created samples of the data, and ran tests on those, but no luck. Locally everything works fine, but as soon as I try to use Hadoop Streaming it stops working.
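For reference, my local test is essentially the following (a minimal sketch; sample.csv stands in for one of the small samples of the data mentioned above):

# Local smoke test, emulating: cat sample.csv | ./mapper.py | sort | ./reducer.py
import subprocess

with open('sample.csv', 'rb') as f:
    mapped = subprocess.run(['python', 'mapper.py'], stdin=f,
                            capture_output=True, check=True).stdout

# Hadoop sorts the mapper output by key before the reducer sees it, so do the same here
shuffled = b'\n'.join(sorted(mapped.splitlines())) + b'\n'

reduced = subprocess.run(['python', 'reducer.py'], input=shuffled,
                         capture_output=True, check=True).stdout
print(reduced.decode())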
