Adding count information to mapreduce output


My mapper.py program writes the following records to sys.stdout:

Input (stdout of the previous mapper.py)

chevy, {mod: spark | col: brown}
chevy, {mod: equinox | col: red}
honda, {mod:civic | col:black}
honda, {mod:accord | col:white} 
honda, {mod:crv | col:pink} 
honda, {mod:hrv | col:gray} 
toyota, {mod:corola | col:white}

I would like to write a reducer.py, or possibly another mapper, that takes these records and produces output such as:

Expected output

chevy, {mod: spark | col: brown | total:2}
chevy, {mod: equinox | col: red | total:2}
honda, {mod:civic | col:black | total:4}
honda, {mod:accord | col:white | total:4} 
honda, {mod:crv | col:pink | total:4} 
honda, {mod:hrv | col:gray | total:4} 
toyota, {mod:corola | col:white | total:1}

The total is counted per key (car brand) only: chevy appears twice, honda four times, and toyota once.

I tried writing a reducer.py, but it did not work. The program looks like this:

curr_k = None
curr_v = None
k = None
curr_count = 0

for car in sys.stdin:
    car_split = car.split('|')
    k = car_split[0]
    v = car_split[1]

    if curr_k == k:
        print(curr_k, curr_v, 'total:',curr_count)
        curr_count += 1

    else:
        if curr_k:
            print(curr_k, curr_v, 'total:',curr_count)
        curr_k = k
        curr_count = 1
    
if curr_k == k:
    print(curr_k, curr_v, 'total:',curr_count)

The above code gave me the following output:

chevy, {mod: spark | col: brown | total:1}
chevy, {mod: equinox | col: red | total:2}
honda, {mod:civic | col:black | total:1}
honda, {mod:accord | col:white | total:2} 
honda, {mod:crv | col:pink | total:3} 
honda, {mod:hrv | col:gray | total:4} 
toyota, {mod:corola | col:white | total:1}

But that is not what I am looking for.

Answer by OneCricketeer

At the time you print each total, you have only counted the lines seen up to that point, so the number printed is a running count rather than the final total for the key.

You need to read all the lines first, compute the correct "total" for each key, and only then print the output. You will need to use dictionaries and lists to do this, as shown in your new question.
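
The two-pass idea described above could be sketched roughly as follows. This is only an illustration, not code from the linked question: the function name add_totals is hypothetical, and it assumes each record looks like "brand, {...}", with the key being everything before the first comma and the value ending in a closing brace. All values are collected per key in a dictionary of lists, and nothing is emitted until every line has been read.

```python
from collections import defaultdict


def add_totals(lines):
    """Return the records with ' | total:N' added inside the braces.

    N is the number of records that share the same key (car brand).
    """
    groups = defaultdict(list)  # brand -> values, in input order
    for line in lines:
        line = line.strip()
        if not line:
            continue  # ignore blank lines
        # Assumption: the key is everything before the first comma.
        key, _, value = line.partition(',')
        groups[key.strip()].append(value.strip())

    out = []
    # Second pass: every group is complete, so len(values) is the final total.
    for key, values in groups.items():
        total = len(values)
        for value in values:
            # Assumption: the value ends with '}'; drop it so the total
            # can be placed before the closing brace added below.
            body = value[:-1].rstrip() if value.endswith('}') else value
            out.append(f"{key}, {body} | total:{total}}}")
    return out


# Sample records copied from the question.
sample = [
    "chevy, {mod: spark | col: brown}",
    "chevy, {mod: equinox | col: red}",
    "honda, {mod:civic | col:black}",
    "honda, {mod:accord | col:white} ",
    "honda, {mod:crv | col:pink} ",
    "honda, {mod:hrv | col:gray} ",
    "toyota, {mod:corola | col:white}",
]

for record in add_totals(sample):
    print(record)
```

In an actual reducer.py you would pass sys.stdin to the function instead of the sample list. Because everything is buffered in memory, this only suits inputs that fit in RAM; if the reducer input arrives sorted by key, as is usual with Hadoop Streaming, buffering one key's records at a time would be enough.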