I have the following dataset
s1, s2, count
1, 2, x1
1, 3, x2
1, 4, x3
2, 1, y1
2, 3, y2
2, 4, y3
3, 1, z1
3, 2, z2
I want to get the following output
s1, s2, count
1, 2, x1-y1
1, 3, x2-z1
1, 4, x3
2, 3, y2-z2
2, 4, y3
The idea is that s1 is an entity which is favored over s2. And I have tuples such that s1 (say = 1) was favored over s2 (say = 2) by x1 times AND s1 (say = 2) was favored over s2 (say = 1) by y1 times. What I need is a sub O(n^2) algorithm to compute absolute number of times s1 was favored over s2 (or other way round). (x1-y1)
The problem is that there are 230 million such tuples and I can't have O(n^2) algorithm to compute this.
One observation is that the tuples are sorted on s1 since they are results of another MR output.
Please help me find a better solution.
I'm not sure I understand the "favoring." It looks as though you want to subtract values where the
s1,s2values are the same.You can define a custom
Comparable/Writable, let's call itS1S2Writable, that encapsulates(s1, s2)as a tuple and states the two tuples are equal whenWith that, you can define a process using
Mapper<S1S2Writable, IntWritable, S1S2Writable, IntWritable>to read your input file, and pass it to aReducer<S1S2Writable, IntWritable, KEYOUT, IntWritable>.This will group the
S1S2WritablewithIterable<IntWritable>, which you can perform a subtraction over.