I am writing a Hadoop program that outputs the 100 most-tweeted hashtags in a data set of tweets. I was able to output all the hashtags and their counts with the WordCount program. The output looks like this (ignore the quotation marks):
"#USA 2"
"#Holy 5"
"#SOS 3"
"#Love 66"
However, I ran into trouble when I attempted to sort them by word frequency (the value) with the code from here.
I noticed that in the input for the program in the link above, the keys are integers instead of strings. I tried changing a few parameters in the code to fit my use case, but it didn't work, because I don't fully understand them. Please help me!
You need a second MapReduce job whose input is the output of your first job. I have tweaked the code to make it do what you want.
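Independent of the Hadoop code itself, the idea behind the second pass can be sketched in plain Java: swap each (hashtag, count) pair so that the count becomes the key, sort by that key in descending order, and keep the first N entries. This is only an illustration under my own assumptions, not the job code. The class name `SecondPassSketch` and the methods `swap` and `topN` are hypothetical, and ties are broken alphabetically purely so the result is deterministic.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical plain-Java illustration of the second pass; no Hadoop classes involved.
class SecondPassSketch {

    // "Map" step: turn (hashtag, count) into (count, hashtag) so the count is the sort key.
    static List<Map.Entry<Integer, String>> swap(Map<String, Integer> firstJobOutput) {
        List<Map.Entry<Integer, String>> swapped = new ArrayList<>();
        for (Map.Entry<String, Integer> e : firstJobOutput.entrySet()) {
            swapped.add(new SimpleEntry<>(e.getValue(), e.getKey()));
        }
        return swapped;
    }

    // "Sort and reduce" step: order by count descending, keep at most n entries,
    // and write them back as "hashtag<TAB>count" lines.
    static List<String> topN(Map<String, Integer> firstJobOutput, int n) {
        List<Map.Entry<Integer, String>> pairs = swap(firstJobOutput);
        pairs.sort((a, b) -> {
            int byCount = Integer.compare(b.getKey(), a.getKey()); // larger counts first
            return byCount != 0 ? byCount : a.getValue().compareTo(b.getValue());
        });
        List<String> out = new ArrayList<>();
        for (Map.Entry<Integer, String> p : pairs.subList(0, Math.min(n, pairs.size()))) {
            out.add(p.getValue() + "\t" + p.getKey());
        }
        return out;
    }

    public static void main(String[] args) {
        // Sample counts taken from the question.
        Map<String, Integer> counts = new LinkedHashMap<>();
        counts.put("#USA", 2);
        counts.put("#Holy", 5);
        counts.put("#SOS", 3);
        counts.put("#Love", 66);
        topN(counts, 100).forEach(System.out::println);
    }
}
```

In a real Hadoop job the sorting is done by the framework on the map output keys, in ascending order by default, so a descending order usually needs a custom comparator (for example via `Job.setSortComparatorClass`) and a single reducer if you want one globally ordered file.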
For Input
The output should be
I have assumed that the hashtag and the count are separated by a tab. If the delimiter is something else, please change it. The code is not tested, so please let me know if it works.
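To make the delimiter assumption concrete, here is a small sketch of parsing one line of the first job's output with a configurable delimiter. The class `HashtagLineParser` and its `parse` method are hypothetical names used only for illustration. As far as I know, Hadoop's default text output separates key and value with a tab, but check your actual output file before relying on that.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.Map;

// Hypothetical helper: splits "hashtag<delimiter>count" into its two parts.
class HashtagLineParser {

    static Map.Entry<String, Integer> parse(String line, String delimiter) {
        // Split on the LAST delimiter so the count is always the final field.
        int idx = line.lastIndexOf(delimiter);
        if (idx <= 0) {
            throw new IllegalArgumentException("No delimiter or empty hashtag in: " + line);
        }
        String hashtag = line.substring(0, idx);
        int count = Integer.parseInt(line.substring(idx + delimiter.length()).trim());
        return new SimpleEntry<>(hashtag, count);
    }

    public static void main(String[] args) {
        // Tab-delimited line, as assumed in the answer.
        Map.Entry<String, Integer> e = parse("#Love\t66", "\t");
        System.out.println(e.getKey() + " -> " + e.getValue());
    }
}
```

If your first job writes a space instead of a tab, passing `" "` as the delimiter is the only change needed in this sketch.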