Hadoop: is something wrong when I run a MapReduce program to count words?


I'm trying to learn to use Hadoop. I have an old laptop on which I installed Linux Mint 21, and I was able to install Hadoop.

The commands below work: when I run hdfs dfs -ls / I see this:

WARNING: An illegal reflective access operation has occurred
WARNING: Illegal reflective access by org.apache.hadoop.security.authentication.util.KerberosUtil (file:/home/dell/hadoop/share/hadoop/common/lib/hadoop-auth-2.7.3.jar) to method sun.security.krb5.Config.getInstance()
WARNING: Please consider reporting this to the maintainers of org.apache.hadoop.security.authentication.util.KerberosUtil
WARNING: Use --illegal-access=warn to enable warnings of further illegal reflective access operations
WARNING: All illegal access operations will be denied in a future release
Found 2 items
drwxr-xr-x   - nenn supergroup          0 2023-02-11 15:30 /my_data
drwx------   - nenn supergroup          0 2023-02-11 15:21 /tmp

In /my_data, I have a txt file. When I run hdfs dfs -ls -R / here is an extract of what I see:

WARNING: An illegal reflective access operation has occurred
WARNING: Illegal reflective access by org.apache.hadoop.security.authentication.util.KerberosUtil (file:/home/dell/hadoop/share/hadoop/common/lib/hadoop-auth-2.7.3.jar) to method sun.security.krb5.Config.getInstance()
WARNING: Please consider reporting this to the maintainers of org.apache.hadoop.security.authentication.util.KerberosUtil
WARNING: Use --illegal-access=warn to enable warnings of further illegal reflective access operations
WARNING: All illegal access operations will be denied in a future release
drwxr-xr-x   - nenn supergroup          0 2023-02-11 15:30 /my_data
-rw-r--r--   1 nenn supergroup    1174876 2023-02-11 15:01 /my_data/book1.txt
drwx------   - nenn supergroup          0 2023-02-11 15:21 /tmp
drwx------   - nenn supergroup          0 2023-02-11 15:21 /tmp/hadoop-yarn
drwx------   - nenn supergroup          0 2023-02-11 15:29 /tmp/hadoop-yarn/staging
drwx------   - nenn supergroup          0 2023-02-11 15:21 /tmp/hadoop-yarn/staging/d

I want to run a job that counts the words in book1.txt. I started YARN with ~/hadoop/sbin/start-yarn.sh

Then I run hadoop jar /home/nenn/wordcount.jar WordCount /my_data/book1.txt /my_data/output_wordcount

And I see this:

WARNING: An illegal reflective access operation has occurred
WARNING: Illegal reflective access by org.apache.hadoop.security.authentication.util.KerberosUtil (file:/home/dell/hadoop/share/hadoop/common/lib/hadoop-auth-2.7.3.jar) to method sun.security.krb5.Config.getInstance()
WARNING: Please consider reporting this to the maintainers of org.apache.hadoop.security.authentication.util.KerberosUtil
WARNING: Use --illegal-access=warn to enable warnings of further illegal reflective access operations
WARNING: All illegal access operations will be denied in a future release
23/02/11 15:33:55 INFO client.RMProxy: Connecting to ResourceManager at /0.0.0.0:8032
23/02/11 15:33:55 WARN mapreduce.JobResourceUploader: Hadoop command-line option parsing not performed. Implement the Tool interface and execute your application with ToolRunner to remedy this.
23/02/11 15:33:56 INFO input.FileInputFormat: Total input paths to process : 1
23/02/11 15:33:56 INFO mapreduce.JobSubmitter: number of splits:1
23/02/11 15:33:56 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1676124615395_0004
23/02/11 15:33:56 INFO impl.YarnClientImpl: Submitted application application_1676124615395_0004
23/02/11 15:33:56 INFO mapreduce.Job: The url to track the job: http://my-computer-05:8088/proxy/application_1676124615395_0004/
23/02/11 15:33:56 INFO mapreduce.Job: Running job: job_1676124615395_0004

The output stays like this for at least 5 minutes, sometimes more. Is the job still counting the words, or is something wrong?

The wordcount.jar file was given to me by my school. When I tried it at school, it worked. But now I want to try it on my own computer, and I don't know whether it is working or not.

Can you help me?

wordcount.jar has this code:

import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

// defining the class WordCount
public class WordCount {

  // defining the class TokenizerMapper
  // this class is in charge of the mapping process
  public static class TokenizerMapper
       extends Mapper<Object, Text, Text, IntWritable>{
    // it extends the class Mapper from mapreduce api
    // this mapper takes as input an Object key (the byte offset of the line) and a Text value (one line of the text)
    // it outputs a Text (a word) and an Integer (1)

    // defining the value to emit
    private final static IntWritable one = new IntWritable(1);
    // initializing the word to emit
    private Text word = new Text();

    // defining the function performed during map
    public void map(Object key, Text value, Context context
                    ) throws IOException, InterruptedException {
      // tokenizing the text partition
      StringTokenizer itr = new StringTokenizer(value.toString());

      // running through the tokens
      while (itr.hasMoreTokens()) {
        // setting the value of word
        word.set(itr.nextToken().toLowerCase().replaceAll("[^a-z 0-9A-Z]",""));
        // emitting the key-value pair
        context.write(word, one);
      }
    }
  }

  // defining the class IntSumReducer
  // this class is in charge of the reducing process
  public static class IntSumReducer
       extends Reducer<Text,IntWritable,Text,IntWritable> {
    // it extends the class Reducer from mapreduce api
    // it takes as input a Text (a word) and a list of integers (1s)
    // it outputs a Text (a word) and an integer (the frequency of the word)

    // initializing the frequency
    private IntWritable result = new IntWritable();

    // defining the function performed during reduce
    public void reduce(Text key, Iterable<IntWritable> values,
                       Context context
                       ) throws IOException, InterruptedException {
      // initializing the sum
      int sum = 0;
      // running through the values associated to this key
      for (IntWritable val : values) {
        // incrementing the sum
        sum += val.get();
      }
      // attributing the sum to the value to emit
      result.set(sum);
      // emitting the key-value pair
      context.write(key, result);
    }
  }

  // defining the main class containing the parameters of the job
  public static void main(String[] args) throws Exception {
    // initializing configuration
    Configuration conf = new Configuration();
    // initializing job
    Job job = Job.getInstance(conf, "word count");
    // providing job with the classes for mapper and reducer
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class); // mapper
    job.setCombinerClass(IntSumReducer.class); // combiner
    job.setReducerClass(IntSumReducer.class); // reducer
    // providing job with the output classes
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);

    // arguments to interpret
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    // completion of the job
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}

There is 1 answer below.

Answer by OneCricketeer:

"Know whether it is working or not"

Open "The url to track the job" from the logs in your browser and look at the actual logs of the application in the YARN UI.
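
If you cannot reach the web UI, you can also query the application state from a terminal. For example (a sketch using the application ID from the logs above; it assumes a default YARN setup):

yarn application -status application_1676124615395_0004

This prints the application's state (ACCEPTED, RUNNING, FINISHED, ...) and its final status. A job that stays in ACCEPTED for a long time usually means YARN has not been able to allocate a container for it, which is common on a single laptop with little free memory.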

Or, in a separate terminal, use the yarn logs command with the application ID from the line above.
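
For example, using the application ID from the question:

yarn logs -applicationId application_1676124615395_0004

Note that this typically requires log aggregation to be enabled (yarn.log-aggregation-enable in yarn-site.xml); without it, the container logs stay on the NodeManager's local disk under the configured log directory.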