Context:
- I want to run an Apache Crunch job on AWS EMR.
- This job is part of a pipeline of Oozie Java actions and Oozie subworkflows (this particular job lives in a subworkflow). In Oozie we have a main workflow; inside it there are multiple Java actions (each specifying the entry point, i.e. the Java class it should run) and one subworkflow, which itself contains multiple Java actions. One of those Java actions runs our job driver class.
- The job reads some data from HDFS and writes back to HDFS (it reads the data into a PCollection -> does some processing on it -> writes the PCollection to HDFS).
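Schematically, the driver does something like the following. This is a minimal sketch, not our real code: the class name, paths, and the identity "processing" step are placeholders.

```java
import org.apache.crunch.DoFn;
import org.apache.crunch.Emitter;
import org.apache.crunch.PCollection;
import org.apache.crunch.Pipeline;
import org.apache.crunch.PipelineResult;
import org.apache.crunch.impl.mr.MRPipeline;
import org.apache.crunch.types.writable.Writables;
import org.apache.hadoop.conf.Configuration;

public class JobDriver {
    public static void main(String[] args) {
        Pipeline pipeline = new MRPipeline(JobDriver.class, new Configuration());

        // Read the input from HDFS into a PCollection (placeholder path)
        PCollection<String> lines = pipeline.readTextFile("/path/to/input");

        // Some processing step (the real job does actual work here)
        PCollection<String> processed = lines.parallelDo(new DoFn<String, String>() {
            @Override
            public void process(String input, Emitter<String> emitter) {
                emitter.emit(input);
            }
        }, Writables.strings());

        // Write the result back to HDFS (placeholder path);
        // _SUCCESS appears in this directory, but no part files
        pipeline.writeTextFile(processed, "/path/to/output");

        // Blocks until the underlying MapReduce jobs finish
        PipelineResult result = pipeline.done();
    }
}
```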
Issue:
- The PCollection is not written to HDFS: although we call PCollection.write, nothing shows up in HDFS, yet the _SUCCESS file is generated in that directory.
Things I've seen:
- I created a dummy method that builds a PCollection of 3 strings, maps over it (uppercasing all the letters), and writes the resulting PCollection to HDFS. I added this piece of code both in the subworkflow Java action (the one containing the job that does not write to HDFS) and in a top-level Java action (by top-level I mean actions directly under the main workflow). In the subworkflow I hit the same issue (the result is not written to HDFS), but in the top-level Java action the dummy output is written and I can see the file in HDFS.
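The dummy check looks roughly like this. A sketch only: it assumes a Crunch version where `Pipeline#create` is available, and the output path is a placeholder.

```java
import org.apache.crunch.MapFn;
import org.apache.crunch.PCollection;
import org.apache.crunch.Pipeline;
import org.apache.crunch.types.writable.Writables;
import com.google.common.collect.ImmutableList;

public class DummyWriteCheck {
    public static void runCheck(Pipeline pipeline) {
        // Build a PCollection out of 3 literal strings
        PCollection<String> input = pipeline.create(
                ImmutableList.of("foo", "bar", "baz"), Writables.strings());

        // Map every string to uppercase
        PCollection<String> upper = input.parallelDo(new MapFn<String, String>() {
            @Override
            public String map(String s) {
                return s.toUpperCase();
            }
        }, Writables.strings());

        // Write the result to HDFS (placeholder path).
        // From a top-level Java action the files appear;
        // from the subworkflow Java action only _SUCCESS does.
        pipeline.writeTextFile(upper, "/path/to/dummy-output");
        pipeline.done();
    }
}
```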
Example of the Oozie pipeline:
- Main Workflow
  - Java_Action_1 (points to a Java class that is run)
  - Java_Action_2 (points to a Java class that is run)
  - Java_Action_3 (points to a Java class that is run)
  - Subworkflow_1 (has a fork and join step, seen in the Oozie UI)
    - Java_Action_1_in_subworkflow (points to a Java class that is run) -> the job that is not writing to HDFS
    - Java_Action_2_in_subworkflow (points to a Java class that is run)
  - Java_Action_4 (points to a Java class that is run)
  - Java_Action_5 (points to a Java class that is run)
  - etc.