Context:
- I want to run an Apache Crunch job on AWS EMR.
- This job is part of a pipeline of Oozie Java actions and Oozie subworkflows (this particular job lives in a subworkflow). In Oozie we have a main workflow; inside it there are multiple Java actions (each specifying the entry point, i.e. the Java class it should run) and one subworkflow, which itself contains multiple Java actions. One of those Java actions runs our job driver class.
- The job reads some data from HDFS and writes back to HDFS (it reads the data into a PCollection -> does some processing on it -> writes the PCollection to HDFS).
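Schematically, the driver does something like the following. This is a minimal sketch, not our real code: the class name, paths, and the identity "processing" step are placeholders.

```java
import org.apache.crunch.DoFn;
import org.apache.crunch.Emitter;
import org.apache.crunch.PCollection;
import org.apache.crunch.Pipeline;
import org.apache.crunch.PipelineResult;
import org.apache.crunch.impl.mr.MRPipeline;
import org.apache.crunch.types.writable.Writables;
import org.apache.hadoop.conf.Configuration;

public class JobDriver {
    public static void main(String[] args) {
        Pipeline pipeline = new MRPipeline(JobDriver.class, new Configuration());

        // Read the input from HDFS into a PCollection (placeholder path)
        PCollection<String> lines = pipeline.readTextFile("/path/to/input");

        // Some processing step (the real job does actual work here)
        PCollection<String> processed = lines.parallelDo(new DoFn<String, String>() {
            @Override
            public void process(String input, Emitter<String> emitter) {
                emitter.emit(input);
            }
        }, Writables.strings());

        // Write the result back to HDFS (placeholder path);
        // _SUCCESS appears in this directory, but no part files
        pipeline.writeTextFile(processed, "/path/to/output");

        // Blocks until the underlying MapReduce jobs finish
        PipelineResult result = pipeline.done();
    }
}
```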
Issue:
- The PCollection is not written to HDFS: although we call PCollection.write, nothing shows up in HDFS, yet the _SUCCESS file is generated in that directory.
Things I've seen:
- I created a dummy method that builds a PCollection of 3 strings, maps over it (uppercasing all the letters), and writes the resulting PCollection to HDFS. I added this piece of code both in the subworkflow Java action (the one containing the job that does not write to HDFS) and in a top-level Java action (by top-level I mean actions directly under the main workflow). In the subworkflow I hit the same issue (the result is not written to HDFS), but in the top-level Java action the dummy output is written and I can see the file in HDFS.
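The dummy check looks roughly like this. A sketch only: it assumes a Crunch version where `Pipeline#create` is available, and the output path is a placeholder.

```java
import org.apache.crunch.MapFn;
import org.apache.crunch.PCollection;
import org.apache.crunch.Pipeline;
import org.apache.crunch.types.writable.Writables;
import com.google.common.collect.ImmutableList;

public class DummyWriteCheck {
    public static void runCheck(Pipeline pipeline) {
        // Build a PCollection out of 3 literal strings
        PCollection<String> input = pipeline.create(
                ImmutableList.of("foo", "bar", "baz"), Writables.strings());

        // Map every string to uppercase
        PCollection<String> upper = input.parallelDo(new MapFn<String, String>() {
            @Override
            public String map(String s) {
                return s.toUpperCase();
            }
        }, Writables.strings());

        // Write the result to HDFS (placeholder path).
        // From a top-level Java action the files appear;
        // from the subworkflow Java action only _SUCCESS does.
        pipeline.writeTextFile(upper, "/path/to/dummy-output");
        pipeline.done();
    }
}
```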
Example of the Oozie pipeline:
- Main Workflow
  - Java_Action_1 (points to a Java class that is run)
  - Java_Action_2 (points to a Java class that is run)
  - Java_Action_3 (points to a Java class that is run)
  - Subworkflow_1 (has a fork and join step, seen in the Oozie UI)
    - Java_Action_1_in_subworkflow (points to a Java class that is run) -> the job that is not writing to HDFS
    - Java_Action_2_in_subworkflow (points to a Java class that is run)
  - Java_Action_4 (points to a Java class that is run)
  - Java_Action_5 (points to a Java class that is run)
  - etc.