DUMP command in PIG not working

1.2k Views Asked by At

I wrote a simple PIG program as follows to analyze a small and a modified version of the google n-grams dataset on AWS. The data looks something like this:

I am 1936 942 90
I am 1945 811 5
I am 1951 47 12
very cool 1923 118 10
very cool 1980 320 100
very cool 2012 994 302
very cool 2017 1820 612

and has the form:

n-gram TAB year TAB occurrences TAB books NEWLINE

I wrote the following program to calculate the occurences of an ngram per book:

inp = LOAD <insert input path here> AS (ngram:chararray, year:int, occurences:int, books:int);
filter_input = FILTER inp BY (occurences >= 400) AND (books >= 8);
groupinp = GROUP filter_input BY ngram;
sum_occ = FOREACH groupinp GENERATE FLATTEN(group) AS firstcol, SUM(occurences) AS socc , SUM(books) AS nbooks;

DUMP sum_occ;

However, the DUMP command does not work and gives the following error:

892520 [main] INFO  org.apache.pig.tools.pigstats.ScriptState  - Pig features used in the script: GROUP_BY,FILTER
18/03/28 00:56:09 INFO pigstats.ScriptState: Pig features used in the script: GROUP_BY,FILTER
1892554 [main] INFO  org.apache.pig.data.SchemaTupleBackend  - Key [pig.schematuple] was not set... will not generate code.
18/03/28 00:56:09 INFO data.SchemaTupleBackend: Key [pig.schematuple] was not set... will not generate code.
1892555 [main] INFO  org.apache.pig.newplan.logical.optimizer.LogicalPlanOptimizer  - {RULES_ENABLED=[ConstantCalculator, LoadTypeCastInserter, PredicatePushdownOptimizer, StreamTypeCastInserter], RULES_DISABLED=[AddForEach, ColumnMapKeyPrune, GroupByConstParallelSetter, LimitOptimizer, MergeFilter, MergeForEach, NestedLimitOptimizer, PartitionFilterOptimizer, PushDownForEachFlatten, PushUpFilter, SplitFilter]}
18/03/28 00:56:09 INFO optimizer.LogicalPlanOptimizer: {RULES_ENABLED=[ConstantCalculator, LoadTypeCastInserter, PredicatePushdownOptimizer, StreamTypeCastInserter], RULES_DISABLED=[AddForEach, ColumnMapKeyPrune, GroupByConstParallelSetter, LimitOptimizer, MergeFilter, MergeForEach, NestedLimitOptimizer, PartitionFilterOptimizer, PushDownForEachFlatten, PushUpFilter, SplitFilter]}
1892591 [main] INFO  org.apache.pig.backend.hadoop.executionengine.tez.TezLauncher  - Tez staging directory is /tmp/temp383666093 and resources directory is /tmp/temp383666093
18/03/28 00:56:09 INFO tez.TezLauncher: Tez staging directory is /tmp/temp383666093 and resources directory is /tmp/temp383666093
1892592 [main] INFO  org.apache.pig.backend.hadoop.executionengine.tez.plan.TezCompiler  - File concatenation threshold: 100 optimistic? false
18/03/28 00:56:09 INFO plan.TezCompiler: File concatenation threshold: 100 optimistic? false
1892593 [main] INFO  org.apache.pig.backend.hadoop.executionengine.util.AccumulatorOptimizerUtil  - Reducer is to run in accumulative mode.
18/03/28 00:56:09 INFO util.AccumulatorOptimizerUtil: Reducer is to run in accumulative mode.
1892606 [main] INFO  org.apache.pig.builtin.PigStorage  - Using PigTextInputFormat
18/03/28 00:56:09 INFO builtin.PigStorage: Using PigTextInputFormat
18/03/28 00:56:09 INFO input.FileInputFormat: Total input files to process : 1
1892626 [main] INFO  org.apache.pig.backend.hadoop.executionengine.util.MapRedUtil  - Total input paths to process : 1
18/03/28 00:56:09 INFO util.MapRedUtil: Total input paths to process : 1
1892627 [main] INFO  org.apache.pig.backend.hadoop.executionengine.util.MapRedUtil  - Total input paths (combined) to process : 1
18/03/28 00:56:09 INFO util.MapRedUtil: Total input paths (combined) to process : 1
18/03/28 00:56:09 INFO hadoop.MRInputHelpers: NumSplits: 1, SerializedSize: 408
1892653 [main] INFO  org.apache.pig.backend.hadoop.executionengine.tez.TezJobCompiler  - Local resource: joda-time-2.9.4.jar
18/03/28 00:56:09 INFO tez.TezJobCompiler: Local resource: joda-time-2.9.4.jar
1892653 [main] INFO  org.apache.pig.backend.hadoop.executionengine.tez.TezJobCompiler  - Local resource: pig-0.17.0-core-h2.jar
18/03/28 00:56:09 INFO tez.TezJobCompiler: Local resource: pig-0.17.0-core-h2.jar
1892653 [main] INFO  org.apache.pig.backend.hadoop.executionengine.tez.TezJobCompiler  - Local resource: antlr-runtime-3.4.jar
18/03/28 00:56:09 INFO tez.TezJobCompiler: Local resource: antlr-runtime-3.4.jar
1892653 [main] INFO  org.apache.pig.backend.hadoop.executionengine.tez.TezJobCompiler  - Local resource: automaton-1.11-8.jar
18/03/28 00:56:09 INFO tez.TezJobCompiler: Local resource: automaton-1.11-8.jar
1892709 [main] INFO  org.apache.pig.backend.hadoop.executionengine.tez.TezDagBuilder  - For vertex - scope-239: parallelism=1, memory=1536, java opts=-Djava.net.preferIPv4Stack=true -Dhadoop.metrics.log.level=WARN -Xmx1229m -Dlog4j.configuratorClass=org.apache.tez.common.TezLog4jConfigurator -Dlog4j.configuration=tez-container-log4j.properties -Dyarn.app.container.log.dir=<LOG_DIR> -Dtez.root.logger=INFO,CLA 
18/03/28 00:56:09 INFO tez.TezDagBuilder: For vertex - scope-239: parallelism=1, memory=1536, java opts=-Djava.net.preferIPv4Stack=true -Dhadoop.metrics.log.level=WARN -Xmx1229m -Dlog4j.configuratorClass=org.apache.tez.common.TezLog4jConfigurator -Dlog4j.configuration=tez-container-log4j.properties -Dyarn.app.container.log.dir=<LOG_DIR> -Dtez.root.logger=INFO,CLA 
1892709 [main] INFO  org.apache.pig.backend.hadoop.executionengine.tez.TezDagBuilder  - Processing aliases: filter_input,groupinp,inp
18/03/28 00:56:09 INFO tez.TezDagBuilder: Processing aliases: filter_input,groupinp,inp
1892709 [main] INFO  org.apache.pig.backend.hadoop.executionengine.tez.TezDagBuilder  - Detailed locations: inp[1,6],inp[-1,-1],filter_input[2,15],groupinp[3,11]
18/03/28 00:56:09 INFO tez.TezDagBuilder: Detailed locations: inp[1,6],inp[-1,-1],filter_input[2,15],groupinp[3,11]
1892709 [main] INFO  org.apache.pig.backend.hadoop.executionengine.tez.TezDagBuilder  - Pig features in the vertex: 
18/03/28 00:56:09 INFO tez.TezDagBuilder: Pig features in the vertex: 
1892744 [main] INFO  org.apache.pig.backend.hadoop.executionengine.tez.TezDagBuilder  - Set auto parallelism for vertex scope-240
18/03/28 00:56:09 INFO tez.TezDagBuilder: Set auto parallelism for vertex scope-240
1892744 [main] INFO  org.apache.pig.backend.hadoop.executionengine.tez.TezDagBuilder  - For vertex - scope-240: parallelism=1, memory=3072, java opts=-Djava.net.preferIPv4Stack=true -Dhadoop.metrics.log.level=WARN -Xmx2458m -Dlog4j.configuratorClass=org.apache.tez.common.TezLog4jConfigurator -Dlog4j.configuration=tez-container-log4j.properties -Dyarn.app.container.log.dir=<LOG_DIR> -Dtez.root.logger=INFO,CLA 
18/03/28 00:56:09 INFO tez.TezDagBuilder: For vertex - scope-240: parallelism=1, memory=3072, java opts=-Djava.net.preferIPv4Stack=true -Dhadoop.metrics.log.level=WARN -Xmx2458m -Dlog4j.configuratorClass=org.apache.tez.common.TezLog4jConfigurator -Dlog4j.configuration=tez-container-log4j.properties -Dyarn.app.container.log.dir=<LOG_DIR> -Dtez.root.logger=INFO,CLA 
1892744 [main] INFO  org.apache.pig.backend.hadoop.executionengine.tez.TezDagBuilder  - Processing aliases: sum_occ
18/03/28 00:56:09 INFO tez.TezDagBuilder: Processing aliases: sum_occ
1892744 [main] INFO  org.apache.pig.backend.hadoop.executionengine.tez.TezDagBuilder  - Detailed locations: sum_occ[5,10]
18/03/28 00:56:09 INFO tez.TezDagBuilder: Detailed locations: sum_occ[5,10]
1892745 [main] INFO  org.apache.pig.backend.hadoop.executionengine.tez.TezDagBuilder  - Pig features in the vertex: GROUP_BY
18/03/28 00:56:09 INFO tez.TezDagBuilder: Pig features in the vertex: GROUP_BY
1892762 [main] ERROR org.apache.pig.tools.grunt.Grunt  - ERROR 2017: Internal error creating job configuration.
18/03/28 00:56:09 ERROR grunt.Grunt: ERROR 2017: Internal error creating job configuration.
Details at logfile: /mnt/var/log/pig/pig_1522196676602.log

How do I fix this?

3

There are 3 best solutions below

0
hprakash On BEST ANSWER

If you are using an old version, kindly update it (should solve your problem)

PIG scripts are lazily evaluated, so unless you use a DUMP or STORE command you will not know what is wrong with your code.

When you run your code it will again throw the following error:

ERROR 1025: Invalid field projection. Projected field [occurences] does not exist in schema: group:chararray,filter_input:bag{:tuple(ngram:chararray,year:int,occurences:int,books:int)}.

Change the below line from

sum_occ = FOREACH groupinp GENERATE FLATTEN(group) AS firstcol, SUM(occurences) AS socc , SUM(books) AS nbooks;

to

sum_occ = FOREACH groupinp GENERATE FLATTEN(group) AS firstcol, SUM(filter_input.occurences) AS socc, SUM(filter_input.books) AS nbooks;

will solve this error.

4
Koji On

I don't have enough reputation for making the comment, so writing it here. My guess is you have unclosed quote. What do you have at "insert input path here" part? Is the path enclosed with single quotes?

0
Rajnil Guha On

Not having enough reputations to comment so posting here, are writing the above pig statements in a script or running individually from grunt shell. Also can you give a brief about the logic behind sum_occ relation.