Context:
I am able to submit a MapReduce job from the Druid overlord to an EMR cluster. My data source is in S3 in Parquet format. The timestamp field value is in the format "2017-09-01 21:14:11:552 IST".
The error occurs while parsing the timestamp.
The stack trace is:
2018-01-18T19:31:52,509 INFO [task-runner-0-priority-0] org.apache.hadoop.mapreduce.Job - Task Id : attempt_1516108443547_0022_m_000068_0, Status : FAILED
Error: io.druid.java.util.common.RE: Failure on row[{"t": "2017-09-01 21:14:11:552 IST"}]
at io.druid.indexer.HadoopDruidIndexerMapper.map(HadoopDruidIndexerMapper.java:91)
at io.druid.indexer.DetermineHashedPartitionsJob$DetermineCardinalityMapper.run(DetermineHashedPartitionsJob.java:288)
..
Caused by: java.lang.IllegalArgumentException: Invalid format: "2017-09-01 21:14:11:552 IST" is malformed at "IST"
at org.joda.time.format.DateTimeFormatter.parseDateTime(DateTimeFormatter.java:945)
at io.druid.java.util.common.parsers.TimestampParser.lambda$createTimestampParser$4(TimestampParser.java:93)
at io.druid.java.util.common.parsers.TimestampParser.lambda$createObjectTimestampParser$8(TimestampParser.java:129)
. .
I have tried several format patterns, but I could not find one that the Joda library can parse. However, the timestamp is parseable with java.text.SimpleDateFormat; see the following code:
Sample Java program to parse Date
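import java.text.SimpleDateFormat;
import java.util.Date;
import java.util.TimeZone;

public class ParseDate {
    public static void main(String[] args) {
        String text = "2017-09-01 21:14:11:552 IST";
        SimpleDateFormat sdf = new SimpleDateFormat("yyyy-MM-dd HH:mm:ss:SSS zzz");
        TimeZone gmt = TimeZone.getTimeZone("GMT");
        sdf.setTimeZone(gmt);
        sdf.setLenient(false); // reject out-of-range field values
        try {
            Date date = sdf.parse(text);
            System.out.println(date);             // printed in the JVM default zone
            System.out.println(sdf.format(date)); // printed with the formatter's pattern
        } catch (Exception e) {
            e.printStackTrace();
        }
    }
}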
Output
Fri Sep 01 21:14:11 IST 2017
2017-09-01 21:14:11:552 IST
Environment:
Druid version: 0.11
EMR version : emr-5.11.0
Hadoop version: Amazon 2.7.3
Druid input json
{
  "type": "index_hadoop",
  "spec": {
    "ioConfig": {
      "type": "hadoop",
      "inputSpec": {
        "type": "static",
        "inputFormat": "io.druid.data.input.parquet.DruidParquetInputFormat",
        "paths": "s3://s3_path"
      }
    },
    "dataSchema": {
      "dataSource": "parquet_test1",
      "granularitySpec": {
        "type": "uniform",
        "segmentGranularity": "DAY",
        "queryGranularity": "ALL",
        "intervals": ["2017-08-01T00:00:00:000Z/2017-08-02T00:00:00:000Z"]
      },
      "parser": {
        "type": "parquet",
        "parseSpec": {
          "format": "timeAndDims",
          "timestampSpec": {
            "column": "t",
            "format": "yyyy-MM-dd HH:mm:ss:SSS zzz"
          },
          "dimensionsSpec": {
            "dimensions": [
              "dim1", "dim2", "dim3"
            ],
            "dimensionExclusions": [],
            "spatialDimensions": []
          }
        }
      },
      "metricsSpec": [{
        "type": "count",
        "name": "count"
      }, {
        "type": "count",
        "name": "pid",
        "fieldName": "pid"
      }]
    },
    "tuningConfig": {
      "type": "hadoop",
      "partitionsSpec": {
        "targetPartitionSize": 5000000
      },
      "jobProperties": {
        "mapreduce.job.user.classpath.first": "true",
        "fs.s3.awsAccessKeyId": "KEYID",
        "fs.s3.awsSecretAccessKey": "AccessKey",
        "fs.s3.impl": "org.apache.hadoop.fs.s3native.NativeS3FileSystem",
        "fs.s3n.awsAccessKeyId": "KEYID",
        "fs.s3n.awsSecretAccessKey": "AccessKey",
        "fs.s3n.impl": "org.apache.hadoop.fs.s3native.NativeS3FileSystem",
        "io.compression.codecs": "org.apache.hadoop.io.compress.GzipCodec,org.apache.hadoop.io.compress.DefaultCodec,org.apache.hadoop.io.compress.BZip2Codec,org.apache.hadoop.io.compress.SnappyCodec"
      },
      "leaveIntermediate": true
    }
  },
  "hadoopDependencyCoordinates": ["org.apache.hadoop:hadoop-client:2.7.3", "org.apache.hadoop:hadoop-aws:2.7.3", "com.hadoop.gplcompression:hadoop-lzo:0.4.20"]
}
Possible solutions
1. How can "2017-09-01 21:14:11:552 IST" be parsed with a Joda format?
2. Is there any config to use SimpleDateFormat for parsing the date in timestampSpec, given that the Joda library is used by default?
You have failed to parse the timezone abbreviation "IST". Such abbreviations are often ambiguous.
In this case, "IST" can stand for "Europe/Dublin" (Irish Summer Time), "Asia/Jerusalem" (Israel Standard Time), or "Asia/Kolkata" (India Standard Time). Looking at your name, I strongly assume that you want India time.
Below I discuss several possible solutions with their advantages and drawbacks. A time library can use different strategies to resolve zone name ambiguities. Either it allows users to specify explicitly which zone they want (user preference), or it uses the region/country information of the current or associated locale to resolve the name.
Joda-Time
The only solution is realized by the following code:
While this approach, based on an explicit user preference, will probably satisfy your requirements because you don't need to change your dependency and preferred library, I consider it not so great for two reasons:
I recommend setting the user preference only once, during program initialization. Then you can probably work with Joda.
Old SimpleDateFormat class
Yes, that works for you but not for me, because the locale on my machine is not India, and I get the timestamp/instant of Israel (3.5 hours difference to India). We see that this old class uses the region info of the associated locale in the background to resolve the name ambiguity, not the explicitly set tz-offset GMT (via sdf.setTimeZone(gmt);). So please be very cautious about where your code is running.
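To make the 3.5-hour discrepancy concrete, here is a minimal sketch. It assumes that "IST" is read either as India Standard Time (UTC+05:30) or as Israel Standard Time (UTC+02:00), and it applies those two fixed offsets to the same local timestamp. The class and method names are hypothetical, chosen only for illustration.

```java
import java.time.Duration;
import java.time.Instant;
import java.time.LocalDateTime;
import java.time.ZoneOffset;
import java.time.format.DateTimeFormatter;

public class IstAmbiguityDemo {
    // Pattern for the local part of the timestamp only (no zone abbreviation).
    static final DateTimeFormatter LOCAL_PART =
            DateTimeFormatter.ofPattern("uuuu-MM-dd HH:mm:ss:SSS");

    // Interprets the local date-time at the given fixed offset.
    static Instant interpret(String localText, ZoneOffset offset) {
        return LocalDateTime.parse(localText, LOCAL_PART).toInstant(offset);
    }

    public static void main(String[] args) {
        String localText = "2017-09-01 21:14:11:552";
        // Assumption: India Standard Time is UTC+05:30.
        Instant asIndia = interpret(localText, ZoneOffset.ofHoursMinutes(5, 30));
        // Assumption: Israel Standard Time is UTC+02:00.
        Instant asIsrael = interpret(localText, ZoneOffset.ofHours(2));
        System.out.println(asIndia);
        System.out.println(asIsrael);
        // The two readings of "IST" differ by three and a half hours.
        System.out.println(Duration.between(asIndia, asIsrael));
    }
}
```

The same text therefore maps to two different instants depending on how the abbreviation is resolved, which is why the environment matters.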
java.time (Java-8 or later)
This experiment reveals that the locale information is unfortunately not used for resolving the tz-ambiguity. But it is possible to specify the user preference via a builder-based approach:
Here, the user preference can be given as a local parameter to the parser, and it does not suffer from any multi-threading problem (better than Joda).
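The builder-based approach described above might look like the following sketch. It relies on DateTimeFormatterBuilder.appendZoneText(TextStyle, Set<ZoneId>), which accepts a set of preferred zones for resolving ambiguous names. The choice of "Asia/Kolkata" as the preferred zone and the class and method names are assumptions made for illustration, and the result depends on the JDK's zone-name data mapping "IST" to that zone in the English locale.

```java
import java.time.ZoneId;
import java.time.ZonedDateTime;
import java.time.format.DateTimeFormatter;
import java.time.format.DateTimeFormatterBuilder;
import java.time.format.TextStyle;
import java.util.Collections;
import java.util.Locale;

public class IstParseExample {
    // Parses text such as "2017-09-01 21:14:11:552 IST", resolving the
    // abbreviation in favour of the given preferred zone.
    static ZonedDateTime parseWithPreferredZone(String text, ZoneId preferred) {
        DateTimeFormatter formatter = new DateTimeFormatterBuilder()
                .appendPattern("uuuu-MM-dd HH:mm:ss:SSS ")
                // The preference is local to this formatter, not global state.
                .appendZoneText(TextStyle.SHORT, Collections.singleton(preferred))
                .toFormatter(Locale.ENGLISH);
        return ZonedDateTime.parse(text, formatter);
    }

    public static void main(String[] args) {
        ZonedDateTime zdt = parseWithPreferredZone(
                "2017-09-01 21:14:11:552 IST", ZoneId.of("Asia/Kolkata"));
        System.out.println(zdt);
        System.out.println(zdt.toInstant());
    }
}
```

Because the preferred zone is passed to the formatter itself, different parts of a program can resolve "IST" differently without interfering with each other.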
Time4J (my lib)
It can use a builder approach similar to Java-8 to set the user-preference (not shown here), or it can deploy a non-fixed-offset parameter in constructing the formatter or use the locale information parameter (for greatest flexibility).