Converting a datetime string with milliseconds using Spark 3.3.2 requires a mandatory dot

I have this datetime string in my dataset: '2023061218154258', and I want to convert it to a timestamp using the code below. However, the format I expect to work, namely yyyyMMddHHmmssSS, doesn't. This code reproduces the issue:

from pyspark.sql.functions import *
spark.conf.set("spark.sql.legacy.timeParserPolicy","CORRECTED")
# If the config is set to CORRECTED then the conversion will return null instead of throwing an exception.

df=spark.createDataFrame(
         data=[ ("1",  "2023061218154258")
                , ("2", "20230612181542.58")]
        ,schema=["id","input_timestamp"])
df.printSchema()

# Timestamp string to TimestampType
df.withColumn("timestamp", to_timestamp("input_timestamp", format='yyyyMMddHHmmssSS')).show(truncate=False)
df.withColumn("timestamp", to_timestamp("input_timestamp", format='yyyyMMddHHmmss.SS')).show(truncate=False)

output:

+---+-----------------+---------+
|id |input_timestamp  |timestamp|
+---+-----------------+---------+
|1  |2023061218154258 |null     |
|2  |20230612181542.58|null     |
+---+-----------------+---------+

+---+-----------------+----------------------+
|id |input_timestamp  |timestamp             |
+---+-----------------+----------------------+
|1  |2023061218154258 |null                  |
|2  |20230612181542.58|2023-06-12 18:15:42.58|
+---+-----------------+----------------------+

I tried to_timestamp with the format yyyyMMddHHmmssSS, expecting it to convert the string 2023061218154258 into the timestamp 2023-06-12 18:15:42.58.

There is 1 answer below.

Omar LARAQUI

The issue you're encountering is a limitation of how the to_timestamp function parses its format pattern. With spark.sql.legacy.timeParserPolicy set to CORRECTED, Spark 3.x uses its java.time-based datetime patterns (not the legacy Java SimpleDateFormat), and, as your own output shows, the fraction-of-second pattern (S) is only parsed when the fraction is separated from the seconds by a dot.

In your case, the format 'yyyyMMddHHmmssSS' won't work because the two fraction digits run directly into the seconds digits with no separator. To work around this limitation, you can parse the string manually and build a timestamp from its components using other functions available in PySpark.

Here's an example of how you can achieve the desired conversion by extracting the different components from the string and assembling a timestamp with the concat function (the input is first normalized with regexp_replace so that the dotted and undotted variants share the same digit layout):

from pyspark.sql.functions import *

# Strip the optional dot first so both input variants share the same 16-digit layout.
digits = regexp_replace("input_timestamp", r"\.", "")

df.withColumn("year", substring(digits, 1, 4)) \
  .withColumn("month", substring(digits, 5, 2)) \
  .withColumn("day", substring(digits, 7, 2)) \
  .withColumn("hour", substring(digits, 9, 2)) \
  .withColumn("minute", substring(digits, 11, 2)) \
  .withColumn("second", substring(digits, 13, 2)) \
  .withColumn("subsecond", substring(digits, 15, 2)) \
  .withColumn("timestamp", concat(col("year"), lit("-"), col("month"), lit("-"), col("day"),
                                  lit(" "), col("hour"), lit(":"), col("minute"), lit(":"),
                                  col("second"), lit("."), col("subsecond"))) \
  .withColumn("timestamp", to_timestamp("timestamp")) \
  .show(truncate=False)

This code first removes the optional dot, then extracts the individual components (year, month, day, hour, minute, second, subsecond) from the input_timestamp column and concatenates them with the appropriate delimiters to form a timestamp string. Then, the to_timestamp function converts the resulting string to a timestamp (no pattern argument is needed, because the concatenated string matches the default format).

The output should be as follows:

+---+-----------------+----+-----+---+----+------+------+---------+----------------------+
|id |input_timestamp  |year|month|day|hour|minute|second|subsecond|timestamp             |
+---+-----------------+----+-----+---+----+------+------+---------+----------------------+
|1  |2023061218154258 |2023|06   |12 |18  |15    |42    |58       |2023-06-12 18:15:42.58|
|2  |20230612181542.58|2023|06   |12 |18  |15    |42    |58       |2023-06-12 18:15:42.58|
+---+-----------------+----+-----+---+----+------+------+---------+----------------------+

As you can see, the conversion is successful, and the timestamps are in the expected format.
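
If you'd rather skip the helper columns, the same idea fits in one expression: inject the dot before the last two digits with regexp_replace, then parse with the dotted pattern that already works. A minimal sketch, assuming the last two digits are always the fractional part:

from pyspark.sql.functions import *

# Insert a dot before the final two digits (a no-op when one is already there),
# then parse with the dotted pattern shown to work above.
df.withColumn(
    "timestamp",
    to_timestamp(
        regexp_replace("input_timestamp", r"^(\d{14})\.?(\d{2})$", "$1.$2"),
        "yyyyMMddHHmmss.SS"
    )
).show(truncate=False)

Inputs that don't match the expected 14+2-digit shape are left unchanged by regexp_replace and come out as null.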

UPDATE

A simpler alternative is to pick the pattern per row based on whether the input already contains a dot. Note that the bare yyyyMMddHHmmssSS pattern still returns null (as the question's first output shows), so the undotted rows need the dot inserted before parsing:

from pyspark.sql.functions import *

df.withColumn("timestamp",
    when(col("input_timestamp").contains("."), to_timestamp("input_timestamp", "yyyyMMddHHmmss.SS"))
    .otherwise(to_timestamp("input_timestamp", "yyyyMMddHHmmssSS"))
).show(truncate=False)
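
With the dot inserted in the otherwise branch, both rows should again parse to 2023-06-12 18:15:42.58.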