I have this datetime string in my dataset: '2023061218154258', and I want to convert it to a datetime using the code below. However, the format that I expect to work, yyyyMMddHHmmssSS, doesn't work. This code reproduces the issue:
from pyspark.sql.functions import *
spark.conf.set("spark.sql.legacy.timeParserPolicy","CORRECTED")
# If the config is set to CORRECTED then the conversion will return null instead of throwing an exception.
df=spark.createDataFrame(
data=[ ("1", "2023061218154258")
, ("2", "20230612181542.58")]
,schema=["id","input_timestamp"])
df.printSchema()
# Timestamp string to TimestampType
df.withColumn("timestamp", to_timestamp("input_timestamp", format='yyyyMMddHHmmssSS')).show(truncate=False)
df.withColumn("timestamp", to_timestamp("input_timestamp", format='yyyyMMddHHmmss.SS')).show(truncate=False)
Output:
+---+-----------------+---------+
|id |input_timestamp  |timestamp|
+---+-----------------+---------+
|1  |2023061218154258 |null     |
|2  |20230612181542.58|null     |
+---+-----------------+---------+
+---+-----------------+----------------------+
|id |input_timestamp  |timestamp             |
+---+-----------------+----------------------+
|1  |2023061218154258 |null                  |
|2  |20230612181542.58|2023-06-12 18:15:42.58|
+---+-----------------+----------------------+
I tried to_timestamp with the format yyyyMMddHHmmssSS, and I expected it to convert the string 2023061218154258 into the timestamp 2023-06-12 18:15:42.58.
The issue you're encountering is due to a limitation of the to_timestamp function in PySpark. With the CORRECTED parser policy, Spark 3 parses the string using its own datetime patterns (built on Java's DateTimeFormatter) rather than the legacy SimpleDateFormat, and the fraction pattern does not appear to parse when it directly follows the seconds with no separator. In your case, the format 'yyyyMMddHHmmssSS' returns null for 2023061218154258, while 'yyyyMMddHHmmss.SS' parses 20230612181542.58 correctly, as your second output shows. To overcome this limitation, you can reshape the string yourself and then convert it to a timestamp using other functions available in PySpark.
Here's an example of how you can achieve the desired conversion by extracting the different components from the string and creating a timestamp using the concat function:
This code extracts the individual components (year, month, day, hour, minute, second, subsecond) from the input_timestamp column and concatenates them with the appropriate delimiters to form a timestamp string. Then, the to_timestamp function is applied to convert the resulting string to a timestamp.
The output should be as follows:
As you can see, the conversion is successful, and the timestamps are in the expected format.
UPDATE