Cast Date and Timestamp in Pyspark

103 Views Asked by At

I have an i/p file which is coming in a csv format where date and timestamp are coming as String in the fromat mm/dd/yy and yyyy-mm-dd-hh.mm.ss.SSSSSS.

I am writing a parquet file and converting to date and timestamp format with the below code.

if k == 'Date':
    data = data.withColumn(c, to_date(data[c],'MM/dd/yyyy'))
else:
    if k == 'Timestamp':
        data = data.withColumn(c, to_timestamp(data[c],'yyyy-MM-dd-HH.mm.ss.SSSSSS'))    

But in the o/p date is getting written in YYYY-MM-DD and timestamp is being written in the format YYYY-MM-DDTHH:MM:SS format.

Is there a way I can write the o/p as date and timestamp datatype with the format MM/dd/yyyy and yyyy-MM-dd-HH.mm.ss.SSSSSS

Need the o/p in format 07/24/2007 and 2007-07-24-14.45.09.811306 with storing them as date and timestampo datatype. But they are coming as 2007-07-24 and 2007-07-24 14:45:09 when I cast them.

1

There are 1 best solutions below

0
DileeprajnarayanThumula On

I have tried the below approach in pyspark to convert the string format to desired date format and timestamp format:

Input Data:

enter image description here

from pyspark.sql import SparkSession
from pyspark.sql.functions import to_date, to_timestamp, date_format
csv_file_path = "abfss://[email protected]/smpl.csv"
df = spark.read.option("header", "true").csv(csv_file_path)
df = df.withColumn("date1", to_date("date1", "MM/dd/yyyy"))
df = df.withColumn("date2", to_timestamp("date2", "yyyy-MM-dd-HH.mm.ss.SSSSSS"))
df = df.withColumn("date1", date_format("date1", "MM/dd/yyyy"))
df.printSchema()
df.show(truncate=False)

Results:

root |-- date1: string (nullable = true) |-- date2: timestamp (nullable = true) 

enter image description here In the above code converting the "date1" column from string format to a DateType format using the to_date function.

Converting the "date2" column from string format to a TimestampType format using the to_timestamp function. The second argument "yyyy-MM-dd-HH.mm.ss.SSSSSS" specifies the format of the input string that represents the timestamp.

Reformating the "date1" column to the output format 'MM/dd/yyyy' using the date_format function.

I have tried Casting the String datatype to date datatype

df = df.withColumn("date1", to_date(date_format(col("date1"), "MM/dd/yyyy"), "MM/dd/yyyy").cast("date"))

enter image description here