How to get a timestamp data type column without the seconds in Pyspark?


I have a timestamp column

from pyspark.sql.functions import col, date_format, to_timestamp
from pyspark.sql.types import StructType, StructField, IntegerType, StringType

data = [(1, '2023-01-22 09:00'), (2, '2023-09-11 00:09')]

schema = StructType([StructField("id", IntegerType(), False), StructField("ts", StringType(), True)])

main_df = spark.createDataFrame(data, schema)

main_df.printSchema()

root
 |-- id: integer (nullable = false)
 |-- ts: string (nullable = true)

main_df2 = main_df.withColumn('ts', date_format(to_timestamp(col('ts'), "yyyy-MM-dd HH:mm"), "yyyy-MM-dd HH:mm").cast("timestamp"))

main_df2.printSchema()

root
 |-- id: integer (nullable = false)
 |-- ts: timestamp (nullable = true)

main_df2.show()

+---+-------------------+
| id|                 ts|
+---+-------------------+
|  1|2023-01-22 09:00:00|
|  2|2023-09-11 00:09:00|
+---+-------------------+

Is it possible to have a timestamp datatype column, in Pyspark, without the seconds, like yyyy-MM-dd HH:mm?

Desired Output

+---+----------------+
| id|              ts|
+---+----------------+
|  1|2023-01-22 09:00|
|  2|2023-09-11 00:09|
+---+----------------+

root
 |-- id: integer (nullable = false)
 |-- ts: timestamp (nullable = true)

Thanks in advance

1 Answer

Answered by Alex Ott

You don't need the .cast("timestamp") after date_format; just remove it and you'll get what you need:

main_df.withColumn('ts', date_format(to_timestamp(col('ts'), "yyyy-MM-dd HH:mm"), "yyyy-MM-dd HH:mm")).show()

+---+----------------+
| id|              ts|
+---+----------------+
|  1|2023-01-22 09:00|
|  2|2023-09-11 00:09|
+---+----------------+