I am generating data using TPC-DS.
I load the customers table into a DataFrame. The c_first_sales_date_sk column has values such as 2449001, which makes me think they are Julian dates in yyyyDDD format (four-digit year followed by day of year).
So far I have tried:
from pyspark.sql.functions import col, to_date
df_with_date = df.withColumn("c_first_sales_date", to_date(col("c_first_sales_date_sk"), format="yyyyDDD"))
display(df_with_date)
This converts 2449001 to 2449-01-01, which is wrong. The online converter at http://www.longpelaexpertise.com/toolsJulian.php converts the same value to 01-Jan-2024.
What am I doing wrong? How do I convert this column properly?
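One alternative reading I could check: the values may not be yyyyDDD strings at all, but Julian Day Numbers (a running count of days), in which case the conversion is a plain day offset from a known date. Below is a minimal sketch in plain Python, assuming the common convention that Julian Day Number 2440588 corresponds to 1970-01-01. The helper name jdn_to_date is hypothetical, and whether TPC-DS really uses this encoding for its date keys is an assumption, not something I have confirmed.

```python
from datetime import date, timedelta

# Assumption: Julian Day Number 2440588 falls on 1970-01-01 (the Unix epoch date).
JDN_UNIX_EPOCH = 2440588


def jdn_to_date(jdn: int) -> date:
    """Convert an integer Julian Day Number to a calendar date (hypothetical helper)."""
    return date(1970, 1, 1) + timedelta(days=jdn - JDN_UNIX_EPOCH)


print(jdn_to_date(2449001))  # 1993-01-13
```

If that reading is right, the same offset could presumably be applied in Spark with the built-in date_add function, for example date_add on the date 1970-01-01 with CAST(c_first_sales_date_sk AS INT) - 2440588 as the number of days. I have not tested that against this table.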
