I wrote a function that converts all the dates of the specified date column to the specified format. Any missing or invalid date is replaced with a value specified by the user.
The code also takes "serial dates" like "5679" into consideration. But my code isn't working on the serial numbers. Let me know where the issue is.
My code:
import pandas as pd
import math
def date_fun(df, date_inputs):
col_name = date_inputs["DateColumn"]
date_format = date_inputs["DateFormat"]
replace_value = date_inputs.get("ReplaceDate", None)
# Convert column values to string
df[col_name] = df[col_name].astype(str)
# Check if the column contains serial dates
if df[col_name].str.isnumeric().all():
# Convert the column to integer
df[col_name] = pd.to_numeric(df[col_name], errors='coerce')
# Check if the values are within the valid range of serial dates in Excel
if df[col_name].between(1, 2958465).all():
df[col_name] = pd.to_datetime(df[col_name], unit='D', errors='coerce')
else:
if replace_value is not None:
df[col_name] = replace_value
else:
df[col_name] = "Invalid Date"
else:
df[col_name] = pd.to_datetime(df[col_name], errors='coerce')
# Convert the datetime values to the specified format
df[col_name] = df[col_name].dt.strftime(date_format)
# Replace invalid or null dates with the specified value (if any)
if replace_value is not None:
replace_value = str(replace_value) # convert to string
df[col_name] = df[col_name].fillna(replace_value)
new_data = df[col_name].to_dict()
# Handle NaN and infinity values
def handle_nan_inf(val):
if isinstance(val, float) and (math.isnan(val) or math.isinf(val)):
return str(val)
else:
return val
new_data = {k: handle_nan_inf(v) for k, v in new_data.items()}
return new_data
Example:
Input: 45678
Expected Output: 2024-06-27
Current Output: NaN
Input
25.09.2019
9/16/2015
10.12.2017
02.12.2014
08-Mar-18
08-12-2016
26.04.2016
05-03-2016
24.12.2016
10-Aug-19
abc
05-06-2015
12-2012-18
24-02-2010
2008,13,02
16-09-2015
23-01-1992, 7:45
2nd December 2018
45678
My output
"2019/09/25",
"2015/09/16",
"2017/10/12",
"2014/02/12",
"2018/03/08",
"2016/08/12",
"2016/04/26",
"2016/05/03",
"2016/12/24",
"2019/08/10",
"nan",
"2015/05/06",
"nan",
"2010/02/24",
"2008/02/01",
"2015/09/16",
"1992/01/23",
"nan",
"2018/12/02",
"nan"
Date Format specified: "%Y/%m/%d"
You use
if df[col_name].str.isnumeric().all():This checks if the entire column contains only numerical characters, which in this case will return False as there are dates separated by dashes too. The whole column is then coerced to data time, resulting in nan for the serial date.
I think what you wanted to do is to iterate by row. You could easily do this over the whole df using
df.iterrows(). As you are relying so heavily on the defaultto_datetime()already anyway, you could also call this first and then iterate only over the rows where this throws an error.Either way, you should go through the other inputs/outputs as well, as there seem to be some more errors (e.g. 2008,13,02 becomes "2008/02/01").