Transformation of "Serial date" to the specified date format


I wrote a function that converts all the dates of the specified date column to the specified format. Any missing or invalid date is replaced with a value specified by the user.

The code is also meant to handle "serial dates" such as "5679", but the serial numbers are not converted. Where is the issue?

My code:

import pandas as pd
import math

def date_fun(df, date_inputs):
    col_name = date_inputs["DateColumn"]
    date_format = date_inputs["DateFormat"]
    replace_value = date_inputs.get("ReplaceDate", None)
    
    # Convert column values to string
    df[col_name] = df[col_name].astype(str)
    
    # Check if the column contains serial dates
    if df[col_name].str.isnumeric().all():
        # Convert the column to integer
        df[col_name] = pd.to_numeric(df[col_name], errors='coerce')
        
        # Check if the values are within the valid range of serial dates in Excel
        if df[col_name].between(1, 2958465).all():
            df[col_name] = pd.to_datetime(df[col_name], unit='D', errors='coerce')
        else:
            if replace_value is not None:
                df[col_name] = replace_value
            else:
                df[col_name] = "Invalid Date"
    else:
        df[col_name] = pd.to_datetime(df[col_name], errors='coerce')
    
    # Convert the datetime values to the specified format
    df[col_name] = df[col_name].dt.strftime(date_format)
    
    # Replace invalid or null dates with the specified value (if any)
    if replace_value is not None:
        replace_value = str(replace_value) # convert to string
        df[col_name] = df[col_name].fillna(replace_value)
    
    new_data = df[col_name].to_dict()
    
    # Handle NaN and infinity values
    def handle_nan_inf(val):
        if isinstance(val, float) and (math.isnan(val) or math.isinf(val)):
            return str(val)
        else:
            return val
    
    new_data = {k: handle_nan_inf(v) for k, v in new_data.items()}
    
    return new_data

Example:

Input: 45678
Expected Output: 2025-01-21
Current Output: NaN 

  

Input

25.09.2019
9/16/2015
10.12.2017
02.12.2014
08-Mar-18
08-12-2016
26.04.2016
05-03-2016
24.12.2016
10-Aug-19
abc
05-06-2015
12-2012-18
24-02-2010
2008,13,02
16-09-2015
23-01-1992, 7:45

2nd December 2018
45678

My output

            "2019/09/25",
            "2015/09/16",
            "2017/10/12",
            "2014/02/12",
            "2018/03/08",
            "2016/08/12",
            "2016/04/26",
            "2016/05/03",
            "2016/12/24",
            "2019/08/10",
            "nan",
            "2015/05/06",
            "nan",
            "2010/02/24",
            "2008/02/01",
            "2015/09/16",
            "1992/01/23",
            "nan",
            "2018/12/02",
            "nan"

Date Format specified: "%Y/%m/%d"

Answer by Anne

You use if df[col_name].str.isnumeric().all():

This checks whether every value in the column consists only of numeric characters. Here it returns False, because the column also contains dates written with dashes, dots and other separators. The whole column is therefore passed to pd.to_datetime() with errors='coerce', which turns the serial date into NaT (the nan in your output).

I think what you wanted is to decide this row by row. You could do that over the whole df using df.iterrows(). Since you already rely heavily on the default to_datetime() anyway, you could also call it first and then iterate only over the rows it could not parse (with errors='coerce' these come back as NaT rather than raising an error).

Either way, you should go through the other inputs and outputs as well, as there seem to be more errors (e.g. 2008,13,02 becomes "2008/02/01").
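
A rough sketch of that per-row idea, in case it helps. Instead of calling .all() on the whole column, it builds a boolean mask of the all-digit entries, converts only those as Excel serial dates, and leaves everything else to pd.to_datetime(). The names parse_mixed_dates, EXCEL_ORIGIN and MAX_SERIAL are made up for this illustration, and it assumes the serials follow Excel's 1900 date system (day zero effectively at 1899-12-30). The upper bound used here is pandas' own nanosecond Timedelta limit, which is far lower than the 2958465 in the question.

```python
import pandas as pd

# Assumption: serials use Excel's 1900 date system.
EXCEL_ORIGIN = pd.Timestamp("1899-12-30")
# Largest whole number of days a nanosecond-based pandas Timedelta can hold.
MAX_SERIAL = pd.Timedelta.max.days


def parse_mixed_dates(values: pd.Series) -> pd.Series:
    """Parse a column that mixes date strings and Excel serial numbers."""
    text = values.astype(str).str.strip()

    # Per-row mask: True only where the entry consists entirely of digits.
    is_serial = text.str.fullmatch(r"\d+")

    # Serial branch: digits -> number of days -> offset from the Excel origin.
    days = pd.to_numeric(text.where(is_serial), errors="coerce")
    days = days.where(days.between(1, MAX_SERIAL))  # out of range -> NaN
    from_serial = EXCEL_ORIGIN + pd.to_timedelta(days, unit="D")

    # Text branch: same call as in the question; failures become NaT.
    from_text = pd.to_datetime(text.where(~is_serial), errors="coerce")

    # Each row is filled by at most one of the two branches.
    return from_serial.fillna(from_text)


sample = pd.Series(["2019-09-25", "45678", "abc", "5679", ""])
parsed = parse_mixed_dates(sample)
print(parsed.dt.strftime("%Y/%m/%d").fillna("Invalid Date").tolist())
```

This only shows how to separate the serial numbers from the rest. Depending on the pandas version, a column whose strings use several different layouts may still need extra handling in the pd.to_datetime() call.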

Answer by Golden Lion

https://gist.github.com/OmarArain/9959241

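Try pandas' to_timedelta: add the serial number, as a number of days, to the Excel origin date 1899-12-30.

import pandas as pd

date = 5679

def convert_excel_time(excel_time):
    '''
    Convert an Excel serial date (int or float) to a pandas datetime object.
    For a Series of fractional serials, round the result with
    .dt.round('1min') to correct floating-point conversion inaccuracy.
    '''
    # Excel serial dates effectively count days from 1899-12-30
    return pd.to_datetime('1899-12-30') + pd.to_timedelta(excel_time, 'D')

print(convert_excel_time(date))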

output:

 1915-07-19 00:00:00
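
A variation on the same idea, offered as a sketch rather than the one right way: pd.to_datetime() accepts an origin argument, so the starting date can be passed directly instead of adding a timedelta by hand. Without origin, unit='D' counts days from the Unix epoch (1970-01-01), which is one more reason the call in the question would not produce Excel dates even for an all-numeric column. The variable names below are only for illustration.

```python
import datetime as dt

import pandas as pd

serials = pd.Series([5679, 45678])

# unit="D" on its own counts days from the Unix epoch, 1970-01-01.
from_unix = pd.to_datetime(serials, unit="D")

# origin moves the starting point to Excel's effective day zero.
from_excel = pd.to_datetime(serials, unit="D", origin="1899-12-30")

# Cross-check the first value using only the standard library.
check = dt.date(1899, 12, 30) + dt.timedelta(days=5679)

print(from_unix.dt.strftime("%Y/%m/%d").tolist())
print(from_excel.dt.strftime("%Y/%m/%d").tolist())
print(check.isoformat())  # 1915-07-19, matching the output above
```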