How to remove contents in multiple columns and create a new column in a DataFrame In Python

72 Views Asked by At

I have a column (name of column: KI_2783023 (Please enter supplier B-BBEE certificate expiry date:) in my SAP_Supplier_Yes data frame that has the data as below format.

KI_2783023 (Please enter supplier B-BBEE certificate expiry date:)
0   Thu, 7 Jul, 2022
4   Thu, 22 Jul, 2021
5   NaN
6   Tue, 31 May, 2022
7   Thu, 5 Aug, 2021
... ...
10431   NaN
10434   NaN
10437   NaN
10438   NaN
10440   NaN

I have also created a data dictionary:

Dt_Mnth_Dict = {'Jan': '01', 'Feb': '02', 'Mar': '03','Apr': '04', 'May': '05', 'Jun': '06','Jul': '07', 'Aug': '08', 'Sep': '09', 'Oct': '10', 'Nov': '11', 'Dec': '12'}

I have also created new columns in the df to pick out some contents of the data from the above-mentioned column as follows:

#creating columns [Date] from sliced data string from the SAP_Supplier_Yes DataFrame: 
SAP_Supplier_Yes['Date_of_Exp_Date'] = SAP_Supplier_Yes['KI_2783023 (Please enter supplier B-BBEE certificate expiry date:)'].str[5:7]

#creating columns [Year] from sliced data string from the SAP_Supplier_Yes DataFrame
SAP_Supplier_Yes['Year_of_Exp_Date'] = SAP_Supplier_Yes['KI_2783023 (Please enter supplier B-BBEE certificate expiry date:)'].str[-4:]

#slicing the data to pick out the month
SAP_Supplier_Yes['Month_of_Exp_Date'] = SAP_Supplier_Yes['KI_2783023 (Please enter supplier B-BBEE certificate expiry date:)'].str[7:11]

#removing a comma character from the Month column
SAP_Supplier_Yes['Month_of_Exp_Date_'] = SAP_Supplier_Yes['Month_of_Exp_Date'].str.replace(',', '')
#using the Dt_Mnth_Dict dictionary to replace the Month 'word' with the number as per the dictionary
SAP_Supplier_Yes['Month_of_Exp_Date_'].map(Dt_Mnth_Dict) # note: if the dictionary does not exhaustively map all and entries then non-matched entries are changed to NaNs

#creating the FINAL date field (concatenating all three created above)
SAP_Supplier_Yes['KI_2783023 (Please enter supplier B-BBEE certificate expiry date:)_'] = SAP_Supplier_Yes['Date_of_Exp_Date'] + SAP_Supplier_Yes['Month_of_Exp_Date_'] + SAP_Supplier_Yes['Year_of_Exp_Date']

When I view the 'new' column, SAP_Supplier_Yes['KI_2783023 (Please enter supplier B-BBEE certificate expiry date:)_']:, I get the dates in this format:

0         7 Jul2022
4        22 Jul2021
5               NaN
6        31 May2022
7         5 Aug2021
            ...    
10431           NaN
10434           NaN
10437           NaN
10438           NaN
10440           NaN

Instead of the following format

0         7 072022
4        22 072021
5               NaN
6        31 042022
7         5 082021
            ...    
10431           NaN
10434           NaN
10437           NaN
10438           NaN
10440           NaN

Also, I still want to change this field to a Date field with the format 'dd-mm-yyyy', as my Data Quality business rules are based on this column and related calculations?

Thanking you in advance

1

There are 1 best solutions below

1
Siyabonga_B On

I have tried the following and it worked fine (although it gave me a red Python message). Please see my solution below:

#created a date column #SAP_Supplier_Yes['Date_of_Exp_Date'] = SAP_Supplier_Yes['date_field'] = SAP_Supplier_Yes['KI_2783023 (Please enter supplier B-BBEE certificate expiry date:)'].str[5:7] #removing the space in the field SAP_Supplier_Yes['date_field_'] = SAP_Supplier_Yes['date_field'].str.replace(' ', '')

#created a 'Month' column SAP_Supplier_Yes['Month_of_Exp_Date'] = SAP_Supplier_Yes['KI_2783023 (Please enter supplier B-BBEE certificate expiry date:)'].str[7:11] #created another column to remove the [special character] comma and replacing it with nothing SAP_Supplier_Yes['Month_of_Exp_Date_'] = SAP_Supplier_Yes['Month_of_Exp_Date'].str.replace(',', '') #created another column to remove any spaces SAP_Supplier_Yes['Month_ONLY'] = SAP_Supplier_Yes['Month_of_Exp_Date_'].str.replace(' ', '')

#created a Year column SAP_Supplier_Yes['Year_of_Exp_Date'] = SAP_Supplier_Yes['KI_2783023 (Please enter supplier B-BBEE certificate expiry date:)'].str[-4:] #created another column to remove spaces SAP_Supplier_Yes['Year_of_Exp_Date'] = SAP_Supplier_Yes['Year_of_Exp_Date'].str.replace(' ', '')

#concatenated all three to create the FINAL field SAP_Supplier_Yes['FINAL_B-BBEE certificate expiry date'] = SAP_Supplier_Yes['date_field_'] + SAP_Supplier_Yes['Month_ONLY'] + SAP_Supplier_Yes['Year_of_Exp_Date']

#Creating another column to convert this field into a Date field/column SAP_Supplier_Yes['FINAL_B-BBEE_Certificate_Expiry_Date_'] = pd.to_datetime(SAP_Supplier_Yes['FINAL_B-BBEE certificate expiry date'])

Is there an shorter code I can use?