How to dump Label Encoder values for multiple columns in a dataframe


I have a preprocessing function that performs several conversions. I defined my categorical variables as categorical_cols and encode them with LabelEncoder. My goal is to save the fitted LabelEncoder objects for later use. The encoding itself works fine and gives the expected result,

but the problem appears when I save the encoders like this and then load them in a different preprocessing function:

---- LabelEncoder Save Side ----

for column in categorical_cols:
    label_encoder = LabelEncoder()
    taken_df[column] = label_encoder.fit_transform(taken_df[column])
    label_encoders[column] = label_encoder

with open('label_encoders.pkl', 'wb') as file:
    pickle.dump(label_encoders, file)

---- End ----

---- LabelEncoder Load Side ----

categorical_cols = ['from_city', 'to_city', 'vehicle_type', 'trailer_type']

with open('label_encoders.pkl', 'rb') as file:
    label_encoders = pickle.load(file)

for column in categorical_cols:
    test_df[column] = label_encoders[column].fit_transform(test_df[column])

---- End ----

It runs without errors, but the encoded output is different from what the original encoders produced.
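A likely cause, assuming the test data does not contain every category seen during training, is that fit_transform refits the loaded encoder instead of reusing it. LabelEncoder assigns codes from the sorted unique values it is fitted on, so refitting on a subset shifts the codes. The city names below are hypothetical and only illustrate the effect:

```python
from sklearn.preprocessing import LabelEncoder

# Hypothetical training column with three cities.
train_cities = ["Ankara", "Bursa", "Izmir"]
# Hypothetical test column taken from the same data, but without "Ankara".
test_cities = ["Izmir", "Bursa"]

encoder = LabelEncoder()
encoder.fit(train_cities)

# transform() reuses the mapping learned from the training data.
codes_with_transform = encoder.transform(test_cities).tolist()

# fit_transform() throws that mapping away and learns a new one from the test data.
refit_encoder = LabelEncoder().fit(train_cities)
codes_with_refit = refit_encoder.fit_transform(test_cities).tolist()

print(codes_with_transform)  # [2, 1]
print(codes_with_refit)      # [1, 0]
```

If the test data goes through different cleaning steps than the training data (for example, the vehicle_type replacement is skipped), the codes can differ for that reason as well.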

Everything else is the same: the columns used are identical, and the test data was even selected from the original dataset to reproduce this issue. Therefore, my questions are:

  • Is it possible to save the encoders for multiple columns and use them this way, or should I save a separate pickle file for each column and load them separately?

  • Secondly, how can I solve this issue?
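As a minimal sketch of one possible approach: a single pickle file holding a dict of fitted encoders should be enough, as long as the load side calls transform rather than fit_transform. The DataFrames, the temporary file path, and the reduced column list below are assumptions made for illustration, not the real dataset:

```python
import os
import pickle
import tempfile

import pandas as pd
from sklearn.preprocessing import LabelEncoder

categorical_cols = ["from_city", "to_city"]

# Hypothetical training data standing in for the real dataset.
train_df = pd.DataFrame({
    "from_city": ["Ankara", "Bursa", "Izmir"],
    "to_city": ["Izmir", "Ankara", "Bursa"],
})

# Save side: fit one encoder per column and keep them all in one dict.
label_encoders = {}
for column in categorical_cols:
    label_encoder = LabelEncoder()
    train_df[column] = label_encoder.fit_transform(train_df[column])
    label_encoders[column] = label_encoder

pkl_path = os.path.join(tempfile.mkdtemp(), "label_encoders.pkl")
with open(pkl_path, "wb") as file:
    pickle.dump(label_encoders, file)

# Load side: restore the dict and only transform, never refit.
with open(pkl_path, "rb") as file:
    loaded_encoders = pickle.load(file)

test_df = pd.DataFrame({"from_city": ["Izmir"], "to_city": ["Ankara"]})
for column in categorical_cols:
    test_df[column] = loaded_encoders[column].transform(test_df[column])

print(test_df)
```

Note that transform raises a ValueError when it meets a label the encoder never saw during fitting, so new categories in the test data need to be handled separately.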

Here is my whole preprocessing function:

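import datetime as dt
import pickle

import pandas as pd
import pytz
from sklearn.preprocessing import LabelEncoder, MinMaxScaler


def preprocessed_data(taken_df):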
    used_cols = [....]
    taken_df = taken_df[used_cols]
    taken_df["weight"] = taken_df["weight"].str.replace(",",".")
    taken_df["weight"] = taken_df["weight"].astype(float)
    taken_df.dropna(inplace=True)
    
    # Dealing with datetime columns
    taken_df["offer_date"] = pd.to_datetime(taken_df["offer_date"]).dt.tz_localize(None).dt.tz_localize("UTC")
    taken_df["cargo_load_date"] = pd.to_datetime(taken_df["cargo_load_date"]).dt.tz_localize(None).dt.tz_localize("UTC")
    taken_df["cargo_delivery_date"] = pd.to_datetime(taken_df["cargo_delivery_date"]).dt.tz_localize(None).dt.tz_localize("UTC")
    taken_df["vehicle_assignment_date"] = pd.to_datetime(taken_df["vehicle_assignment_date"]).dt.tz_localize(None).dt.tz_localize("UTC")
    
     
    vehicle_types = {
        "(?i).*(Tir|Tır).*":"TIR",
        "(?i).*(Kamyon)":"Kamyon"
    }
    
    taken_df.loc[:,"vehicle_type"] = taken_df.loc[:,"vehicle_type"].replace(vehicle_types,regex=True)
      
    # Extract the categorical columns
    categorical_cols = ['from_city', 'to_city',"vehicle_type","trailer_type"]
    
    label_encoders = {}
    
    for column in categorical_cols:
        label_encoder = LabelEncoder()
        taken_df[column] = label_encoder.fit_transform(taken_df[column])
        label_encoders[column] = label_encoder
        
    with open('label_encoders.pkl', 'wb') as file:
        pickle.dump(label_encoders, file)

    # Factor weights
    weight_factor = 0.6
    delivery_time_factor = 0.4
    offer_date_factor = 0.2
    
    # Convert offer date as UNIX timestamp
    taken_df['offer_date'] = pd.to_datetime(taken_df['offer_date'])
    epoch = dt.datetime(1970, 1, 1, tzinfo=pytz.UTC)
    taken_df['unix_offer_date'] = (taken_df['offer_date'] - epoch).dt.total_seconds()
    
    # Convert delivery date as UNIX timestamp
    taken_df['cargo_delivery_date'] = pd.to_datetime(taken_df['cargo_delivery_date'])
    taken_df['unix_delivery_time'] = (taken_df['cargo_delivery_date'] - epoch).dt.total_seconds()
    
    # min max scaling for normalization
    scaler = MinMaxScaler()
    
    # normalizing the weight column
    taken_df['normalized_weight'] = scaler.fit_transform(taken_df['weight'].values.reshape(-1, 1))
    
    # normalization of UNIX timestamps
    taken_df['normalized_offer_date'] = scaler.fit_transform(taken_df['unix_offer_date'].values.reshape(-1, 1))
    taken_df['normalized_delivery_time'] = scaler.fit_transform(taken_df['unix_delivery_time'].values.reshape(-1, 1))
    
    with open('scaler.pkl', 'wb') as f:
        pickle.dump(scaler, f)
    
    # Calculation of priority score
    taken_df['priority_score'] = (weight_factor * taken_df['normalized_weight']) + (offer_date_factor * taken_df['normalized_offer_date']) + (delivery_time_factor * taken_df['normalized_delivery_time'])
    
    

    return taken_df
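The same refitting effect seems to apply to the scaler in this function: one MinMaxScaler is fitted three times in a row, so the object written to scaler.pkl only remembers the last column it was fitted on (the delivery time). A hedged sketch of keeping one scaler per column instead, with made-up numbers in place of the real columns:

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Hypothetical values standing in for two of the numeric columns.
weight = np.array([[10.0], [20.0], [30.0]])
unix_offer_date = np.array([[1000.0], [3000.0], [5000.0]])

# Reusing one scaler: the second fit overwrites the first.
shared = MinMaxScaler()
shared.fit_transform(weight)
shared.fit_transform(unix_offer_date)
print(shared.data_min_, shared.data_max_)  # only the offer-date range is left

# One scaler per column keeps every fitted range available for a later transform().
scalers = {
    "weight": MinMaxScaler().fit(weight),
    "unix_offer_date": MinMaxScaler().fit(unix_offer_date),
}
scaled_weight = scalers["weight"].transform(np.array([[20.0]]))
print(scaled_weight)
```

The scalers dict can be pickled in one file in the same way as the label_encoders dict.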

I also tried it this way, but that didn't work either:

encoder = LabelEncoder()
for col in categorical_cols:
    taken_df[col] = encoder.fit_transform(taken_df[col])

with open('encoder.pkl', 'wb') as f:
    pickle.dump(encoder, f)
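This variant probably fails for a related reason: every call to fit_transform replaces what the encoder learned before, so the single pickled encoder only knows the classes of the last column in the loop. A small illustration with hypothetical values (only the column names come from the question):

```python
from sklearn.preprocessing import LabelEncoder

# Hypothetical columns; the values are made up for this example.
columns = {
    "from_city": ["Ankara", "Bursa"],
    "vehicle_type": ["Kamyon", "TIR"],
}

encoder = LabelEncoder()
for name, values in columns.items():
    encoder.fit_transform(values)

# After the loop the encoder only knows the classes of the last column.
remaining_classes = encoder.classes_.tolist()
print(remaining_classes)  # ['Kamyon', 'TIR']
```

Keeping a separate encoder per column, as in the dict-based version, avoids losing the earlier mappings.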