Group customers by periods in python

86 Views Asked by At

I can't wrap my head around this problem. I have a sample Python Pandas DataFrame that holds data on vehicles and people who owned or co-owned them at different times. There are 4 important columns:

  • CREATED_ON_DT is when the record was created
  • CHANGED_ON_DT is when the record was changed
  • SALE_DT_KEY is when the vehicle was sold to the very first customer
  • RELATION_KEY is the ownership status:
    • 9 for the former owner
    • 4 for the former co-owner
    • 3 for the current owner
    • 5 for the current co-owner

Note:

  • for CREATED_ON_DT, the record does not change
  • for CHANGED_ON_DT, the ownership status is given by RELATION_KEY

The way to track a vehicle sale is to get the CHANGED_ON_DT value for status 9 if the next row is also status 9.

However, the problem is that owners and co-owners can switch back and forth and the point of sale would sometimes be status 9 to status 4, which means the former owner sold the car to someone who later switched from being the owner to a co-owner.

What I'm trying to do is to capture groups of owners. For example, suppose the following persons were involved in the selling and re-selling of a car over time:

  • Group 1: Person A sold an interest in a car they owned to their partner, Person B, who then became a co-owner.
  • Group 2: Persons A and B then sold the car to family C which switched owners and co-owners within the family.
  • Group 3: Family C then sold the car to Family D.

The following code has the sample data and my attempt to at least capture status periods. It also has a NEEDED_RESULT column.

Thank you in advance!

# Sample dataframe
copy_of_df = pd.DataFrame({
    "VEHICLE_KEY": [4331229, 4331229, 4331229, 4331229, 4331229, 4331229, 4331229, 4331229, 4331229],
    "PERSON_KEY": [10778254, 10778265, 20841772, 20122268, 20935263, 20962087, 21096582, 21096580, 21133677],
    "CREATED_ON_DT": pd.to_datetime(["2013-08-27 00:33:48", "2013-08-27 00:33:48", "2016-07-26 12:07:41", "2016-08-01 11:35:57", "2016-08-01 11:36:00", "2016-08-03 00:27:22", "2016-08-04 00:26:04", "2016-08-04 00:26:04", "2016-08-07 00:26:01"]),
    "CHANGED_ON_DT": pd.to_datetime(["2016-07-26 12:07:40", "2016-07-26 12:07:40", "2016-08-01 11:35:57", "2016-08-03 00:19:06", "2016-08-03 00:19:14", "2016-08-04 00:18:21", "2016-08-07 00:19:51", "2016-08-07 00:19:38", "2016-08-07 05:02:01"]),
    "SALE_DT_KEY": [20130826, 20130826, 20130826, 20130826, 20130826, 20130826, 20130826, 20130826, 20130826],
    "RELATION_KEY": [9, 4, 4, 9, 4, 9, 4, 9, 3],
    "NEEDED_RESULT":['group1', 'group1', 'group2', 'group2', 'group2', 'group3', 'group4', 'group4', 'group5']
})

# df
copy_of_df = copy_of_df.sort_values(['VEHICLE_KEY', 'PERSON_KEY', 'CREATED_ON_DT'])

# Define the status codes
current_owner = 3
former_owner = 9
coowner = 5
former_coowner = 4

# Global counter
global_counter = 0
total_rows = len(copy_of_df)

# Function to assign ownership-coownership combination
def assign_combination(group):
    global global_counter
    owner_count = 0
    last_owner_status = None
    ownership_start_date = group.iloc[0]['SALE_DT_KEY'] if pd.notnull(group.iloc[0]['SALE_DT_KEY']) else group.iloc[0]['CREATED_ON_DT']
    
    for i, row in group.iterrows():
        print(f"Processing row {global_counter+1} of {total_rows}")
        
        current_owner_status = row['RELATION_KEY']
        
        
        # Check if a new ownership period starts
        if current_owner_status in [current_owner, coowner] and (last_owner_status in [former_owner, former_coowner]):
            print('3')
            owner_count += 1            
            #group.loc[group['ownership_combination'] == f'{owner_count}ownercoowner', 'ownership_period'] = f'{ownership_start_date} - {pd.Timestamp("today")}'
            group.at[i, 'ownership_combination'] = f'{owner_count}ownercoowner'
            group.at[i, 'ownership_period'] = f'{ownership_start_date} - {pd.Timestamp("today")}'
        # Check if a new ownership period starts
        elif current_owner_status in [current_owner, coowner] and (last_owner_status is None):
            print('y')
            owner_count += 1
            #group.loc[group['ownership_combination'] == f'{owner_count}ownercoowner', 'ownership_period'] = f'{ownership_start_date} - {pd.Timestamp("today")}'
            group.at[i, 'ownership_combination'] = f'{owner_count}ownercoowner'
            group.at[i, 'ownership_period'] = f'{ownership_start_date} - {pd.Timestamp("today")}'
            #ownership_start_date = row['CHANGED_ON_DT']
        elif current_owner_status in [former_owner, former_coowner] and (last_owner_status in [former_owner, former_coowner]):
            print('2')
            
            owner_count += 1
            ownership_end_date = row['CHANGED_ON_DT']
            #group.loc[group['ownership_combination'] == f'{owner_count}ownercoowner', 'ownership_period'] = f'{ownership_start_date} - {ownership_end_date}'
            group.at[i, 'ownership_combination'] = f'{owner_count}ownercoowner'
            print(ownership_start_date)
            group.at[i, 'ownership_period'] = f'{ownership_start_date} - {ownership_end_date}'
            # Reset the start date for the next ownership period
            
            ownership_start_date = row['CHANGED_ON_DT']
            
        elif current_owner_status in [former_owner, former_coowner] and (last_owner_status is None):
            print('1')
            owner_count = 1
            ownership_start_date = row['SALE_DT_KEY']
            
            ownership_end_date = row['CHANGED_ON_DT']
            print(f'{ownership_start_date} - {ownership_end_date}')
            #group.loc[group['ownership_combination'] == f'{owner_count}ownercoowner', 'ownership_period'] = f'{ownership_start_date} - {ownership_end_date}'
            group.at[i, 'ownership_combination'] = f'{owner_count}ownercoowner'
            group.at[i, 'ownership_period'] = f'{ownership_start_date} - {ownership_end_date}'
            # Reset the start date for the next ownership period
            ownership_start_date = row['CHANGED_ON_DT']
        
        print(current_owner_status)
               
        last_owner_status = current_owner_status
        
        print(last_owner_status)
            
        
        group.loc[i, 'ownership_combination'] = f'{owner_count}ownercoowner'
  
        global_counter += 1
        

    return group

copy_of_df['ownership_combination'] = None
copy_of_df['ownership_period'] = None
copy_of_df = copy_of_df.groupby('VEHICLE_KEY').apply(assign_combination)
0

There are 0 best solutions below