I can't wrap my head around this problem. I have a sample Python Pandas DataFrame that holds data on vehicles and people who owned or co-owned them at different times. There are 4 important columns:
- CREATED_ON_DT is when the record was created
- CHANGED_ON_DT is when the record was changed
- SALE_DT_KEY is when the vehicle was sold to the very first customer
- RELATION_KEY is the ownership status:
- 9 for the former owner
- 4 for the former co-owner
- 3 for the current owner
- 5 for the current co-owner
Note:
- for CREATED_ON_DT, the record does not change
- for CHANGED_ON_DT, the ownership status is given by RELATION_KEY
The way to track a vehicle sale is to get the CHANGED_ON_DT value for status 9 if the next row is also status 9.
However, the problem is that owners and co-owners can switch back and forth and the point of sale would sometimes be status 9 to status 4, which means the former owner sold the car to someone who later switched from being the owner to a co-owner.
What I'm trying to do is to capture groups of owners. For example, suppose the following persons were involved in the selling and re-selling of a car over time:
- Group 1: Person A sold an interest in a car they owned to their partner, Person B, who then became a co-owner.
- Group 2: Persons A and B then sold the car to family C which switched owners and co-owners within the family.
- Group 3: Family C then sold the car to Family D.
The following code has the sample data and my attempt to at least capture status periods. It also has a NEEDED_RESULT column.
Thank you in advance!
# Sample dataframe
copy_of_df = pd.DataFrame({
"VEHICLE_KEY": [4331229, 4331229, 4331229, 4331229, 4331229, 4331229, 4331229, 4331229, 4331229],
"PERSON_KEY": [10778254, 10778265, 20841772, 20122268, 20935263, 20962087, 21096582, 21096580, 21133677],
"CREATED_ON_DT": pd.to_datetime(["2013-08-27 00:33:48", "2013-08-27 00:33:48", "2016-07-26 12:07:41", "2016-08-01 11:35:57", "2016-08-01 11:36:00", "2016-08-03 00:27:22", "2016-08-04 00:26:04", "2016-08-04 00:26:04", "2016-08-07 00:26:01"]),
"CHANGED_ON_DT": pd.to_datetime(["2016-07-26 12:07:40", "2016-07-26 12:07:40", "2016-08-01 11:35:57", "2016-08-03 00:19:06", "2016-08-03 00:19:14", "2016-08-04 00:18:21", "2016-08-07 00:19:51", "2016-08-07 00:19:38", "2016-08-07 05:02:01"]),
"SALE_DT_KEY": [20130826, 20130826, 20130826, 20130826, 20130826, 20130826, 20130826, 20130826, 20130826],
"RELATION_KEY": [9, 4, 4, 9, 4, 9, 4, 9, 3],
"NEEDED_RESULT":['group1', 'group1', 'group2', 'group2', 'group2', 'group3', 'group4', 'group4', 'group5']
})
# df
copy_of_df = copy_of_df.sort_values(['VEHICLE_KEY', 'PERSON_KEY', 'CREATED_ON_DT'])
# Define the status codes
current_owner = 3
former_owner = 9
coowner = 5
former_coowner = 4
# Global counter
global_counter = 0
total_rows = len(copy_of_df)
# Function to assign ownership-coownership combination
def assign_combination(group):
global global_counter
owner_count = 0
last_owner_status = None
ownership_start_date = group.iloc[0]['SALE_DT_KEY'] if pd.notnull(group.iloc[0]['SALE_DT_KEY']) else group.iloc[0]['CREATED_ON_DT']
for i, row in group.iterrows():
print(f"Processing row {global_counter+1} of {total_rows}")
current_owner_status = row['RELATION_KEY']
# Check if a new ownership period starts
if current_owner_status in [current_owner, coowner] and (last_owner_status in [former_owner, former_coowner]):
print('3')
owner_count += 1
#group.loc[group['ownership_combination'] == f'{owner_count}ownercoowner', 'ownership_period'] = f'{ownership_start_date} - {pd.Timestamp("today")}'
group.at[i, 'ownership_combination'] = f'{owner_count}ownercoowner'
group.at[i, 'ownership_period'] = f'{ownership_start_date} - {pd.Timestamp("today")}'
# Check if a new ownership period starts
elif current_owner_status in [current_owner, coowner] and (last_owner_status is None):
print('y')
owner_count += 1
#group.loc[group['ownership_combination'] == f'{owner_count}ownercoowner', 'ownership_period'] = f'{ownership_start_date} - {pd.Timestamp("today")}'
group.at[i, 'ownership_combination'] = f'{owner_count}ownercoowner'
group.at[i, 'ownership_period'] = f'{ownership_start_date} - {pd.Timestamp("today")}'
#ownership_start_date = row['CHANGED_ON_DT']
elif current_owner_status in [former_owner, former_coowner] and (last_owner_status in [former_owner, former_coowner]):
print('2')
owner_count += 1
ownership_end_date = row['CHANGED_ON_DT']
#group.loc[group['ownership_combination'] == f'{owner_count}ownercoowner', 'ownership_period'] = f'{ownership_start_date} - {ownership_end_date}'
group.at[i, 'ownership_combination'] = f'{owner_count}ownercoowner'
print(ownership_start_date)
group.at[i, 'ownership_period'] = f'{ownership_start_date} - {ownership_end_date}'
# Reset the start date for the next ownership period
ownership_start_date = row['CHANGED_ON_DT']
elif current_owner_status in [former_owner, former_coowner] and (last_owner_status is None):
print('1')
owner_count = 1
ownership_start_date = row['SALE_DT_KEY']
ownership_end_date = row['CHANGED_ON_DT']
print(f'{ownership_start_date} - {ownership_end_date}')
#group.loc[group['ownership_combination'] == f'{owner_count}ownercoowner', 'ownership_period'] = f'{ownership_start_date} - {ownership_end_date}'
group.at[i, 'ownership_combination'] = f'{owner_count}ownercoowner'
group.at[i, 'ownership_period'] = f'{ownership_start_date} - {ownership_end_date}'
# Reset the start date for the next ownership period
ownership_start_date = row['CHANGED_ON_DT']
print(current_owner_status)
last_owner_status = current_owner_status
print(last_owner_status)
group.loc[i, 'ownership_combination'] = f'{owner_count}ownercoowner'
global_counter += 1
return group
copy_of_df['ownership_combination'] = None
copy_of_df['ownership_period'] = None
copy_of_df = copy_of_df.groupby('VEHICLE_KEY').apply(assign_combination)