Dataframes: Filtering based on Overlapping Dates

60 Views Asked by At

I have 2 dataframes, Requests and Services. Below is an example of the data they contain:

Request Event Dataframe

     Client ID Event Reference Event Start Date Event End Date
0          100             100       01/04/2023     04/04/2023
1          100             101       07/06/2023     07/06/2023
2          101             102       01/04/2023     04/04/2023
3          102             103       01/05/2023     10/05/2023

{'Client ID': {0: 100, 1: 100, 2: 101, 3: 102}, 'Event Reference': {0: 100, 1: 101, 2: 102, 3: 103}, 'Event Start Date': {0: '01/04/2023', 1: '07/06/2023', 2: '01/04/2023', 3: '01/05/2023'}, 'Event End Date': {0: '04/04/2023', 1: '07/06/2023', 2: '04/04/2023', 3: '10/05/2023'}}

Services Event Dataframe

     Client ID Event Reference Event Start Date Event End Date
0          100            1000       01/02/2022     04/03/2023
1          100            1001       10/04/2023            NaN
2          102            1002       01/04/2023     01/05/2023
3          102            1003       02/05/2023     10/07/2023

{'Client ID': {0: 100, 1: 100, 2: 102, 3: 102}, 'Event Reference': {0: 1000, 1: 1001, 2: 1002, 3: 1003}, 'Event Start Date': {0: '01/02/2022', 1: '10/04/2023', 2: '01/04/2023', 3: '02/05/2023'}, 'Event End Date': {0: '04/03/2023', 1: nan, 2: '01/05/2023', 3: '10/07/2023'}}

Please note that all requests will have end dates, but some services will not, i.e. they are ongoing. Not all clients in the Requests dataframe will have a service.

I need to determine the following - for each request event for a specific client, did they have any service which was active during the request period. If they did then that request needs to go into a "currently open" dataframe. If they did not then that request needs to go into a "new request" dataframe. Based on the above data i would want to see the following result dataframes:

Currently Open Dataframe

     Client ID Event Reference Event Start Date Event End Date
0          100             101       07/06/2023     07/06/2023
1          102             103       01/05/2023     10/05/2023

New Request Dataframe

     Client ID Event Reference Event Start Date Event End Date
0          100             100       01/04/2023     04/04/2023
1          101             102       01/04/2023     04/04/2023

I am pretty new to python coding so I have been using ChatGPT to give me a leg-up. It came up with some code that I've been tweaking to no avail so far (please note LTS is my name for the services dataframe):

# Intialise an empty dataframe to store results

new_requests = pd.DataFrame()

# Iterate over each row in the requests dataframe
for index, request_row in requests.iterrows():
    
    # Filter relevant rows in the LTS dataframe for the current client
    relevant_lts_rows = lts[lts['Client ID'] == request_row['Client ID']]
    
    # Check for overlapping events
    overlapping_events = relevant_lts_rows[
        (relevant_lts_rows['Event Start Date'] <= request_row['Event End Date']) & 
        ((relevant_lts_rows['Event End Date'] > request_row['Event Start Date']) | pd.isnull(relevant_lts_rows['Event End Date']))]
    
    # If no overlapping events, add current request to the new_requests dataframe
    if overlapping_events.empty:
        new_requests = pd.concat([new_requests, pd.DataFrame(request_row).transpose()])

# Reset index for new_requests dataframe
new_requests.reset_index(drop=True, inplace=True)

print(new_requests)

Unfortunately it does not seem to extract the correct requests into the new_requests dataframe. Any suggestions would be most appreciated!

1

There are 1 best solutions below

0
Chizo On

Ok the issue has been solved, with help from someone on another forum :) The key problem is that my date fields are not being read as dates but as strings (I think). The answer is to force python to format my event start date and event end date columns as dates when I import my csv data:

date_columns = ["Event Start Date", "Event End Date"] date_formats = "%d/%m/%Y" rbc_data = pd.read_csv(rbc_cld, parse_dates=date_columns, date_format=date_formats)

After that it appears that the correct request events are going into the correct dataframes based on checking if dates are overlapping.