I have a dataset with a list of transactions in this format:
| transaction_ID | card_number | transaction_datetime | amount | store |
|---|---|---|---|---|
| 1 | 123 | 2023-06-24 12:20:24 | 100.0 | A |
| 2 | 456 | 2023-08-27 23:12:00 | 250.0 | B |
| 3 | 123 | 2023-09-02 09:00:03 | 416.12 | A |
| 4 | 123 | 2023-09-02 10:30:03 | 6580.0 | C |
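For reference, here is the same sample data as a reproducible DataFrame (this assumes transaction_datetime is parsed as a datetime column):

```python
import pandas as pd

df = pd.DataFrame(
    {
        "transaction_ID": [1, 2, 3, 4],
        "card_number": [123, 456, 123, 123],
        "transaction_datetime": pd.to_datetime(
            ["2023-06-24 12:20:24", "2023-08-27 23:12:00",
             "2023-09-02 09:00:03", "2023-09-02 10:30:03"]
        ),
        "amount": [100.0, 250.0, 416.12, 6580.0],
        "store": ["A", "B", "A", "C"],
    }
)
```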
Based on some research online, I have created a function that counts, for each transaction, how many transactions that card has made within a given time range, for example the last hour, 3 days, or 6 months:
def rolling_count(df, freq):
    # Index by datetime so the time-based rolling window works;
    # closed="left" excludes the current transaction from the count
    return (df.set_index("transaction_datetime")
              .groupby("card_number")["card_number"]
              .rolling(freq, closed="left")
              .count()
              .fillna(0)
              .values)
I then use the function like this:
df["number_transactions_lastday"] = rolling_count(df, "1D")
I now need to create other features that take the store into account as well: instead of counting all the transactions made with that card in the past, count only the ones made at the same store.
I have seen many examples online of how to add conditions to these types of operations, but none of the solutions work in my case.
How can I add a new column to my dataframe that does rolling counts while checking whether the store is the same or not?
Example:
Input:
df["number_tr_store_last6m"] = rolling_count_store(df, "180D") # so 6 months
Expected output table:
| transaction_ID | card_number | transaction_datetime | amount | store | number_tr_store_last6m |
|---|---|---|---|---|---|
| 1 | 123 | 2023-06-24 12:20:24 | 100.0 | A | 0 |
| 2 | 456 | 2023-08-27 23:12:00 | 250.0 | B | 0 |
| 3 | 123 | 2023-09-02 09:00:03 | 416.12 | A | 1 |
| 4 | 123 | 2023-09-02 10:30:03 | 6580.0 | C | 0 |
My database is quite large, so the code needs to be as optimized as possible.
See the solution below. You can use the pandas.DataFrame.apply function, which allows you to pass multiple arguments to the calculation.
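Here is a minimal sketch of that approach, assuming transaction_datetime is already a datetime column; the inner helper count_same_store is an illustrative name, not a pandas API. The window bounds mirror the closed="left" behavior of your original function:

```python
import pandas as pd

def rolling_count_store(df, freq):
    window = pd.Timedelta(freq)  # e.g. "180D" -> 180 days

    def count_same_store(row):
        # Earlier transactions of the same card at the same store,
        # strictly before this row and within the time window
        mask = (
            (df["card_number"] == row["card_number"])
            & (df["store"] == row["store"])
            & (df["transaction_datetime"] >= row["transaction_datetime"] - window)
            & (df["transaction_datetime"] < row["transaction_datetime"])
        )
        return mask.sum()

    return df.apply(count_same_store, axis=1)

df["number_tr_store_last6m"] = rolling_count_store(df, "180D")
```

On the sample data this reproduces the expected number_tr_store_last6m column (0, 0, 1, 0). Be aware that it scans the whole frame once per row, so it can get slow on a large dataset; a likely faster alternative that follows the pattern of your original function is to group by ["card_number", "store"] instead of "card_number" alone and reuse the same rolling(freq, closed="left").count() pipeline.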