Calculating Moving Average on bi-temporal dataset


I'm trying to calculate a moving average on a bitemporal dataset. The dataset consists of a data date and an effective date (the date at which the data became available). The data for a given data date can be restated several times in the future (same data date, but a different effective date). I need to calculate a moving average over the past 4 quarters using the data that is valid as of the effective date of the row being calculated. For example, for the row (2005-06-30, 2005-08-20) below, the values valid on 2005-08-20 are 11 for 2005-03-31 (restated on 2005-05-30) and 9.5 for 2005-06-30, so the average is 10.25.

The dataset looks like this:

id datadate effdate value
1 2005-03-31 2005-04-15 10
1 2005-03-31 2005-05-30 11
1 2005-06-30 2005-07-15 9
1 2005-06-30 2005-08-20 9.5
1 2005-06-30 2005-10-15 9.6
1 2005-09-30 2005-10-15 10.5
1 2005-09-30 2005-11-10 11
1 2005-09-30 2006-02-20 10.75
1 2005-12-31 2006-02-13 12
1 2005-12-31 2006-02-20 11.5
1 2005-12-31 2006-05-10 11
1 2006-03-31 2006-04-20 8
1 2006-03-31 2006-05-10 8.25

The result should be:

id datadate effdate Value MAvg
0 1 2005-03-31 2005-04-15 10 10
1 1 2005-03-31 2005-05-30 11 11
2 1 2005-06-30 2005-07-15 9 10
3 1 2005-06-30 2005-08-20 9.5 10.25
4 1 2005-06-30 2005-10-15 9.6 10.30
5 1 2005-09-30 2005-10-15 10.5 10.37
6 1 2005-09-30 2005-11-10 11 10.53
7 1 2005-09-30 2006-02-20 10.75 10.45
8 1 2005-12-31 2006-02-13 12 10.9
9 1 2005-12-31 2006-02-20 11.5 10.71
10 1 2005-12-31 2006-05-10 11 10.59
11 1 2006-03-31 2006-04-20 8 9.96
12 1 2006-03-31 2006-05-10 8.25 9.9

I'm doing this in Python using pandas. My approach is to join the DataFrame with itself on id, restricted to the previous 4 quarters, and calculate the candidate effdates for each period from the effdates of those past 4 quarters; then I join once again on id, datadate and effdate, and calculate the average.

keys["id"]
calc_df = df1.merge(df2, on=keys, how='left')
calc_df = calc_df.loc[
            (calc_df["datadate_x"] >= calc_df["datadate_y"])
            & (calc_df["datadate_y"] >= calc_df["datadate_x"] - pd.tseries.offsets.MonthEnd(n=9))
            & (calc_df["effdate_x"] <= calc_df["thrudate_y"])
            & (calc_df["thrudate_x"] >= calc_df["effdate_y"])
        ]
calc_df = calc_df.drop_duplicates().reset_index(drop=True)
grp_keys = keys + ["datadate_x"]
calc_df["effdate"] = calc_df[["effdate_x", "effdate_y"]].max(axis=1)
calc_df = calc_df.sort_values(grp_keys + ["effdate"]).drop_duplicates(
            subset=grp_keys + ["effdate"], keep="first"
        )
calc_df = calc_df['id', 'datadate_x', 'effdate', 'value']

# re-attach every restatement, then keep, for each candidate effdate, the
# rows of the past 4 quarters whose validity window contains it
calc_df = calc_df.merge(df1, on=["id"], how="left")
calc_df = calc_df.loc[
    (calc_df["datadate_x"] >= calc_df["datadate"])
    & (calc_df["datadate"] >= calc_df["datadate_x"] - pd.tseries.offsets.MonthEnd(n=9))
    & (calc_df["effdate_x"] <= calc_df["thrudate"])
    & (calc_df["effdate_x"] >= calc_df["effdate_y"])
]

# average the values of the (up to) 4 quarters visible at each candidate effdate
calc_df["MAvg"] = calc_df.groupby(["id", "datadate_x", "effdate_x"])["value"].transform("mean")

This works, but it's very slow when I run it on the full dataset, which has around 2,000 different ids with datadates from 2000 to the most recent quarter (around 500K rows), and I have to calculate the moving averages for several fields over different windows. So I want to see if there is a more efficient way of doing this.
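For what it's worth, one direction I've been exploring is replacing the self-joins with as-of joins. Below is a minimal sketch using pd.merge_asof, not my production code: it assumes datadate and effdate are datetime64 columns, datadate always falls on a quarter end, and the frame has exactly the columns shown above.

import pandas as pd

def bitemporal_mavg(df, quarters=4):
    # As-of moving average: for each row's effdate, pick the latest
    # restatement of each of the past `quarters` quarters.
    merged = df.sort_values("effdate").reset_index(drop=True)
    for k in range(quarters):
        right = df[["id", "datadate", "effdate", "value"]].rename(
            columns={"value": f"v{k}"}
        )
        if k:
            # shift quarter k's datadate forward so it lines up with the
            # row whose trailing window it belongs to
            right["datadate"] = right["datadate"] + pd.offsets.QuarterEnd(k)
        right = right.sort_values("effdate")
        # latest right row per (id, shifted datadate) with effdate <= this effdate
        merged = pd.merge_asof(merged, right, on="effdate", by=["id", "datadate"])
    # average over however many of the quarters exist and are visible yet
    merged["MAvg"] = merged[[f"v{k}" for k in range(quarters)]].mean(axis=1)
    return merged.drop(columns=[f"v{k}" for k in range(quarters)])

On the sample above this reproduces the expected MAvg column (rows come back sorted by effdate), and it replaces the quadratic self-join with four sorted as-of joins.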


1 Answer

Answered by s3bw:

It's unclear what would constitute an answer to this question, since you've only asked for something "better" without indicating which axis you'd like the improvement on.

So I'll provide something that is better in terms of readability. This might help others find improvements in other directions:

Given the CSV file data.csv:

datadate,effdate,value
2005-03-31,2005-04-15,10
2005-03-31,2005-05-30,11
2005-06-30,2005-07-15,9
2005-06-30,2005-08-20,9.5
2005-06-30,2005-10-15,9.6
2005-09-30,2005-10-15,10.5
2005-09-30,2005-11-10,11
2005-09-30,2006-02-20,10.75
2005-12-31,2006-02-13,12
2005-12-31,2006-02-20,11.5
2005-12-31,2006-05-10,11
2006-03-31,2006-04-20,8
2006-03-31,2006-05-10,8.25

You can achieve the same outcome by doing the following:

import pandas as pd

data = pd.read_csv('data.csv').sort_values(by=["effdate"])


def moving_avg(df, quarters):
    # within the window, keep only the latest restatement of each
    # datadate, then restrict to the trailing `quarters` quarters
    df = df[
        df.index <= df.last_valid_index()
    ].drop_duplicates(subset="datadate", keep="last").tail(quarters)
    # average over however many quarters are actually available
    return df["value"].sum() / df["value"].count()


# iterating a window object yields one expanding sub-frame per row
# (supported in pandas >= 1.1)
print(data.assign(MAvg=[moving_avg(d, 4) for d in data.expanding()]).sort_index())

This provides:

      datadate     effdate  value       MAvg
0   2005-03-31  2005-04-15  10.00  10.000000
1   2005-03-31  2005-05-30  11.00  11.000000
2   2005-06-30  2005-07-15   9.00  10.000000
3   2005-06-30  2005-08-20   9.50  10.250000
4   2005-06-30  2005-10-15   9.60  10.300000
5   2005-09-30  2005-10-15  10.50  10.366667
6   2005-09-30  2005-11-10  11.00  10.533333
7   2005-09-30  2006-02-20  10.75  10.450000
8   2005-12-31  2006-02-13  12.00  10.900000
9   2005-12-31  2006-02-20  11.50  10.712500
10  2005-12-31  2006-05-10  11.00  10.587500
11  2006-03-31  2006-04-20   8.00   9.962500
12  2006-03-31  2006-05-10   8.25   9.900000
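Note that the sample CSV above drops the id column. If you need this per id, a minimal sketch (assuming an id column has been added back to data) is to run the same expanding computation within each group:

# run the expanding moving average separately for each id
result = data.groupby("id", group_keys=False).apply(
    lambda g: g.assign(MAvg=[moving_avg(d, 4) for d in g.expanding()])
)
print(result.sort_index())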