Calculating Moving Average on bi-temporal dataset


I'm trying to calculate a moving average on a bitemporal dataset. The dataset consists of a data date and an effective date (the date at which the data became available). The data for a given data date can be restated several times in the future (same data date, but a different effective date). I need to calculate a moving average over the past 4 quarters using the data that is valid as of the effective date of the row being calculated. For example, for the row (2005-06-30, 2005-08-20) below, the values valid on 2005-08-20 are 11 for 2005-03-31 (restated on 2005-05-30) and 9.5 for 2005-06-30, so the average is 10.25.

The dataset looks like this:

id datadate effdate value
1 2005-03-31 2005-04-15 10
1 2005-03-31 2005-05-30 11
1 2005-06-30 2005-07-15 9
1 2005-06-30 2005-08-20 9.5
1 2005-06-30 2005-10-15 9.6
1 2005-09-30 2005-10-15 10.5
1 2005-09-30 2005-11-10 11
1 2005-09-30 2006-02-20 10.75
1 2005-12-31 2006-02-13 12
1 2005-12-31 2006-02-20 11.5
1 2005-12-31 2006-05-10 11
1 2006-03-31 2006-04-20 8
1 2006-03-31 2006-05-10 8.25

The result should be:

id datadate effdate Value MAvg
0 1 2005-03-31 2005-04-15 10 10
1 1 2005-03-31 2005-05-30 11 11
2 1 2005-06-30 2005-07-15 9 10
3 1 2005-06-30 2005-08-20 9.5 10.25
4 1 2005-06-30 2005-10-15 9.6 10.30
5 1 2005-09-30 2005-10-15 10.5 10.37
6 1 2005-09-30 2005-11-10 11 10.53
7 1 2005-09-30 2006-02-20 10.75 10.45
8 1 2005-12-31 2006-02-13 12 10.9
9 1 2005-12-31 2006-02-20 11.5 10.71
10 1 2005-12-31 2006-05-10 11 10.59
11 1 2006-03-31 2006-04-20 8 9.96
12 1 2006-03-31 2006-05-10 8.25 9.9

I'm doing this in Python using pandas. My approach is to join the DataFrame with itself on id, restricted to the previous 4 quarters, and calculate the candidate effdates for each period from the effdates of those past 4 quarters; then I join once again on id, datadate and effdate, and calculate the average.

keys["id"]
calc_df = df1.merge(df2, on=keys, how='left')
calc_df = calc_df.loc[
            (calc_df["datadate_x"] >= calc_df["datadate_y"])
            & (calc_df["datadate_y"] >= calc_df["datadate_x"] - pd.tseries.offsets.MonthEnd(n=9))
            & (calc_df["effdate_x"] <= calc_df["thrudate_y"])
            & (calc_df["thrudate_x"] >= calc_df["effdate_y"])
        ]
calc_df = calc_df.drop_duplicates().reset_index(drop=True)
grp_keys = keys + ["datadate_x"]
calc_df["effdate"] = calc_df[["effdate_x", "effdate_y"]].max(axis=1)
calc_df = calc_df.sort_values(grp_keys + ["effdate"]).drop_duplicates(
            subset=grp_keys + ["effdate"], keep="first"
        )
calc_df = calc_df['id', 'datadate_x', 'effdate', 'value']

# re-attach every restatement, then keep, for each candidate effdate, the
# rows of the past 4 quarters whose validity window contains it
calc_df = calc_df.merge(df1, on=["id"], how="left")
calc_df = calc_df.loc[
    (calc_df["datadate_x"] >= calc_df["datadate"])
    & (calc_df["datadate"] >= calc_df["datadate_x"] - pd.tseries.offsets.MonthEnd(n=9))
    & (calc_df["effdate_x"] <= calc_df["thrudate"])
    & (calc_df["effdate_x"] >= calc_df["effdate_y"])
]

# average the values of the (up to) 4 quarters visible at each candidate effdate
calc_df["MAvg"] = calc_df.groupby(["id", "datadate_x", "effdate_x"])["value"].transform("mean")

This works, but it's very slow when I run it on the full dataset, which has around 2,000 different ids with datadates from 2000 to the most recent quarter (around 500K rows), and I have to calculate the moving averages for several fields over different windows. So I want to see if there is a more efficient way of doing this.
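For what it's worth, one direction I've been exploring is replacing the self-joins with as-of joins. Below is a minimal sketch using pd.merge_asof, not my production code: it assumes datadate and effdate are datetime64 columns, datadate always falls on a quarter end, and the frame has exactly the columns shown above.

import pandas as pd

def bitemporal_mavg(df, quarters=4):
    # As-of moving average: for each row's effdate, pick the latest
    # restatement of each of the past `quarters` quarters.
    merged = df.sort_values("effdate").reset_index(drop=True)
    for k in range(quarters):
        right = df[["id", "datadate", "effdate", "value"]].rename(
            columns={"value": f"v{k}"}
        )
        if k:
            # shift quarter k's datadate forward so it lines up with the
            # row whose trailing window it belongs to
            right["datadate"] = right["datadate"] + pd.offsets.QuarterEnd(k)
        right = right.sort_values("effdate")
        # latest right row per (id, shifted datadate) with effdate <= this effdate
        merged = pd.merge_asof(merged, right, on="effdate", by=["id", "datadate"])
    # average over however many of the quarters exist and are visible yet
    merged["MAvg"] = merged[[f"v{k}" for k in range(quarters)]].mean(axis=1)
    return merged.drop(columns=[f"v{k}" for k in range(quarters)])

On the sample above this reproduces the expected MAvg column (rows come back sorted by effdate), and it replaces the quadratic self-join with four sorted as-of joins.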


1 Answer

Answered by s3bw:

It's unclear what would constitute an answer to this question, since you've only asked for something "better" without indicating which axis you'd like the improvement on.

So I'll provide something that is better in terms of readability. This might help others find improvements in other directions:

Given the CSV file data.csv:

datadate,effdate,value
2005-03-31,2005-04-15,10
2005-03-31,2005-05-30,11
2005-06-30,2005-07-15,9
2005-06-30,2005-08-20,9.5
2005-06-30,2005-10-15,9.6
2005-09-30,2005-10-15,10.5
2005-09-30,2005-11-10,11
2005-09-30,2006-02-20,10.75
2005-12-31,2006-02-13,12
2005-12-31,2006-02-20,11.5
2005-12-31,2006-05-10,11
2006-03-31,2006-04-20,8
2006-03-31,2006-05-10,8.25

You can achieve the same outcome by doing the following:

import pandas as pd

data = pd.read_csv('data.csv').sort_values(by=["effdate"])


def moving_avg(df, quarters):
    # within the window, keep only the latest restatement of each
    # datadate, then restrict to the trailing `quarters` quarters
    df = df[
        df.index <= df.last_valid_index()
    ].drop_duplicates(subset="datadate", keep="last").tail(quarters)
    # average over however many quarters are actually available
    return df["value"].sum() / df["value"].count()


# iterating a window object yields one expanding sub-frame per row
# (supported in pandas >= 1.1)
print(data.assign(MAvg=[moving_avg(d, 4) for d in data.expanding()]).sort_index())

This provides:

      datadate     effdate  value       MAvg
0   2005-03-31  2005-04-15  10.00  10.000000
1   2005-03-31  2005-05-30  11.00  11.000000
2   2005-06-30  2005-07-15   9.00  10.000000
3   2005-06-30  2005-08-20   9.50  10.250000
4   2005-06-30  2005-10-15   9.60  10.300000
5   2005-09-30  2005-10-15  10.50  10.366667
6   2005-09-30  2005-11-10  11.00  10.533333
7   2005-09-30  2006-02-20  10.75  10.450000
8   2005-12-31  2006-02-13  12.00  10.900000
9   2005-12-31  2006-02-20  11.50  10.712500
10  2005-12-31  2006-05-10  11.00  10.587500
11  2006-03-31  2006-04-20   8.00   9.962500
12  2006-03-31  2006-05-10   8.25   9.900000
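Note that the sample CSV above drops the id column. If you need this per id, a minimal sketch (assuming an id column has been added back to data) is to run the same expanding computation within each group:

# run the expanding moving average separately for each id
result = data.groupby("id", group_keys=False).apply(
    lambda g: g.assign(MAvg=[moving_avg(d, 4) for d in g.expanding()])
)
print(result.sort_index())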