Adding rows for missing months and Fill na values with last value in a partition Python

69 Views Asked by Jresearcher At 12 July 2023 at 10:59

I have a pandas dataframe and I need to fill NULL values with the last value on a partition. Specifically, for each "id" and "month", I need to create and explode last "value" on subsequent months.

Example of my dataset:

id	month	value
1	2023-01-01	London
1	2023-02-01	Paris
2	2023-01-01	New York
3	2023-02-01	Paris
4	2023-03-01	NULL

My desidered output (Exploding the values up to April 2023):

id	month	value
1	2023-01-01	London
1	2023-02-01	Paris
1	2023-03-01	Paris
1	2023-04-01	Paris
2	2023-01-01	New York
2	2023-02-01	New York
2	2023-03-01	New York
2	2023-04-01	New York
3	2023-02-01	Paris
3	2023-03-01	Paris
3	2023-04-01	Paris
4	2023-03-01	NULL
4	2023-04-01	NULL

Thank u!

Original Q&A

There are 1 best solutions below

mozway On 12 July 2023 at 11:03 BEST ANSWER

One option using reshaping as a rectangular intermediate with pivot/stack:

df['month'] = pd.to_datetime(df['month'])

out = (df
    .pivot(index='id', columns='month', values='value')
    .reindex(columns=pd.date_range('2023-01-01', '2023-04-01', freq='MS').rename('month'))
    .ffill(axis=1)
    .stack().reset_index(name='value')
)

Or with a MultiIndex and groupby.ffill:

out = (df
   .set_index(['id', 'month'])
   .reindex(pd.MultiIndex.from_product([df['id'].unique(),
                                        pd.date_range('2023-01-01', '2023-04-01', freq='MS')],
                                      names=['id', 'month']
                                      ))
   .groupby(level=0).ffill().dropna().reset_index()
)

Output:

    id      month     value
0    1 2023-01-01    London
1    1 2023-02-01     Paris
2    1 2023-03-01     Paris
3    1 2023-04-01     Paris
4    2 2023-01-01  New York
5    2 2023-02-01  New York
6    2 2023-03-01  New York
7    2 2023-04-01  New York
8    3 2023-02-01     Paris
9    3 2023-03-01     Paris
10   3 2023-04-01     Paris

keeping/filling original NaNs:

You can add a helper column to identify the original NaNs:

out = (df
   .set_index(['id', 'month']).assign(flag=True)
   .reindex(pd.MultiIndex.from_product([df['id'].unique(),
                                        pd.date_range('2023-01-01', '2023-04-01', freq='MS')],
                                      names=['id', 'month']
                                      ))
   .groupby(level=0).ffill().loc[lambda d: d.pop('flag').notna()].reset_index()
)

Output:

    id      month     value
0    1 2023-01-01    London
1    1 2023-02-01     Paris
2    1 2023-03-01     Paris
3    1 2023-04-01     Paris
4    2 2023-01-01  New York
5    2 2023-02-01  New York
6    2 2023-03-01  New York
7    2 2023-04-01  New York
8    3 2023-02-01     Paris
9    3 2023-03-01     Paris
10   3 2023-04-01     Paris
11   4 2023-03-01       NaN
12   4 2023-04-01       NaN

Adding rows for missing months and Fill na values with last value in a partition Python

There are 1 best solutions below

keeping/filling original NaNs:

Related Questions in PYTHON

Related Questions in PANDAS

Related Questions in EXPLODE

Related Questions in FILL

Trending Questions

Popular # Hahtags

Popular Questions