I have a pandas dataframe like below:
import pandas as pd
nan = float('nan')
data = {'col1': [1, nan, nan, nan, nan, 1, nan, nan],
'col2': [1, 1, nan, 1, 0, 0, 1, 0],
'col3': [nan, 0, nan, 1, 0, nan, nan, nan],
'col4': [1, 0, 0, 1, 0, 1, 1, 1]}
df = pd.DataFrame(data)
df
| col1 | col2 | col3 | col4 |
| 1    | 1    | NaN  | 1    |
| NaN  | 1    | 0    | 0    |
| NaN  | NaN  | NaN  | 0    |
| NaN  | 1    | 1    | 1    |
| NaN  | 0    | 0    | 0    |
| 1    | 0    | NaN  | 1    |
| NaN  | 1    | NaN  | 1    |
| NaN  | 0    | NaN  | 1    |
I want to count the consecutive null (NaN) values in every column and, if a column has more than two consecutive nulls, get the length of its longest run.
For the above df, I would get:
df_nulls = {'col1': 4, 'col2': 0, 'col3': 3, 'col4': 0}
Based on these results, the columns with more than two consecutive nulls should be dropped. In this case, the final dataframe should contain only col2 and col4. I found similar threads, but none of them solved this. How can I do this? Thanks in advance.
Code: transform + max
Alternatively, use agg instead of transform + max; the result is the same.
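The answer's original code is not reproduced above, so here is a minimal sketch of one way to do this, not necessarily the exact code the answer used. It assumes the df from the question; the helper name max_null_run and the variable names max_runs and result are made up for illustration. The idea is that every non-null value starts a new group, so summing the null flags per group gives the length of each NaN run.

```python
import pandas as pd

nan = float('nan')
df = pd.DataFrame({'col1': [1, nan, nan, nan, nan, 1, nan, nan],
                   'col2': [1, 1, nan, 1, 0, 0, 1, 0],
                   'col3': [nan, 0, nan, 1, 0, nan, nan, nan],
                   'col4': [1, 0, 0, 1, 0, 1, 1, 1]})

def max_null_run(s):
    """Length of the longest run of consecutive NaNs in a Series (hypothetical helper)."""
    isna = s.isna()
    # The counter increases at every non-null value, so the nulls that
    # follow a non-null value share its group id.
    groups = (~isna).cumsum()
    # Summing the null flags per group gives the length of each NaN run.
    return int(isna.astype(int).groupby(groups).sum().max())

# Longest NaN run per column.
max_runs = df.apply(max_null_run)

# Report the run length only where it exceeds two, otherwise 0.
df_nulls = max_runs.where(max_runs > 2, 0).to_dict()

# Keep only the columns whose longest NaN run is at most two.
result = df.loc[:, max_runs <= 2]

print(df_nulls)
print(result.columns.tolist())
```

For the sample data this should leave only col2 and col4. The threshold of two is taken from the question; change the comparison if you need a different cutoff.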