Propagation of uncertainties in dataframes with different missing values


I have several dataframes each representing a different area of a dataset, and have a routine for reducing these several datasets into a single curve. My issue is with propagating the uncertainties throughout the process, since each dataset has slightly different missing rows.

Each dataframe contains XYE, where there are two types of Y: sample (s) and background (b), something like this:

cols = ['x', 's', 'b', 's_err', 'b_err']
A = pd.DataFrame(np.array([[1, 2, 3], [4, 5, 0], [6, 7, 0], [1, 1, 0], [2, 2, 0]]), columns=cols)
B = pd.DataFrame(np.array([[1, 2, 3], [4, 0, 5], [6, 0, 7], [1, 0, 1], [2, 0, 2]]), columns=cols)

etc., although in reality there are several 's' and 'b' columns in each dataframe.

My plan is to use uncertainties to propagate the errors through the protocol, which involves taking weighted averages as well as simple addition and subtraction. The data errors are given as variances, so the weighted average needs to divide by the square root of the error, which obviously has to ignore the zeroes.

The question is this: how to exclude zeroes from the calculations on each dataframe in such a way that I can then combine/compare them at the end? I assume that there is something in numpy's masking abilities that would be helpful, but at this point I can't picture it.

LioWal On

I understand that what you want is to filter out the rows of your dataframe where the values in all data columns (everything except x) are equal to zero (third row of A and second row of B). Reusing your code (I had to transpose the array so that it matches the five column names), this is how I would make it work:

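import numpy as np
import pandas as pd

cols = ['x', 's', 'b', 's_err', 'b_err']
# Transpose so that each inner list becomes one column (3 rows x 5 columns)
A = pd.DataFrame(np.transpose(np.array([[1, 2, 3], [4, 5, 0], [6, 7, 0], [1, 1, 0], [2, 2, 0]])), columns=cols)
B = pd.DataFrame(np.transpose(np.array([[1, 2, 3], [4, 0, 5], [6, 0, 7], [1, 0, 1], [2, 0, 2]])), columns=cols)
# True for rows where at least one data column is non-zero
filter_A = (A.s != 0) | (A.b != 0) | (A.s_err != 0) | (A.b_err != 0)
filter_B = (B.s != 0) | (B.b != 0) | (B.s_err != 0) | (B.b_err != 0)
# Keep only the rows that pass the filter
A_clean = A[filter_A]
B_clean = B[filter_B]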

Simply put, we create two filters, one for A and one for B. Each filter is True for rows where the value in any of the data columns is non-zero, and False for rows where all of them are zero. We then apply the filter to the dataframe, which returns the dataframe without the all-zero rows. Is this what you were looking for, and does it make sense?
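
If the cleaned dataframes then have to be combined even though they are missing different rows, one possible approach is to mark the zeroes as NaN instead of dropping rows, index every dataframe by x, and let pandas align them. The sketch below is only an illustration under some assumptions: a zero means "missing", the *_err columns hold variances (as stated in the question), and the weighted average is the usual inverse-variance one. The helper names mask_zeros and weighted_mean are made up for this example.

```python
import numpy as np
import pandas as pd

cols = ['x', 's', 'b', 's_err', 'b_err']
A = pd.DataFrame(np.transpose(np.array([[1, 2, 3], [4, 5, 0], [6, 7, 0], [1, 1, 0], [2, 2, 0]])), columns=cols)
B = pd.DataFrame(np.transpose(np.array([[1, 2, 3], [4, 0, 5], [6, 0, 7], [1, 0, 1], [2, 0, 2]])), columns=cols)


def mask_zeros(df, key='x'):
    """Hypothetical helper: index by `key` and turn zeros into NaN."""
    out = df.set_index(key).astype(float)
    # Assumption: a zero in a data or error column means "no measurement"
    return out.mask(out == 0)


def weighted_mean(frames, val, var):
    """Hypothetical helper: inverse-variance weighted mean across frames.

    Rows are aligned on the index, and NaN entries are skipped.
    """
    n = range(len(frames))
    values = pd.concat([f[val] for f in frames], axis=1, keys=n)
    variances = pd.concat([f[var] for f in frames], axis=1, keys=n)
    # Weight is 1/variance; drop the weight wherever the value itself is missing
    weights = (1.0 / variances).where(values.notna())
    # min_count=1 keeps NaN for x values that are missing in every frame
    num = (values * weights).sum(axis=1, min_count=1)
    den = weights.sum(axis=1, min_count=1)
    return pd.DataFrame({val: num / den, var: 1.0 / den})


A_m = mask_zeros(A)
B_m = mask_zeros(B)
s_combined = weighted_mean([A_m, B_m], 's', 's_err')
print(s_combined)
```

With the example data, x = 1 is present in both dataframes and is averaged, while x = 2 and x = 3 each come from only one of them. Because the NaN entries are skipped row by row, no dataframe has to be trimmed to match the others before combining.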