How to find mean value of k nearest points when some values are NaN?

I have a geometric dataset of point features, each associated with a value. Out of ~16,000 values, about 100-200 are NaN. I'd like to fill those with the average of the values of the 5 nearest neighbors, provided at least one of those neighbors is not itself NaN. The dataset looks something like:

    FID      PPM_P                    geometry
0     0        NaN  POINT (-89.79635 35.75644)
1     1        NaN  POINT (-89.79632 35.75644)
2     2        NaN  POINT (-89.79629 35.75644)
3     3        NaN  POINT (-89.79625 35.75644)
4     4        NaN  POINT (-89.79622 35.75644)
5     5        NaN  POINT (-89.79619 35.75644)
6     6        NaN  POINT (-89.79616 35.75644)
7     7        NaN  POINT (-89.79612 35.75645)
8     8        NaN  POINT (-89.79639 35.75641)
9     9  40.823028  POINT (-89.79635 35.75641)
10   10  40.040865  POINT (-89.79632 35.75641)
11   11  36.214436  POINT (-89.79629 35.75641)
12   12  34.919571  POINT (-89.79625 35.75642)
13   13        NaN  POINT (-89.79622 35.75642)
14   14        NaN  POINT (-89.79619 35.75642)
15   15        NaN  POINT (-89.79615 35.75642)
16   16        NaN  POINT (-89.79612 35.75642)
17   17        NaN  POINT (-89.79609 35.75642)
18   18        NaN  POINT (-89.79606 35.75642)
19   19        NaN  POINT (-89.79642 35.75638)

It just so happens that many of the NaNs are near the beginning of the dataset.

I found the nearest neighbor weight matrix using:

import libpysal
from libpysal.weights import KNN

w_knn = KNN.from_dataframe(predictions_gdf, k=5)

Next I wrote:

# row-normalise weights
w_knn.transform = "r"

# create lag
predictions_gdf["averaged_PPM_P"] = libpysal.weights.lag_spatial(w_knn, predictions_gdf["PPM_P"])

But I got back NaN in the averaged_PPM_P column. Now I'm not sure what to do. Can someone give me a hand please?

Answer by Timeless:

Here is one possible option using cKDTree.query from SciPy:

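import numpy as np
import pandas as pd
from scipy.spatial import cKDTree

def knearest(gdf, **kwargs):
    # True where PPM_P has a value, False where it is NaN
    notna = gdf["PPM_P"].notnull()
    coordinates = gdf.get_coordinates().to_numpy()

    # build the tree on the valid points only, then look up the
    # nearest valid points for every NaN point
    dist, idx = (
        cKDTree(coordinates[notna]).query(
            coordinates[~notna], **kwargs)
    )

    # idx holds positions within the valid subset; this gives one
    # list of neighbour values per NaN row
    _ser = pd.Series(
        gdf.loc[notna, "PPM_P"].to_numpy()[idx].tolist(),
        index=(~notna)[lambda s: s].index,
    )

    # replace each NaN with the mean of its neighbours' values
    gdf.loc[~notna, "PPM_P"] = _ser.map(np.mean)

    return gdf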

N = 2 # feel free to make it 5, or whatever..

out = knearest(gdf.to_crs(3662), k=range(1, N + 1))#.to_crs(4326)
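
To see what the function does without needing a GeoDataFrame, here is a minimal sketch of the same idea on plain NumPy arrays. The helper name fill_nan_with_knn_mean and the toy data are made up for illustration; only cKDTree.query comes from the answer above.

```python
import numpy as np
from scipy.spatial import cKDTree

def fill_nan_with_knn_mean(coords, values, k=2):
    """Return a copy of `values` with each NaN replaced by the mean
    of the k nearest points that do have a value."""
    coords = np.asarray(coords, dtype=float)
    values = np.asarray(values, dtype=float)
    valid = ~np.isnan(values)
    if valid.all() or not valid.any():
        return values.copy()  # nothing to fill, or nothing to fill from

    k = min(k, int(valid.sum()))  # cannot use more neighbours than exist
    tree = cKDTree(coords[valid])
    # Passing a list keeps idx 2-D, shape (n_missing, k), even for k == 1.
    _, idx = tree.query(coords[~valid], k=list(range(1, k + 1)))

    filled = values.copy()
    # idx refers to positions inside values[valid], not inside values.
    filled[~valid] = values[valid][idx].mean(axis=1)
    return filled

# Hypothetical toy data: five points on a line, two without a value.
coords = np.array([[0.0, 0.0], [1.0, 0.0], [2.0, 0.0], [3.0, 0.0], [4.0, 0.0]])
values = np.array([10.0, np.nan, 20.0, np.nan, 40.0])
print(fill_nan_with_knn_mean(coords, values, k=2))
# -> [10. 15. 20. 30. 40.]
```

The tree is built only from points that have a value, so a NaN can never be picked as a neighbour.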

Output (with N=2):

[Image: plot of the points, red = no PPM_P, green = has PPM_P]

NB: Each red point (an FID with no PPM_P) is associated with its N nearest green points (FIDs that have a PPM_P):

{
    1: [0, 10],
    2: [11, 12],
    3: [12, 4],
    5: [14, 4],
    6: [14, 7],
    8: [9, 0],
    13: [4, 14],
    15: [14, 7],
    16: [7, 14],
    17: [7, 14],
    18: [7, 14],
    19: [9, 0],
}
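
One detail behind a mapping like the one above: the indices returned by query are positions within the filtered (non-NaN) coordinate array, not labels of the original frame, so they have to be translated back. A rough sketch of that translation, with a hypothetical helper name (neighbour_labels) and made-up toy data:

```python
import numpy as np
import pandas as pd
from scipy.spatial import cKDTree

def neighbour_labels(coords, values, k=2):
    """Map the label of each NaN row to the labels of its k nearest valid rows."""
    valid = values.notna().to_numpy()
    valid_labels = values.index[valid]
    missing_labels = values.index[~valid]

    _, idx = cKDTree(coords[valid]).query(
        coords[~valid], k=list(range(1, k + 1))
    )
    # idx holds positions inside coords[valid]; translate them to index labels.
    return {
        label: valid_labels[row].tolist()
        for label, row in zip(missing_labels, idx)
    }

# Hypothetical toy data: labels 100..104, unevenly spaced on a line.
coords = np.array([[0.0, 0.0], [1.2, 0.0], [2.0, 0.0], [3.4, 0.0], [4.0, 0.0]])
values = pd.Series([10.0, np.nan, 20.0, np.nan, 40.0],
                   index=[100, 101, 102, 103, 104])
print(neighbour_labels(coords, values, k=2))
# -> {101: [102, 100], 103: [104, 102]}
```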

Final GeoDataFrame (with intermediates):


    FID  PPM_P (OP)  PPM_P (INTER)      PPM_P                       geometry
0     0   34.919571            NaN  34.919571  POINT (842390.581 539861.877)
1     1         NaN      37.480218  37.480218  POINT (842399.476 539861.532)
2     2         NaN      35.567003  35.567003  POINT (842408.370 539861.187)
3     3         NaN      35.567003  35.567003  POINT (842420.229 539860.726)
4     4   36.214436            NaN  36.214436  POINT (842429.124 539860.381)
5     5         NaN      38.127651  38.127651  POINT (842438.018 539860.036)
6     6         NaN      40.431946  40.431946  POINT (842446.913 539859.691)
7     7   40.823028            NaN  40.823028  POINT (842458.913 539862.868)
8     8         NaN      37.871299  37.871299  POINT (842378.298 539851.425)
9     9   40.823028            NaN  40.823028  POINT (842390.158 539850.965)
10   10   40.040865            NaN  40.040865  POINT (842399.052 539850.620)
11   11   36.214436            NaN  36.214436  POINT (842407.947 539850.275)
12   12   34.919571            NaN  34.919571  POINT (842419.947 539853.452)
13   13         NaN      38.127651  38.127651  POINT (842428.841 539853.107)
14   14   40.040865            NaN  40.040865  POINT (842437.736 539852.761)
15   15         NaN      40.431946  40.431946  POINT (842449.595 539852.301)
16   16         NaN      40.431946  40.431946  POINT (842458.489 539851.956)
17   17         NaN      40.431946  40.431946  POINT (842467.384 539851.611)
18   18         NaN      40.431946  40.431946  POINT (842476.278 539851.266)
19   19         NaN      37.871299  37.871299  POINT (842368.981 539840.859)
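
Note that knearest averages the N nearest points that have a value. That is slightly different from what the question describes, which is the 5 nearest points of any kind, ignoring the NaNs among them and leaving the point as NaN if all of them are NaN. If that exact behaviour is needed, something along these lines might work. This is only a sketch on plain arrays: the helper name nanmean_of_knn and the toy data are hypothetical, and it assumes there are no duplicate coordinates.

```python
import numpy as np
from scipy.spatial import cKDTree

def nanmean_of_knn(coords, values, k=5):
    """Fill each NaN with the NaN-ignoring mean of its k nearest neighbours
    (valid or not). Points whose k neighbours are all NaN stay NaN."""
    coords = np.asarray(coords, dtype=float)
    values = np.asarray(values, dtype=float)
    missing = np.isnan(values)
    if not missing.any():
        return values.copy()

    k = min(k, len(values) - 1)
    # Ask for k + 1 neighbours: the closest one is the point itself
    # (assumption: no two points share the same coordinates).
    _, idx = cKDTree(coords).query(coords[missing], k=list(range(1, k + 2)))
    neigh = values[idx[:, 1:]]  # drop the self column

    # Only rows with at least one non-NaN neighbour can be filled.
    has_valid = ~np.isnan(neigh).all(axis=1)
    rows = np.flatnonzero(missing)

    filled = values.copy()
    filled[rows[has_valid]] = np.nanmean(neigh[has_valid], axis=1)
    return filled

# Hypothetical toy data: six points on a line, NaNs clustered at the start.
coords = np.column_stack([np.arange(6.0), np.zeros(6)])
values = np.array([np.nan, np.nan, np.nan, 10.0, 20.0, np.nan])
print(nanmean_of_knn(coords, values, k=2))
```

With k=2, the first two points stay NaN because both of their neighbours are NaN, the third point becomes 10.0 (one valid neighbour), and the last becomes 15.0. A second pass over the result would fill the remaining gaps from the values filled in the first pass, if that is acceptable for the data.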