I have a geometric dataset of point features, each associated with a value. Out of ~16,000 values, about 100-200 are NaN. I'd like to fill those with the average of the values of the 5 nearest neighbors, provided at least one of those neighbors is not itself NaN. The dataset looks something like:
FID PPM_P geometry
0 0 NaN POINT (-89.79635 35.75644)
1 1 NaN POINT (-89.79632 35.75644)
2 2 NaN POINT (-89.79629 35.75644)
3 3 NaN POINT (-89.79625 35.75644)
4 4 NaN POINT (-89.79622 35.75644)
5 5 NaN POINT (-89.79619 35.75644)
6 6 NaN POINT (-89.79616 35.75644)
7 7 NaN POINT (-89.79612 35.75645)
8 8 NaN POINT (-89.79639 35.75641)
9 9 40.823028 POINT (-89.79635 35.75641)
10 10 40.040865 POINT (-89.79632 35.75641)
11 11 36.214436 POINT (-89.79629 35.75641)
12 12 34.919571 POINT (-89.79625 35.75642)
13 13 NaN POINT (-89.79622 35.75642)
14 14 NaN POINT (-89.79619 35.75642)
15 15 NaN POINT (-89.79615 35.75642)
16 16 NaN POINT (-89.79612 35.75642)
17 17 NaN POINT (-89.79609 35.75642)
18 18 NaN POINT (-89.79606 35.75642)
19 19 NaN POINT (-89.79642 35.75638)
It just so happens that many of the NaNs are near the beginning of the dataset.
I found the nearest neighbor weight matrix using:
import libpysal
from libpysal.weights import KNN
w_knn = KNN.from_dataframe(predictions_gdf, k=5)
Next I wrote:
# row-normalise weights
w_knn.transform = "r"
# create lag
predictions_gdf["averaged_PPM_P"] = libpysal.weights.lag_spatial(w_knn, predictions_gdf["PPM_P"])
But I got back NaN in the averaged_PPM_P column. Now I'm not sure what to do. Can someone give me a hand please?
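One likely reason, assuming the spatial lag is computed as a plain weighted sum of the neighbor values: a single NaN among the 5 neighbors makes the whole sum NaN. The small sketch below reproduces that with NumPy only, using four of the values from the sample above plus one NaN; the variable names are illustrative.

```python
import numpy as np

# Four valid neighbor values from the sample data plus one missing value.
neighbor_values = np.array([40.823028, 40.040865, np.nan, 36.214436, 34.919571])

# Row-normalised weights for k=5: each neighbor contributes 1/5.
weights = np.full(5, 1 / 5)

# A plain weighted sum propagates the NaN.
weighted_sum = np.dot(weights, neighbor_values)

# A NaN-aware mean ignores the missing neighbor and averages the other four.
nan_aware_mean = np.nanmean(neighbor_values)

print(weighted_sum, nan_aware_mean)
```

So any averaging scheme for this problem has to skip NaN neighbors explicitly rather than rely on the weighted sum.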
Here is one possible option using cKDTree.query from scipy:
Output (with N=2):
NB: Each red point (an FID having no PPM_P) is associated with the N nearest green points.
Final GeoDataFrame (with intermediates):
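A minimal sketch of the cKDTree.query idea, not necessarily the exact code of this answer. It assumes the tree is built only on points that have a value (the "green" points) and that each NaN point (the "red" points) receives the mean of its N nearest valid points. The helper name fill_nan_with_neighbors and the toy coordinates are hypothetical.

```python
import numpy as np
from scipy.spatial import cKDTree


def fill_nan_with_neighbors(coords, values, n_neighbors=5):
    """Return a copy of `values` where each NaN is replaced by the mean of
    the `n_neighbors` nearest points that do have a value."""
    coords = np.asarray(coords, dtype=float)
    values = np.asarray(values, dtype=float)
    missing = np.isnan(values)
    filled = values.copy()
    if not missing.any() or missing.all():
        return filled  # nothing to fill, or nothing to fill from

    # Build the tree on valid points only, so every neighbor has a value.
    tree = cKDTree(coords[~missing])
    k = min(n_neighbors, int((~missing).sum()))

    # Indices refer to rows of coords[~missing]; force shape (n_missing, k),
    # because query() drops the last axis when k == 1.
    _, idx = tree.query(coords[missing], k=k)
    idx = np.asarray(idx).reshape(int(missing.sum()), k)

    filled[missing] = values[~missing][idx].mean(axis=1)
    return filled


# Toy example (hypothetical data) with N=2.
coords = np.array([[0, 0], [1, 0], [2, 0], [10, 0], [11, 0]])
values = np.array([np.nan, 10.0, 20.0, 40.0, np.nan])
filled = fill_nan_with_neighbors(coords, values, n_neighbors=2)
print(filled)  # values 15, 10, 20, 40, 30
```

For the GeoDataFrame in the question, the coordinates could be taken from predictions_gdf.geometry.x and predictions_gdf.geometry.y, stacked with np.column_stack, and the result assigned to predictions_gdf["averaged_PPM_P"]. Two caveats: distances are computed on the raw coordinates, so with longitude/latitude they are in degrees, and reprojecting to a projected CRS first may be preferable. Also, this variant always finds N valid neighbors however far away they are, which is slightly different from averaging whatever valid values happen to fall among the 5 nearest points overall.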