I want to assign two control plots to each of the project plots. In the first iteration, I want to assign the nearest control plot (with the minimum value of dist_project_plot_id*) to the project plot being evaluated. If this control plot has already been assigned to a project plot, we look for the next nearest control plot.
Once all project plots have been assigned a first control plot, we assign a second control plot to each project plot, following the same criterion: find the control plot with the minimum distance as long as it has not been previously assigned to another project plot.
I have a dataframe which looks like:
`data = {
'control_plot_id': [1526258, 1507770, 1539206, 1528123, 2019722, 1504105],
'dist_project_plot_id1': [3025.22, 2670.43, 2140.41, 1697.68, 3999.77, 2783.97],
'dist_project_plot_id2': [488.07, 427.82, 1180.68, 1386.38, 4739.51, 590.44],
'dist_project_plot_id3': [2033.15, 2193.51, 2958.56, 3168.14, 5573.02, 2008.31]
}
df = pd.DataFrame(data)`
Where: control_plot_id represents the identifiers of the control plots dist_project_plot_id1 represents the distance between the control plot and project plot 1 dist_project_plot_id2 represents the distance between the control plot and project plot 2 etc.
I have tried the first search in the following code:
import pandas as pd
df = pd.DataFrame(data)
# Add new columns "PP" and "dist"
df['PP'] = ''
df['dist'] = np.nan
# Get the column names starting with 'project_plot_id'
project_columns = [col for col in df.columns if col.startswith('project_plot_id')]
# Iterate over the project_plot_id columns
for col in project_columns:
# Sort the dataframe by the current column in ascending order
df_sorted = df.sort_values(col)
# Find the k-nearest control plots for the current column
k = 1 # Set the value of k
nearest_control_plots = []
for i in range(k):
min_value = df_sorted.loc[~df_sorted['control_plot_id'].isin(nearest_control_plots)].head(1)[['control_plot_id', col]]
nearest_control_plots.append(min_value['control_plot_id'].values[0])
df.loc[df['control_plot_id'] == min_value['control_plot_id'].values[0], 'PP'] = col
df.loc[df['control_plot_id'] == min_value['control_plot_id'].values[0], 'dist'] = min_value[col].values[0]
What I am not able to program is that if a control plot has already been selected for a project plot, the code should continue searching for the next nearest control plot, which could be the third, the fourth, or even the last one. Maybe a particular library is doing what I am looking for.
The expected output should have the following columns:
- control_plot_id: original control plot id
- PP: project plot for which the control plot has been selected
- dist: distance between the control plot and the control plot selected
Okay, based on the description this is what you are looking for:
This returns:
This does what you ask for:
However your example output is different from what you explained you wanted, not sure why...
NOTE after edit: If instead you wanted one row per project then, as with my answer prior to the edit when you clarified your expected output:
which would return: