I have a large set of clients, and each of them has an id. Pairs of clients are related to each other by the column "Value"; see below:
import pandas as pd
import numpy as np
# Example with 7 clients
df = pd.DataFrame(np.random.randint(7, size=(20, 2)), columns=['start id', 'end id'])
df = df[df['start id'] != df['end id']]
df = df.drop_duplicates()
df['Value'] = np.random.uniform(low=0.0, high=1.0, size=len(df))
print(df)
| | start id | end id | Value |
|---|---|---|---|
| 0 | 1 | 4 | 0.315743 |
| 1 | 4 | 3 | 0.449567 |
| 2 | 4 | 2 | 0.945336 |
| 3 | 3 | 0 | 0.556950 |
| 5 | 3 | 4 | 0.002412 |
| 6 | 2 | 1 | 0.976020 |
| 8 | 4 | 0 | 0.480784 |
| 9 | 4 | 1 | 0.798300 |
And the score is the aggregation of values over "start id":
print(df.groupby('start id').agg(Score=('Value', 'sum')))
| start id | Score |
|---|---|
| 1 | 0.315743 |
| 2 | 0.976020 |
| 3 | 0.559362 |
| 4 | 2.673987 |
My goal is to remove the smallest number of clients such that the score of every remaining client is below a threshold. Let's assume that the threshold is 0.7. Removing clients 2 and 4 is enough, and the maximum remaining score would be 0.556950:
df = df[~df['start id'].isin([2,4])]
df = df[~df['end id'].isin([2,4])]
print(df)
| | start id | end id | Value |
|---|---|---|---|
| 3 | 3 | 0 | 0.556950 |
df_2 = df.groupby('start id').agg(Score=('Value', 'sum'))
print(df_2['Score'].max())
0.55695
I have thousands of connected clients, and a brute-force approach (trying every possible combination of removals) is infeasible. What optimization algorithm would you recommend?
I think what you want to do is: figure out which ids generate scores above the threshold, wipe every row that involves those ids from the original dataframe (both as start and as end), then redo the grouping. Since removing rows can only lower the remaining sums, a single pass is enough to bring every score below the threshold, though it may remove more clients than the strict minimum.
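A minimal sketch of that approach, assuming the `df` and threshold of 0.7 from the question (the data is regenerated here so the snippet is self-contained; the exact numbers will differ from the tables above):

```python
import numpy as np
import pandas as pd

# Rebuild example data as in the question: 7 clients, random pairs and values
rng = np.random.default_rng(0)
df = pd.DataFrame(rng.integers(7, size=(20, 2)), columns=['start id', 'end id'])
df = df[df['start id'] != df['end id']].drop_duplicates()
df['Value'] = rng.uniform(low=0.0, high=1.0, size=len(df))

threshold = 0.7

# 1. Find the ids whose aggregated score exceeds the threshold
scores = df.groupby('start id')['Value'].sum()
over = scores[scores > threshold].index

# 2. Wipe every row that involves those ids, as start or as end
df_pruned = df[~df['start id'].isin(over) & ~df['end id'].isin(over)]

# 3. Redo the grouping; every remaining score is now below the threshold,
#    because removing rows can only decrease the per-id sums
new_scores = df_pruned.groupby('start id')['Value'].sum()
print(new_scores)
```

Note this is a one-pass heuristic, not a guaranteed-minimal solution: a client slightly over the threshold is dropped outright even if removing one of its counterparties would have sufficed.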