Fastest minimization algorithm to define optimal network

54 Views Asked by At

I have a large set of clients, and each of them have an id. They are related to each other by the column "Value", see below:

import pandas as pd
import numpy as np

# Example with 7 clients
df = pd.DataFrame(np.random.randint(7, size=(20, 2)), columns=['start id', 'end id'])
df = df[df['start id'] != df['end id']]
df = df.drop_duplicates()
df['Value'] = np.random.uniform(low=0.0, high=1.0, size=len(df))
print(df)

start id end id Value
0 1 4 0.315743
1 4 3 0.449567
2 4 2 0.945336
3 3 0 0.556950
5 3 4 0.002412
6 2 1 0.976020
8 4 0 0.480784
9 4 1 0.798300

And the score is the aggregation of values over "start id":

print(df.groupby('start id').agg(Score=('Value', 'sum')))
start id Score
1 0.315743
2 0.976020
3 0.559362
4 2.673987

My goal is to remove the lowest number of clients such that the score for each client is below a threshold. Let's assume that the threshold is 0.7. Removing 2 and 4 is enough, and the maximum score would be 0.55695:

df = df[~df['start id'].isin([2,4])]
df = df[~df['end id'].isin([2,4])]
print(df)
start id end id Value
3 3 0 0.55695
df_2 = df.groupby('start id').agg(Score=('Value', 'sum'))
print(df_2['Score'].max())
0.55695

I have thousands of clients that are connected, and a brute force approach (removing each possible combination is unfeasible), what optimization algorithm would you recommend?

1

There are 1 best solutions below

0
benvigano On

I think what you want to do is figure out which ids will generate scores higher than the threshold, then wipe the rows that involve those ids from the original dataframe (both as start and as end), then redo the grouping.

threshold = 0.7

# Pre-compute group-by
grouped_df = df.groupby('start id').agg(Score=('Value', 'sum'))

# Isolate ids to remove
ids_to_remove = grouped_df[grouped_df["Score"] > threshold].index

# Edit the dataframe
df = df[(~df['start id'].isin(ids_to_remove))&(~df['end id'].isin(ids_to_remove))]

# Compute final group-by
grouped_df = df.groupby('start id').agg(Score=('Value', 'sum'))

print(grouped_df['Score'].max())
>> 0.55695