I'm trying to select the most different combinations from a given set of variables and their values, while keeping every value more or less equally distributed.
For example, given:
{ 'cat0' : [0,1,2,3,4,5],
'cat1' : [0,1,2,3,4,5],
'cat2' : [0,1,2,3,4,5]
}
I generate a dataframe of all combinations, where each row is a unique combination of the values of the variables defined above.
| cat0 | cat1 | cat2 |
|---|---|---|
| 0 | 0 | 0 |
| 0 | 0 | 1 |
| ... | ... | ... |

And so on.
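For reference, a minimal sketch of how such a table of all combinations can be built with `itertools.product`. The helper name `all_combinations` is only illustrative, and rows are kept as plain tuples here rather than dataframe rows.

```python
import itertools

variables = {
    'cat0': [0, 1, 2, 3, 4, 5],
    'cat1': [0, 1, 2, 3, 4, 5],
    'cat2': [0, 1, 2, 3, 4, 5],
}

def all_combinations(variables):
    """Return the column names and every unique combination, one tuple per row."""
    columns = list(variables)
    rows = list(itertools.product(*(variables[col] for col in columns)))
    return columns, rows

columns, rows = all_combinations(variables)
print(columns)    # ['cat0', 'cat1', 'cat2']
print(len(rows))  # 216
print(rows[:2])   # [(0, 0, 0), (0, 0, 1)]
```

With pandas, `pd.DataFrame(rows, columns=columns)` turns these tuples into a dataframe laid out like the table above.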
From this dataframe I want to select a given number of rows. For example, if the given number of rows is 6, the output could look like:
0,0,0
1,1,1
2,2,2
3,3,3
4,4,4
5,5,5
The expected output has to keep every row as distant as possible from the others, and the values of each variable must be similarly distributed. For example, if the given number of rows is 11, the expected output could look like:
0,0,0
1,1,1
2,2,2
3,3,3
4,4,4
5,5,5
0,1,2
1,2,3
4,5,4
2,3,0
3,4,1
As you can see, for each 'cat' all the values are distributed as equally as possible, and each combination is as different as possible from the previously selected ones.
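To make the two criteria concrete, here is a rough sketch of how they could be measured on a candidate selection. Both helper names are hypothetical, and "distance" is assumed to mean the number of positions in which two rows differ (unweighted Hamming distance).

```python
import itertools
from collections import Counter

def count_spread_per_column(rows, values):
    """For each column, the gap between the most and least used value.

    A spread of 0 or 1 means the values are as evenly used as possible.
    """
    spreads = []
    for column in zip(*rows):
        counts = Counter(column)
        per_value = [counts[v] for v in values]  # 0 for values never used
        spreads.append(max(per_value) - min(per_value))
    return spreads

def min_pairwise_distance(rows):
    """Smallest number of differing positions between any two selected rows."""
    return min(
        sum(a != b for a, b in zip(r1, r2))
        for r1, r2 in itertools.combinations(rows, 2)
    )

# The 6-row example: every value used once per column, all rows fully different.
six_rows = [(i, i, i) for i in range(6)]
print(count_spread_per_column(six_rows, values=range(6)))  # [0, 0, 0]
print(min_pairwise_distance(six_rows))                     # 3
```

A good selection keeps every spread at 0 or 1 while making the minimum pairwise distance as large as possible.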
I have written a function, but it does not cover the full problem:
import itertools

import numpy as np

def get_distanced_creatives(n, combinations_df, weighted_variables={
    'cat0': 0.33,
    'cat1': 0.33,
    'cat2': 0.33,
}):
    # Weighted Hamming distance: sum the weights of the variables that differ.
    def scalar_product(v1, v2, weighted_variables=weighted_variables):
        adding = 0
        for var in weighted_variables:
            if v1[var] != v2[var]:
                adding += weighted_variables[var]
        return adding

    # Pairwise distance between every two rows of combinations_df.
    rows = [comb[1] for comb in combinations_df.iterrows()]
    distance_matrix = np.array(list(itertools.starmap(
        scalar_product, itertools.product(rows, rows)
    ))).reshape(len(combinations_df), len(combinations_df))

    # Start from a random row, then greedily add the unselected row with the
    # largest summed distance to the rows selected so far.
    initial = np.random.randint(len(combinations_df))
    list_elements = [initial]
    while len(list_elements) < n:
        aux = distance_matrix[list_elements].sum(axis=0)
        list_ordered = sorted(range(len(aux)), key=lambda k: -aux[k])
        for i in list_ordered:
            if i not in list_elements:
                list_elements.append(i)
                break
    return list_elements, combinations_df.iloc[list_elements]
It only implements the part that distributes each value equally, and it generates an undesired output. For example, for the previous combinations dataframe and n=11, it outputs:
0,0,0
1,1,1
2,2,2
3,3,3
4,4,4
5,5,5
0,1,1
1,2,2
2,3,3
3,4,4
4,5,5
As you can see, the values of each variable stay evenly distributed, but the combinations are not as different as they could be: the second and the seventh rows end with the same two values.
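A small sketch of what may be going on, assuming unweighted Hamming distance (with equal weights of 0.33 the comparison only changes by a constant factor). Once the six diagonal rows are selected, the summed distance to them is the same for `(0, 1, 1)` and `(0, 1, 2)`, even though the first one is much closer to its nearest selected row. The helper names are illustrative only.

```python
def hamming(a, b):
    """Number of positions in which two rows differ."""
    return sum(x != y for x, y in zip(a, b))

def summed_distance(candidate, selected):
    """Criterion based on the sum of distances to all selected rows."""
    return sum(hamming(candidate, row) for row in selected)

def nearest_distance(candidate, selected):
    """Distance to the closest selected row."""
    return min(hamming(candidate, row) for row in selected)

selected = [(i, i, i) for i in range(6)]

for candidate in [(0, 1, 1), (0, 1, 2)]:
    print(candidate,
          summed_distance(candidate, selected),
          nearest_distance(candidate, selected))
# (0, 1, 1) 15 1
# (0, 1, 2) 15 2
```

The sum cannot tell the two candidates apart, while the distance to the nearest selected row can. Whether ranking by the nearest distance alone also keeps the values evenly distributed is not something this sketch checks.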
How can I correct this?
Thanks