I have a list of image names and a (thresholded) similarity matrix for them. The similarity relationship is reflexive and symmetric but not necessarily transitive, i.e. if image_i is similar to image_j and to image_k, it doesn't necessarily mean that image_j and image_k are similar.
For example:
images = ['image_0', 'image_1', 'image_2', 'image_3', 'image_4']
sm = np.array([[1, 1, 1, 0, 1],
               [1, 1, 0, 0, 1],
               [1, 0, 1, 0, 0],
               [0, 0, 0, 1, 0],
               [1, 1, 0, 0, 1]])
The similarity matrix sm is interpreted as follows: if sm[i, j] == 1 then image_i and image_j are similar, otherwise they are not similar. Here we see that image_0 is similar to image_1 and image_2, but image_1 and image_2 are not similar (this is just one example of non-transitivity).
I want to keep the maximum number of unique images (that are all pairwise non-similar according to the given sm matrix). For this example it would be [image_2, image_3, image_4] or [image_1, image_2, image_3] (in general there are multiple such subsets but I don't mind which to keep as long as they are of maximum length). I am looking for an efficient way to do this since I have thousands of images.
Edit: My original solution was the following
np.array(images)[np.tril(sm).sum(0) == 1]
However it's not guaranteed that it will return a maximum-length subset. Consider the following example:
sm = np.array([[1, 1, 0, 0, 0],
               [1, 1, 0, 0, 0],
               [0, 0, 1, 1, 0],
               [0, 0, 1, 1, 1],
               [0, 0, 0, 1, 1]])
This solution will return ['image_1', 'image_4'], whereas the desired result is ['image_0', 'image_2', 'image_4'] or ['image_1', 'image_2', 'image_4'].
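To make the failure concrete, here is a runnable check of the one-liner on both matrices (the names sm1, sm2 and keep_unique are mine, for illustration only):

```python
import numpy as np

images = ['image_0', 'image_1', 'image_2', 'image_3', 'image_4']

sm1 = np.array([[1, 1, 1, 0, 1],
                [1, 1, 0, 0, 1],
                [1, 0, 1, 0, 0],
                [0, 0, 0, 1, 0],
                [1, 1, 0, 0, 1]])

sm2 = np.array([[1, 1, 0, 0, 0],
                [1, 1, 0, 0, 0],
                [0, 0, 1, 1, 0],
                [0, 0, 1, 1, 1],
                [0, 0, 0, 1, 1]])

def keep_unique(sm):
    # keep image_i iff column i of the lower triangle sums to 1 (only the diagonal entry)
    return np.array(images)[np.tril(sm).sum(0) == 1]

print(keep_unique(sm1))  # ['image_2' 'image_3' 'image_4'] -- happens to be maximum here
print(keep_unique(sm2))  # ['image_1' 'image_4'] -- not maximum
```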
Update: Please see my answer which explains the problem in more detail using graph theory. I am still open to suggestions since I haven't found a reasonably fast way to achieve the result for a list of thousands of images.
After researching it a little bit more, I found that this is the so-called maximum independent set problem in graph theory, which is unfortunately NP-hard.
An independent set S of a graph G is a subset of vertices of G, such that no vertices in S are adjacent to one another. In our case, we are looking to find a maximum independent set (MIS), i.e. an independent set with the largest possible number of vertices.
There are a couple of libraries for working with graphs and networks, such as igraph or NetworkX, which have functions for finding maximum independent sets. I ended up using igraph.
For my problem, we can think of the images as vertices of a graph G and the "similarity matrix" as the adjacency matrix:
Unfortunately this is too slow for thousands of images (vertices). So I am still open to suggestions for a faster way to do it (perhaps instead of finding all the MIS, just find one).
Note: the proposed solutions by @Sergey (UPDATE#1) and @marke don't always return a MIS -- they are greedy approximate algorithms which delete a vertex of maximum degree until no edge remains. To demonstrate this, consider the following example:
Both solutions return [3, 5], but for this example there are two maximum independent sets, [(0, 3, 4), (1, 2, 5)], as correctly found by igraph. To see why these solutions fail to find a MIS, below is a gif that shows how the vertices and edges are removed at each iteration (which is a "side effect" of np.argmax returning the first occurrence for multiple occurrences of the maximum value).

Sergey's solution (UPDATE #2) seems to work, however it is much, much slower than igraph's
largest_independent_vertex_sets(). For a speed comparison you can use a randomly generated similarity matrix of length 100.

Update: it turns out that although I have thousands of images (vertices), the number of edges is relatively small (i.e. I have a sparse graph), so using igraph to find a MIS is acceptable in terms of speed. Alternatively, as a compromise, one could use a greedy approximation algorithm for finding a large independent set (or a MIS if lucky enough). Below is an algorithm which seems pretty fast:
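A sketch of such a greedy approximation in plain NumPy (the function name and the exact max-degree-removal loop are my own illustration): repeatedly delete a vertex of maximum degree until no edges remain; whatever survives is an independent set, though not necessarily a maximum one.

```python
import numpy as np

def greedy_independent_set(adj):
    """Greedy approximation: repeatedly remove a max-degree vertex
    until no edges remain; the surviving vertices are pairwise non-adjacent."""
    adj = adj.copy()
    keep = np.ones(len(adj), dtype=bool)
    while adj.sum():                      # while any edge remains
        v = adj.sum(axis=1).argmax()      # vertex of maximum degree
        keep[v] = False
        adj[v, :] = 0                     # remove v and its incident edges
        adj[:, v] = 0
    return np.flatnonzero(keep)

# Speed test on a random symmetric similarity matrix of length 100
rng = np.random.default_rng(0)
n = 100
upper = np.triu(rng.integers(0, 2, (n, n)), k=1)
sm = upper + upper.T + np.eye(n, dtype=int)

adj = sm - np.eye(n, dtype=int)   # adjacency matrix: drop the diagonal self-loops
result = greedy_independent_set(adj)
```

On the first example matrix from the question this returns [2, 3, 4], which is in fact a maximum independent set; on unlucky inputs (such as the counterexample above) it can return a smaller set.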
Or even better, we can get a maximal independent set:
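One way to sketch this (again my own illustration, not a library API): take any independent set, possibly empty, and greedily add every vertex that has no neighbor already in it. The result is guaranteed to be maximal, meaning no further vertex can be added, though it is still not necessarily maximum.

```python
import numpy as np

def extend_to_maximal(adj, indep=()):
    """Grow an independent set until it is maximal: scan all vertices
    and add each one that has no neighbor already in the set."""
    in_set = np.zeros(len(adj), dtype=bool)
    in_set[list(indep)] = True
    for v in range(len(adj)):
        if not in_set[v] and not adj[v][in_set].any():
            in_set[v] = True
    return np.flatnonzero(in_set)

sm = np.array([[1, 1, 1, 0, 1],
               [1, 1, 0, 0, 1],
               [1, 0, 1, 0, 0],
               [0, 0, 0, 1, 0],
               [1, 1, 0, 0, 1]])
adj = sm - np.eye(len(sm), dtype=int)
result = extend_to_maximal(adj)
```

Starting from the empty set this gives [0, 3] for the example above (maximal, but smaller than the maximum); seeding it with [2] gives [1, 2, 3], one of the maximum independent sets.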