Introduction
I would like to assess the similarity between two "bin counts" arrays (one per histogram) using the MATLAB "pdist2" function:
% Input
bin_counts_a = [689 430 311 135 66 67 99 23 37 19 8 4 3 4 1 3 1 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 1 0 0 0 0 1];
bin_counts_b = [569 402 200 166 262 90 50 16 33 12 6 35 49 4 12 8 8 2 1 0 0 0 0 1 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 2 0 0 0 0 0 0 1];
% Visualize the two "bin counts" vectors as bars:
bar(1:length(bin_counts_a),[bin_counts_a;bin_counts_b])
% Calculation of similarities
cosine_similarity = 1 - pdist2(bin_counts_a,bin_counts_b,'cosine')
jaccard_similarity = 1 - pdist2(bin_counts_a,bin_counts_b,'jaccard')
% Output
cosine_similarity =
0.95473215802008
jaccard_similarity =
0.0769230769230769
Question
If the cosine similarity is close to 1, which means the two vectors are similar, shouldn't the jaccard similarity be closer to 1 as well?

The 'jaccard' measure, according to the documentation, only considers the "percentage of nonzero coordinates that differ", but not by how much they differ.

For instance, assume bin_counts_a as in your example and a bin_counts_b in which every entry differs slightly from the corresponding entry of bin_counts_a. Then the cosine similarity is almost 1, as expected, because the bin counts are very similar. However, the Jaccard similarity is 0, because each entry in bin_counts_b is (slightly) different from that in bin_counts_a.

For assessing the similarity between the histograms, 'cosine' is probably a more meaningful option than 'jaccard'. You may also want to consider the Kullback-Leibler divergence, although it is not symmetric in the two distributions, and is not computed by pdist2.
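The contrast between the two measures can be illustrated with a minimal sketch. The perturbation `bin_counts_a + 1` is an assumption chosen for illustration, and the helper names `cosine_sim`, `either_nonzero` and `jaccard_sim` are hypothetical; they are hand-written formulas that follow the documented description of the two distances, so the sketch does not call pdist2 itself.

```matlab
% Bin counts from the question.
bin_counts_a = [689 430 311 135 66 67 99 23 37 19 8 4 3 4 1 3 1 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 1 0 0 0 0 1];

% Hypothetical perturbation (an assumption, chosen only for illustration):
% shift every bin by one count, so that all 50 entries differ slightly.
bin_counts_b = bin_counts_a + 1;

% Cosine similarity: cosine of the angle between the two count vectors.
cosine_sim = @(x, y) dot(x, y) / (norm(x) * norm(y));

% Jaccard similarity as described in the documentation: among the coordinates
% where at least one vector is nonzero, the fraction whose values are equal.
either_nonzero = @(x, y) (x ~= 0) | (y ~= 0);
jaccard_sim = @(x, y) 1 - sum((x ~= y) & either_nonzero(x, y)) / sum(either_nonzero(x, y));

cosine_similarity = cosine_sim(bin_counts_a, bin_counts_b)    % very close to 1
jaccard_similarity = jaccard_sim(bin_counts_a, bin_counts_b)  % exactly 0
```

Applied to the two vectors from the question, the same formula counts 26 coordinates where at least one vector is nonzero, of which only 2 hold equal values, which gives 2/26 ≈ 0.0769 and matches the reported output.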
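Since pdist2 does not compute the Kullback-Leibler divergence, it would have to be written by hand. The following is only a sketch under stated assumptions: the counts are normalized to probabilities, and a small constant `smoothing` (a hypothetical choice, not from the documentation) is added to every bin so that empty bins do not produce log(0) or a division by zero. The names `to_prob` and `kl_div` are likewise illustrative.

```matlab
% Bin counts from the question.
bin_counts_a = [689 430 311 135 66 67 99 23 37 19 8 4 3 4 1 3 1 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 1 0 0 0 0 1];
bin_counts_b = [569 402 200 166 262 90 50 16 33 12 6 35 49 4 12 8 8 2 1 0 0 0 0 1 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 2 0 0 0 0 0 0 1];

% Assumed smoothing constant (hypothetical): keeps every probability positive.
smoothing = 1e-6;

% Convert counts to a probability vector that sums to 1.
to_prob = @(counts) (counts + smoothing) / sum(counts + smoothing);
p = to_prob(bin_counts_a);
q = to_prob(bin_counts_b);

% Kullback-Leibler divergence D(p || q) = sum_i p_i * log(p_i / q_i), in nats.
kl_div = @(p, q) sum(p .* log(p ./ q));

kl_ab = kl_div(p, q)  % divergence of q from p
kl_ba = kl_div(q, p)  % generally a different value: the measure is not symmetric
```

The result depends on the smoothing constant, especially for bins that are empty in one histogram only, so the values are best compared across histogram pairs processed with the same constant.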