I want to brainstorm an idea in MATLAB with you guys. Given a matrix with many columns (14K) and few rows (7), where columns are items and rows are features of the items, I would like to compute the similarity between all pairs of items and keep it in a matrix that is:
- Easy to compute
- Easy to access
For the first point, I came up with a brilliant idea of using pdist(), which is very fast:
A % my matrix
S = pdist(A') % computes the similarity (as pairwise distances) between all columns very fast
However, accessing S is not convenient. I would prefer to access the similarity between items i and j, e.g. using S(i,j):
S(4,5) % is the similarity between item 4 and 5
In its original form, S is a vector, not a matrix. Is storing it as a 2D matrix a bad idea storage-wise? Could we think of a cool idea that helps me quickly find which similarity corresponds to which items?
Thank you.
You can use pdist2(A',A'). What is returned is essentially the distance matrix in its standard form, where element (i,j) is the dissimilarity (or similarity) between the i-th and j-th pattern. Also, if you want to use pdist(), which is ok, you can convert the resulting array into the well-known distance matrix by using the function squareform(). So, in conclusion, if A is your dataset and S the distance matrix, you can use either
S = pdist2(A',A');
or
S = squareform(pdist(A'));
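As a quick sanity check of the two routes (pdist2 directly, or pdist followed by squareform), here is a minimal sketch on a small made-up dataset. The 7-by-5 random matrix is only a stand-in for the real 7-by-14K data, the default Euclidean distance is assumed, and all three functions require the Statistics and Machine Learning Toolbox.

```matlab
% Small stand-in for the real data: 7 features x 5 items (values are made up).
rng(0);
A = rand(7, 5);

% Route 1: condensed vector from pdist, expanded to a full square matrix.
S1 = squareform(pdist(A'));

% Route 2: full square matrix directly from pdist2.
S2 = pdist2(A', A');

% Either way, S(i,j) is the distance between item i and item j.
d45 = S1(4, 5);                       % distance between items 4 and 5

% The two routes should agree up to floating-point rounding.
maxDiff = max(abs(S1(:) - S2(:)));
```

With either result, S(4,5) and S(5,4) hold the same value, and the diagonal is zero because each item is at distance zero from itself.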
Now, regarding the storage point of view, you will certainly notice that such a matrix is symmetric. What MATLAB essentially proposes with the array S in pdist() is to save space: because the matrix is symmetric, you can just as well store half of it in a vector. Indeed, the array S has m(m-1)/2 elements whereas the matrix form has m^2 elements (if m is the number of patterns in your training set). On the other hand, it is most certainly trickier to access such a vector, whereas the matrix is absolutely straightforward.
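If the full matrix turns out to be too large, one option is to keep the condensed vector and compute the position of a pair (i,j) on demand. The sketch below assumes the ordering pdist uses for its output, (2,1), (3,1), ..., (m,1), (3,2), ..., (m,m-1), for which the pair i < j sits at index (i-1)*(m-i/2)+j-i. The helper name condIdx is hypothetical, not a built-in, and the small random matrix is made-up data. The first lines also work out the storage for m = 14000 items in double precision: about 1.57 GB for the full matrix versus about 0.78 GB for the condensed vector.

```matlab
% Storage estimate for m = 14000 items stored as doubles (8 bytes each).
m = 14000;
bytesFull = 8 * m^2;                 % full m-by-m matrix
bytesCond = 8 * m * (m - 1) / 2;     % condensed pdist vector

% Index mapping demo on a small made-up dataset: 7 features x 6 items.
n = 6;
rng(0);
B = rand(7, n);
D = pdist(B');                       % row vector with n*(n-1)/2 entries

% Hypothetical helper: position of the pair (i,j), i ~= j, inside D.
% min/max make the order of i and j irrelevant. For i == j the distance
% is simply 0 and is not stored in D, so do not call this with i == j.
condIdx = @(i, j) (min(i, j) - 1) .* (n - min(i, j) / 2) + max(i, j) - min(i, j);

k   = condIdx(4, 5);                 % where the pair (4,5) lives in D
d45 = D(k);                          % distance between items 4 and 5
```

This trades a little arithmetic per lookup for half the memory. If memory is not a concern, squareform(D) and plain S(i,j) indexing remain the simpler choice.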