Applying LSA on a term-document matrix when the number of documents is very small


I have a term-document matrix (X) of shape (6, 25931). The first 5 documents are my source documents and the last document is my target document. Each column holds the counts of one word from the vocabulary. I want to get the cosine similarity of the last document with each of the other documents.

But SVD produces an S of size (min(6, 25931),) = (6,), so if I use S to reduce my X, I get a 6 × 6 matrix. In this case I feel that I will be losing too much information, since I am reducing a vector of size (25931,) to one of size (6,).

And when you think about it, the number of documents will usually be far smaller than the number of vocabulary words. In that case, using SVD to reduce dimensionality will always produce vectors of size (n_documents,).
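For illustration, here is a minimal sketch of that behaviour with NumPy, using a random count matrix in place of the real X (the shapes are the point, not the values):

import numpy as np

# Random counts standing in for the real (6, 25931) term-document matrix X.
X = np.random.randint(0, 5, size=(6, 25931))

# full_matrices=False gives U: (6, 6), S: (6,), Vt: (6, 25931).
U, S, Vt = np.linalg.svd(X.astype(float), full_matrices=False)
print(U.shape, S.shape, Vt.shape)

# Projecting the documents onto the singular directions (U * S) can never
# give more than min(n_documents, n_terms) = 6 dimensions per document.
X_reduced = U * S
print(X_reduced.shape)  # (6, 6)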

According to everything that I have read, when SVD is used like this on a term-document matrix, it's called LSA.

  1. Am I implementing LSA correctly?
  2. If this is correct, then is there any other way to reduce the dimensionality and get denser vectors where the size of the compressed vector is greater than (6,)?

P.S.: I also tried using fit_transform from sklearn.decomposition.TruncatedSVD, which expects the input to be of shape (n_samples, n_features); that is why the shape of my term-document matrix is (6, 25931) and not (25931, 6). I kept getting a (6, 6) matrix, which initially confused me, but it makes sense now that I remember the math behind SVD.
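A minimal sketch of what TruncatedSVD does here, again with random counts standing in for the real matrix:

import numpy as np
from sklearn.decomposition import TruncatedSVD

# Random counts standing in for the real (6, 25931) term-document matrix.
X = np.random.randint(0, 5, size=(6, 25931))

# The rank of a (6, 25931) matrix is at most 6, so anything beyond 6
# components carries no information; the reduced vectors stay tiny.
svd = TruncatedSVD(n_components=5)
X_reduced = svd.fit_transform(X)
print(X_reduced.shape)  # (6, 5)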

1 Answer

Answered by SidharthMacherla

If the objective of the exercise is to find the cosine similarity, then the following approach can help. This answer only addresses that objective; it does not comment on the definitions of Latent Semantic Analysis or Singular Value Decomposition raised in the question.


Let us first import all the required libraries. Please install them first if they are not already available on the machine.

from sklearn.metrics.pairwise import cosine_similarity
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer

Let us generate some sample data for this exercise.

df = {'sentence': ['one two three','two three four','four five','six seven eight nine ten']}
df = pd.DataFrame(df, columns = ['sentence'])

The first step is to build an exhaustive list of all possible features, so collate all of the content in one place.

all_content = [' '.join(df['sentence'])]

Let us build a vectorizer and fit it now. Please note that the vectorizer arguments are not explained here, since the focus is on solving the problem.

vectorizer = TfidfVectorizer(encoding = 'latin-1',norm = 'l2', min_df = 0.03, ngram_range = (1,2), max_features = 5000)
vectorizer.fit(all_content)

We can inspect the vocabulary to see if it makes sense. If needed, one could add stop words to the vectorizer above and check that they are indeed suppressed.

print(vectorizer.vocabulary_)

Let us vectorize the sentences so we can apply cosine similarity.

# transform expects an iterable of raw documents, so wrap each sentence in a list.
s1Tokens = vectorizer.transform([df['sentence'].iloc[1]])
s2Tokens = vectorizer.transform([df['sentence'].iloc[2]])

Finally, the cosine similarity can be computed as follows.

cosine_similarity(s1Tokens, s2Tokens)
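If the goal is to compare the last document with each of the others, as in the original question, the same vectorizer can be applied to all sentences at once. Here is a sketch on the sample data above (allTokens, targetTokens and sourceTokens are just illustrative names):

# Score the last sentence against every other sentence in the sample data.
allTokens = vectorizer.transform(df['sentence'])
targetTokens = allTokens[-1]
sourceTokens = allTokens[:-1]
print(cosine_similarity(targetTokens, sourceTokens))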