How can we calculate cosine similarity for each combination of Word documents in two folders?


I'm trying to figure out a way to loop through resumes in a folder, then loop through job descriptions in another folder, and find the cosine similarity of each resume to each job description. Here's the code that I'm testing.

import os
import docx2txt
import warnings
warnings.filterwarnings('ignore')

# all resumes     
ext = ('.docx')
resume_path = 'C:\\Users\\Cosine Similarity\\resumes\\'

resumes = []
# load the data
for files in os.listdir(resume_path):
    if files.endswith(ext):
        resumes.append(files)


# all job descriptions
ext = ('.docx')
job_path = 'C:\\Users\\Cosine Similarity\\job_descriptions\\'
 
jobs = []      
# load the data
for files in os.listdir(job_path):
    if files.endswith(ext):
        jobs.append(files)  

print(resumes)
print(jobs)


from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
import itertools


documents = resumes + jobs
#print(documents)

#indexes = [index for index in range(len(documents))]


# Vectorize the documents
vectorizer = TfidfVectorizer()
vectors = vectorizer.fit_transform(documents)

# Get all combinations of documents
combinations = list(itertools.combinations(range(len(documents)), 2))

# Calculate cosine similarity for each pair of documents
similarities = []
for i, j in combinations:
    sim = cosine_similarity(vectors[i], vectors[j])[0][0]
    similarities.append((i, j, sim))

# Print the results
for i, j, sim in similarities:
    print(f"Similarity between document {i+1} and {j+1}: {sim:.2f}")

This kind of works, but not really. Here's my output.

Similarity between document 1 and 2: 0.14
Similarity between document 1 and 3: 0.20
Similarity between document 1 and 4: 0.34
Similarity between document 1 and 5: 0.41
Similarity between document 1 and 6: 0.25
Similarity between document 1 and 7: 0.19
Similarity between document 1 and 8: 0.52
etc., etc., etc.

The problem is that I don't know how to map the 1, 2, 3, 4, 5, 6, 7, and 8 to the name of the docx file that it represents.

This is resumes: ['aaron.docx', 'dennis.docx', 'ryan.docx', 'tim.docx', 'tom.docx']

This is jobs: ['job_description1.docx', 'job_description2.docx', 'job_description3.docx']
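As a minimal sketch of the index-to-name mapping, assuming the `resumes` and `jobs` lists are exactly the ones shown above: because `documents = resumes + jobs`, an index `i` refers to `documents[i]`, so the file names can be looked up directly instead of printing `i+1` and `j+1`. The name `named_pairs` is introduced here only for illustration.

```python
import itertools

resumes = ['aaron.docx', 'dennis.docx', 'ryan.docx', 'tim.docx', 'tom.docx']
jobs = ['job_description1.docx', 'job_description2.docx', 'job_description3.docx']
documents = resumes + jobs

# Index i in range(len(documents)) maps back to a file name via documents[i].
# named_pairs is a hypothetical helper list, not part of the original code.
named_pairs = [
    (documents[i], documents[j])
    for i, j in itertools.combinations(range(len(documents)), 2)
]

print(len(named_pairs))  # number of unordered pairs over all 8 documents
for first, second in named_pairs[:3]:
    print(f"{first} vs {second}")
```

Printing the names this way also makes it visible which pairs `itertools.combinations` actually produces.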

Finally, I'm trying to compare 'aaron.docx' to 'job_description1.docx', 'job_description2.docx', and 'job_description3.docx', then 'dennis.docx' to the same three job descriptions, and so on for each resume. I think the code is comparing aaron to dennis and aaron to ryan, which isn't what I want.
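One possible way to get only resume-to-job comparisons is sketched below; it is an illustration under stated assumptions, not a definitive fix. Two things are worth noting about the code in the question: `resumes` and `jobs` hold file names, so `TfidfVectorizer` is fitted on the names rather than on the document text, and `itertools.combinations` over the combined list also pairs resumes with other resumes. The sketch assumes the text is read with `docx2txt.process` (the import is deferred so the example runs without the package), and it uses short placeholder strings in place of real resume and job text. The function names `load_texts` and `resume_job_similarities` are hypothetical.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity


def load_texts(folder, names):
    """Read the text of each .docx file in `folder` (hypothetical helper)."""
    import os
    import docx2txt  # imported lazily; only needed when reading real files
    return [docx2txt.process(os.path.join(folder, name)) for name in names]


def resume_job_similarities(resume_names, resume_texts, job_names, job_texts):
    """Return (resume_name, job_name, similarity) for every resume-job pair."""
    # Fit one vocabulary over all texts so both groups share the same columns.
    vectorizer = TfidfVectorizer()
    vectors = vectorizer.fit_transform(resume_texts + job_texts)
    resume_vectors = vectors[:len(resume_texts)]
    job_vectors = vectors[len(resume_texts):]

    # Rows follow the order of the resumes, columns the order of the jobs.
    matrix = cosine_similarity(resume_vectors, job_vectors)
    return [
        (resume, job, float(matrix[ri, ji]))
        for ri, resume in enumerate(resume_names)
        for ji, job in enumerate(job_names)
    ]


# Placeholder texts standing in for docx2txt output (illustrative only).
resume_names = ['aaron.docx', 'dennis.docx']
resume_texts = ['python developer with sql experience',
                'registered nurse with hospital experience']
job_names = ['job_description1.docx', 'job_description2.docx']
job_texts = ['seeking python developer', 'seeking hospital nurse']

results = resume_job_similarities(resume_names, resume_texts, job_names, job_texts)
for resume, job, sim in results:
    print(f"Similarity between {resume} and {job}: {sim:.2f}")
```

With real files, the placeholder lists would be replaced by something like `load_texts(resume_path, resumes)` and `load_texts(job_path, jobs)`. Calling `cosine_similarity` once on the two groups of vectors yields a resumes-by-jobs matrix, which avoids computing resume-to-resume and job-to-job scores altogether.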
