I am new to NLP and hence not very clear on how to use it for my use case. My aim is to use NLP to get an idea of how frequently a word or sentence from Dataset 'A' occurs in Dataset 'B'. It will not necessarily be an exact match but rather a similar one.
This is what the mock up data looks like:
Dataset A

```python
import pandas as pd

data = {'Name': ['Tom has a daughter', 'Joseph likes to fish', 'Krish is a new student/employee', 'John! What are you doing?'],
        'City': ['London', 'Bristol', 'Leeds', 'London']}
df1 = pd.DataFrame(data)
print(df1)
```
Dataset B

```python
data1 = {'Name': ['Krish is a new student/employee', 'The sky is blue', 'We are all humans', 'Tom has a daughter'],
         'City': ['Leeds', 'Bristol', 'Leeds', 'London']}
df2 = pd.DataFrame(data1)
print(df2)
```
I would like to know how often the sentence 'Tom has a daughter', or a sentence containing the words 'Tom' or 'daughter', occurs in Dataset B.
My initial idea was to apply NLP preprocessing techniques (stopword removal, lowercasing, punctuation removal, sentence tokenization) to Dataset A using nltk, followed by a Bag of Words matrix representation.
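Something along these lines is what I had in mind so far (just a sketch with nltk; the `preprocess` helper is my own illustrative function, not validated end to end):

```python
import string
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

nltk.download('punkt')
nltk.download('stopwords')

stop_words = set(stopwords.words('english'))

def preprocess(text):
    # Lowercase, strip punctuation, tokenize, and drop stopwords
    text = text.lower().translate(str.maketrans('', '', string.punctuation))
    return [w for w in word_tokenize(text) if w not in stop_words]

df1['tokens'] = df1['Name'].apply(preprocess)
print(df1[['Name', 'tokens']])
```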
I'm not able to work out how I can relate this to Dataset B. And is there a way to factor in the 'City' parameter?
I came across cosine and Jaccard similarity, but I'm not sure how to apply them to Dataset B. Is there a way to see a pictorial representation of the results rather than just a statistical number?
Thanks in advance. I understand the question is vague, but that's because I am still trying to put it all together myself. Any hints would be great!

You may use `CountVectorizer.fit_transform()` on Dataset A, and apply it to Dataset B with `CountVectorizer.transform()`. This way you'll get a vector representing each element in Dataset B based on Dataset A's corpus. Notice that today we tend to work with embeddings to find similar semantic meanings, rather than with BOW.
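A minimal sketch of that idea, extended with cosine similarity and a heatmap for the pictorial view you asked about (the seaborn/matplotlib part is just one possible way to visualize it, and it reuses `df1`/`df2` from the question):

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity
import matplotlib.pyplot as plt
import seaborn as sns

# Learn the vocabulary from Dataset A only
vectorizer = CountVectorizer(stop_words='english', lowercase=True)
A = vectorizer.fit_transform(df1['Name'])

# Represent Dataset B in Dataset A's vocabulary
B = vectorizer.transform(df2['Name'])

# Pairwise cosine similarity: rows = sentences in A, columns = sentences in B
sim = cosine_similarity(A, B)

# Pictorial representation instead of a single statistical number
sns.heatmap(sim, annot=True, xticklabels=df2['Name'], yticklabels=df1['Name'])
plt.tight_layout()
plt.show()
```

If you want to factor in 'City', one option is to filter before transforming, e.g. `vectorizer.transform(df2[df2['City'] == 'London']['Name'])`, so similarities are only computed within a city.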