I am new to NLP and hence not very clear on how to use it for my use case. My aim is to use NLP to get an idea of how frequently a word or sentence from Dataset 'A' occurs in Dataset 'B'. It will not necessarily be an exact match but rather a similar one.
This is what the mock up data looks like:
Dataset A

```python
import pandas as pd

data = {'Name': ['Tom has a daughter', 'Joseph likes to fish', 'Krish is a new student/employee', 'John! What are you doing?'],
        'City': ['London', 'Bristol', 'Leeds', 'London']}
df1 = pd.DataFrame(data)
print(df1)
```
Dataset B

```python
data1 = {'Name': ['Krish is a new student/employee', 'The sky is blue', 'We are all humans', 'Tom has a daughter'],
         'City': ['Leeds', 'Bristol', 'Leeds', 'London']}
df2 = pd.DataFrame(data1)
print(df2)
```
I would like to know how often the sentence 'Tom has a daughter', or a sentence containing the words 'Tom' or 'daughter', occurs in Dataset B.
My initial idea was to apply NLP preprocessing techniques (stopword removal, lowercasing, punctuation removal, sentence tokenization) to Dataset A using nltk, followed by a Bag of Words matrix representation.
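Something along these lines is what I had in mind so far (just a sketch with nltk; the `preprocess` helper is my own illustrative function, not validated end to end):

```python
import string
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

nltk.download('punkt')
nltk.download('stopwords')

stop_words = set(stopwords.words('english'))

def preprocess(text):
    # Lowercase, strip punctuation, tokenize, and drop stopwords
    text = text.lower().translate(str.maketrans('', '', string.punctuation))
    return [w for w in word_tokenize(text) if w not in stop_words]

df1['tokens'] = df1['Name'].apply(preprocess)
print(df1[['Name', 'tokens']])
```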
I'm not able to work out how I can relate this to Dataset B. And is there a way to factor in the 'City' parameter?
I came across cosine and Jaccard similarity, but I'm not sure how to apply them to Dataset B. Is there a way to see a pictorial representation of the results rather than just a statistical number?
Thanks in advance. I understand the question is vague, but that's because I am still trying to put it all together myself. Any hints would be great!

You may use `CountVectorizer.fit_transform()` on Dataset A, and apply it to Dataset B with `CountVectorizer.transform()`. This way you'll get a vector representing each element in Dataset B based on Dataset A's corpus. Notice that today we tend to work with embeddings to find similar semantic meanings, rather than with BOW.
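A minimal sketch of that idea, extended with cosine similarity and a heatmap for the pictorial view you asked about (the seaborn/matplotlib part is just one possible way to visualize it, and it reuses `df1`/`df2` from the question):

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity
import matplotlib.pyplot as plt
import seaborn as sns

# Learn the vocabulary from Dataset A only
vectorizer = CountVectorizer(stop_words='english', lowercase=True)
A = vectorizer.fit_transform(df1['Name'])

# Represent Dataset B in Dataset A's vocabulary
B = vectorizer.transform(df2['Name'])

# Pairwise cosine similarity: rows = sentences in A, columns = sentences in B
sim = cosine_similarity(A, B)

# Pictorial representation instead of a single statistical number
sns.heatmap(sim, annot=True, xticklabels=df2['Name'], yticklabels=df1['Name'])
plt.tight_layout()
plt.show()
```

If you want to factor in 'City', one option is to filter before transforming, e.g. `vectorizer.transform(df2[df2['City'] == 'London']['Name'])`, so similarities are only computed within a city.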