How to get the documents with n-gram repetitions in their summaries


I am using the XSum dataset for abstractive summarization. Some summaries share common n-grams with other summaries. I need to get all the articles whose summaries contain these common n-grams.

For example, if I have the following articles and their corresponding summaries:

Article      Summary

article1     x a a b d m
article2     x a b d c e m
article3     y z c f a b d c e q u
article4     m g a a b d v r a
article5     r a e q u d x

If I want all documents whose summaries share a common n-gram of length 4 or more, then the output should be:

Articles              Common n-gram
article1, article4 :  a a b d
article2, article3 :  a b d c e
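
To make the expected behaviour concrete, here is a minimal Python sketch that reproduces the table above. It is only an illustration under assumptions that are not part of the question: summaries are split on whitespace, an n-gram counts as "common" when it occurs in at least two different summaries, and a shorter n-gram is dropped when a longer one covers it for exactly the same articles. The names `ngrams` and `shared_ngrams` are made up for this example.

```python
from collections import defaultdict

def ngrams(tokens, n):
    """Yield every contiguous n-token window of `tokens` as a tuple."""
    for i in range(len(tokens) - n + 1):
        yield tuple(tokens[i:i + n])

def shared_ngrams(summaries, min_n=4):
    """Map each longest shared n-gram (length >= min_n) to the set of articles containing it."""
    # Assumption: plain whitespace tokenization is good enough here.
    tokenized = {doc: text.split() for doc, text in summaries.items()}
    shared = {}
    n = min_n
    while True:
        docs_by_gram = defaultdict(set)
        for doc, tokens in tokenized.items():
            for gram in ngrams(tokens, n):
                docs_by_gram[gram].add(doc)
        found = {g: d for g, d in docs_by_gram.items() if len(d) > 1}
        if not found:
            break  # nothing shared at length n, so nothing shared at n + 1 either
        # Drop a shorter n-gram when a longer one covers it for the same articles.
        for gram, docs in found.items():
            for sub in (gram[:-1], gram[1:]):
                if shared.get(sub) == docs:
                    del shared[sub]
        shared.update(found)
        n += 1
    return shared

summaries = {
    "article1": "x a a b d m",
    "article2": "x a b d c e m",
    "article3": "y z c f a b d c e q u",
    "article4": "m g a a b d v r a",
    "article5": "r a e q u d x",
}
result = shared_ngrams(summaries, min_n=4)
for gram, docs in result.items():
    print(", ".join(sorted(docs)), ":", " ".join(gram))
```

On the five example summaries this yields `a a b d` for article1 and article4, and `a b d c e` for article2 and article3. Each pass only touches every token once per n-gram length, so it should be feasible for 200k short summaries, although I have not measured it.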

I have a dataset containing 200k articles and corresponding summaries.

What I have tried:

I tried using Lucene to

  1. Index the documents
  2. Form the n-grams of the summaries

But I don't know Java, and it is difficult to figure out how to get the documents that share common n-grams.

Help

Can someone please guide me on how this can be done in Python? Or, if Lucene is the way to go, could someone point me in the right direction? I have gone through the Lucene tutorials, but I didn't find anything that addresses my specific need, and I was only left more confused.

EDIT (image not shown)

I got this from a YouTube video. My idea is that, instead of the analyzer breaking the text into individual tokens, it could break it into n-grams. Then my inverted index would hold the n-grams, their frequencies, and the documents they show up in.
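
The inverted index described above can be sketched in plain Python without Lucene. This is a rough illustration, not a Lucene equivalent: it assumes whitespace tokenization and a fixed n-gram length, and `build_ngram_index` is a hypothetical helper name. Each n-gram maps to a `Counter` of document ids, so both the frequency and the list of documents are available.

```python
from collections import Counter, defaultdict

def build_ngram_index(summaries, n=4):
    """Return {ngram: Counter({doc_id: occurrences})} for word n-grams of length n."""
    index = defaultdict(Counter)
    for doc_id, text in summaries.items():
        tokens = text.split()  # assumption: whitespace tokenization
        for i in range(len(tokens) - n + 1):
            index[tuple(tokens[i:i + n])][doc_id] += 1
    return index

summaries = {
    "article1": "x a a b d m",
    "article2": "x a b d c e m",
    "article3": "y z c f a b d c e q u",
    "article4": "m g a a b d v r a",
    "article5": "r a e q u d x",
}
index = build_ngram_index(summaries, n=4)

# Keep only the n-grams that occur in more than one document.
common = {gram: postings for gram, postings in index.items() if len(postings) > 1}
for gram, postings in common.items():
    total = sum(postings.values())  # total occurrences across all documents
    print(" ".join(gram), "| frequency:", total, "| documents:", sorted(postings))
```

With a fixed length of 4, the longer run `a b d c e` shows up as two overlapping entries (`a b d c` and `b d c e`), so overlapping n-grams with the same document list would still need to be merged afterwards.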

Thank you.
