How to get the documents with n-gram repetitions in their summaries

230 Views Asked by Kiera.K At 31 October 2021 at 08:19

I am using the XSum dataset for abstractive summarization. Some of the summaries share common n-grams, and I need to get all the articles whose summaries contain these common n-grams.

For example, if I have the following articles and their corresponding summaries:

Article      Summary

article1     x a a b d m
article2     x a b d c e m
article3     y z c f a b d c e q u
article4     m g a a b d v r a
article5     r a e q u d x

If I want all documents whose summaries share an n-gram of length 4 or more, then the output should be:

Articles             Common n-gram
article1, article4 : a a b d
article2, article3 : a b d c e
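
To make the expected output concrete, here is a brute-force sketch that reproduces it for the five toy summaries above. It compares every pair of summaries with `difflib.SequenceMatcher` from the standard library and keeps the longest shared run of tokens when it has at least 4 tokens. The function names `longest_common_run` and `shared_ngrams` and the parameter `min_n` are made up for this illustration. It reports only the single longest run per pair, and it is quadratic in the number of documents, so it is a way to check the specification on small samples rather than something to run on 200k summaries.

```python
from difflib import SequenceMatcher
from itertools import combinations

# Toy data copied from the example above.
summaries = {
    "article1": "x a a b d m",
    "article2": "x a b d c e m",
    "article3": "y z c f a b d c e q u",
    "article4": "m g a a b d v r a",
    "article5": "r a e q u d x",
}

def longest_common_run(tokens_a, tokens_b):
    """Return the longest contiguous token sequence found in both lists."""
    # autojunk=False disables difflib's heuristic that ignores very frequent items.
    matcher = SequenceMatcher(None, tokens_a, tokens_b, autojunk=False)
    match = matcher.find_longest_match(0, len(tokens_a), 0, len(tokens_b))
    return tokens_a[match.a:match.a + match.size]

def shared_ngrams(summaries, min_n=4):
    """Map each pair of documents to their longest shared run of >= min_n tokens."""
    tokens = {doc: text.split() for doc, text in summaries.items()}
    results = {}
    for doc_a, doc_b in combinations(sorted(tokens), 2):
        run = longest_common_run(tokens[doc_a], tokens[doc_b])
        if len(run) >= min_n:
            results[(doc_a, doc_b)] = " ".join(run)
    return results

for pair, gram in shared_ngrams(summaries).items():
    print(", ".join(pair), ":", gram)
```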

I have a dataset containing 200k articles and corresponding summaries.

What I have tried:

I tried using Lucene to index the documents and form the n-grams of the summaries.

But I don't know Java, and it is difficult to figure out how to get the documents that share the common n-grams.

Help

Can someone please guide me on how this can be done in Python? Or, if Lucene is the better tool, could someone please point me in the right direction? I have gone through the Lucene tutorials, but I didn't find anything that helps with my specific need, and I was only left more confused.

EDIT

I got the following idea from a YouTube video: instead of the analyzer breaking the text into individual tokens, what if it breaks the text into n-grams? Then my inverted index would hold the n-grams, their frequencies, and the documents they show up in.
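
The inverted-index idea can be sketched in plain Python without Lucene. This is a minimal sketch under some assumptions: summaries are tokenized by whitespace, n is fixed at 4, and the names `ngrams`, `build_ngram_index`, and `shared_grams` are hypothetical helpers invented for this example. Each n-gram maps to a `Counter` of document ids, so the index records both which summaries contain the n-gram and how often.

```python
from collections import Counter, defaultdict

# Toy data copied from the example in the question.
summaries = {
    "article1": "x a a b d m",
    "article2": "x a b d c e m",
    "article3": "y z c f a b d c e q u",
    "article4": "m g a a b d v r a",
    "article5": "r a e q u d x",
}

def ngrams(tokens, n):
    """Return all contiguous n-token windows as tuples (empty if too short)."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def build_ngram_index(summaries, n=4):
    """Inverted index: n-gram -> Counter mapping document id to frequency."""
    index = defaultdict(Counter)
    for doc_id, text in summaries.items():
        for gram in ngrams(text.split(), n):
            index[gram][doc_id] += 1
    return index

def shared_grams(index, min_docs=2):
    """Keep only the n-grams that occur in at least min_docs documents."""
    return {
        gram: sorted(postings)
        for gram, postings in index.items()
        if len(postings) >= min_docs
    }

index = build_ngram_index(summaries, n=4)
for gram, docs in shared_grams(index).items():
    print(", ".join(docs), ":", " ".join(gram))
```

On the toy data, the longer shared sequence "a b d c e" shows up as two overlapping 4-grams ("a b d c" and "b d c e") for article2 and article3. Merging overlapping n-grams that have the same document list into one maximal n-gram would be a separate step. The index is built in a single pass over the summaries, so this approach should scale to 200k summaries far better than comparing every pair.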

Thank you.
