Speed up PyTextRank for summarizing a document


I need to summarize documents with spaCy and pytextrank. What is the best approach to make it faster without increasing the resources of the machine?
I was thinking of parallelizing the computation using concurrent.futures: split the document into chunks, then apply TextRank to each chunk. I know that this way TextRank would evaluate each chunk independently, but I don't see that as a problem if the chunks are sufficiently long.
Does anyone have any better ideas?

1 Answer

Note that pytextrank is a pipeline component in spaCy, so any parallel processing needs to take into account how spaCy runs and how it is architected. Notably, there is one doc per large-ish "chunk" of text (i.e., per source document), and it probably does not make sense to parallelize by reusing the doc objects. Instead, focus on reusing the nlp object and parallelizing by running several doc pipelines concurrently. That's how other projects have handled the kind of situation you're describing.

As one of the committers on pytextrank: yes, in fact we have been looking at ways to leverage concurrent.futures in Python to help parallelize internally within the library. Also, we had a side project for a customer where we used similar Python concurrency through Ray, although the built-in asyncio in later versions of the language provides most of what we needed.

To be candid, there are probably better ways to summarize text using language models, though the extractive approach in pytextrank is unsupervised and fast. We had not been prioritizing much development for summarization features; however, there seems to be lots of interest.

What would help is to know where the resources get bottlenecked in your use case. In other words, is multi-core utilization low, or is the application I/O-bound? Then we can prioritize how to leverage language features for concurrency.
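One rough way to answer that question yourself is to compare CPU time against wall-clock time for a single summarization run. The sketch below uses only the standard library; `cpu_wall_ratio` is a hypothetical helper name, and `work` stands for whatever zero-argument callable runs your pipeline.

```python
import time
from typing import Callable, Tuple


def cpu_wall_ratio(work: Callable[[], object]) -> Tuple[float, float, float]:
    """Return (wall_seconds, cpu_seconds, cpu/wall) for one call to work()."""
    wall_start = time.perf_counter()
    cpu_start = time.process_time()  # CPU time of this process only
    work()
    cpu = time.process_time() - cpu_start
    wall = time.perf_counter() - wall_start
    ratio = cpu / wall if wall > 0 else 0.0
    return wall, cpu, ratio
```

A ratio close to 1.0 suggests the run keeps one core busy and is CPU-bound, so spreading documents across processes may help. A ratio well below 1.0 suggests the process spends most of its time waiting, for example on I/O, and more workers would mostly add memory pressure. Note that `time.process_time()` does not include time spent in child processes.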