Let's say I have a really long string consisting of 10^6 tokens (for simplicity, a token is a space-separated word, so this string is split into a list of tokens)
Now I need to find all possible duplicated sequences and the start locations of each duplication in the string. For example:
(The brackets are not really in the string, they are only there to clarify the locations)
this[0] string[1] is[2] test[3] to[4] check[5] duplication[6]
test[7] to[8] check[9] duplication[10] this[11] string[12]
==> at 0,11 - 2 tokens duplication
==> at 3,7 - 4 tokens duplication
I've tried to build a Python program with an algorithm based on a dictionary that keeps a list of each token's indexes and checks token matches starting from those indexes. That is far too slow, even when I used NumPy instead of a list.
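For reference, the dictionary-based attempt looked roughly like this (a minimal sketch, not my exact code):

```python
from collections import defaultdict

def find_duplications_naive(tokens):
    # Keep a list of every index at which each token occurs.
    token_to_indexes = defaultdict(list)
    for i, tok in enumerate(tokens):
        token_to_indexes[tok].append(i)

    # For each position, compare against every earlier occurrence of
    # the same token and extend the match token by token.
    for i, tok in enumerate(tokens):
        for j in token_to_indexes[tok]:
            if j >= i:
                break
            length = 0
            while i + length < len(tokens) and tokens[j + length] == tokens[i + length]:
                length += 1
            if length >= 2:
                print(f"at {j},{i} - {length} tokens duplication")
```

On repetitive input this does a quadratic amount of work, which with 10^6 tokens is why it was far too slow.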
Then I tried to use a suffix tree. But all the methods tend to use letters rather than words. When I think of converting this algorithm to use tokens instead of letters, it could work if I had many small strings. The problem is that I have one huge string, so it creates one long tree.
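What I imagined for the conversion was mapping each distinct token to an integer ID, so letter-oriented suffix-tree code could run on the ID sequence instead (hypothetical sketch):

```python
def tokens_to_ids(tokens):
    # Replace each distinct token with a small integer, turning the word
    # sequence into an integer "alphabet" a suffix structure can consume.
    ids = {}
    return [ids.setdefault(tok, len(ids)) for tok in tokens]
```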
All the answers on Stack Overflow and all over the internet don't seem to consider one long string. Any ideas for an algorithm with the best CPU performance? (RAM usage is less important.) Thanks
You wish to identify repeated bi-grams.
Optionally construct a dictionary for converting `str` to `int`, if desired.

Iterate over the document, generating a bi-gram for the current position, then advance to the next position. Store these in a `bigram_to_index_list` dict in memory, or perhaps in an out-of-core file or database table. A `defaultdict(list)` will prove convenient for the in-memory solution.

Now iterate over all entries where we have multiple index positions for a given bi-gram. Probe the original string to see if we can extend to a tri-gram or greater, and output such results.
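A minimal in-memory sketch of that procedure, assuming the document is already split into a `tokens` list (the optional `str`-to-`int` mapping is omitted here):

```python
from collections import defaultdict
from itertools import combinations

def report_duplications(tokens):
    # Index every bi-gram by the positions at which it starts.
    bigram_to_index_list = defaultdict(list)
    for i in range(len(tokens) - 1):
        bigram_to_index_list[tokens[i], tokens[i + 1]].append(i)

    # For each bi-gram seen more than once, probe the token list to
    # extend each pair of occurrences to a tri-gram or greater.
    for indexes in bigram_to_index_list.values():
        for a, b in combinations(indexes, 2):
            length = 2
            while b + length < len(tokens) and tokens[a + length] == tokens[b + length]:
                length += 1
            print(f"at {a},{b} - {length} tokens duplication")

report_duplications("this string is test to check duplication "
                    "test to check duplication this string".split())
```

Running this on the example above prints the 0,11 and 3,7 matches, but it also reports the interior bi-grams of the longer repeat (e.g. at 4,8); suppressing such non-maximal matches is a refinement left to the reader.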