Algorithm to find all duplicate sequences of tokens in a long string


Let's say I have a really long string consisting of 10^6 tokens (for simplicity, a token is a space-separated word, so the string is split into a list of tokens).

Now I need to find all duplicated sequences and the start location of each duplication in the string. For example:

(The brackets are not really in the string; they are only there to clarify the locations.)

this[0] string[1] is[2] test[3] to[4] check[5] duplication[6]
test[7] to[8] check[9] duplication[10] this[11] string[12]

==> at 0,11 - 2 tokens duplication
==> at 3,7 - 4 tokens duplication

I've tried to build a Python program with an algorithm based on a dictionary that keeps a list of indexes for each token and checks for token matches starting from those indexes. That was far too slow, even when I used NumPy instead of lists.

Then I tried to use a suffix tree, but all the implementations tend to work on letters rather than words. Converting the algorithm to use tokens instead of letters could work if I had many small strings; the problem is that I have one huge string, so it creates one very long tree.

All the answers on Stack Overflow and around the internet don't consider a single long string. Any ideas for the best CPU-performance algorithm? (RAM usage is less important.) Thanks


2 Answers

Answer by J_H

You wish to identify repeated bi-grams.

Optionally, construct a dictionary for converting each str token to an int.

Iterate over the document, generating a bi-gram for the current position, then advance to the next position. Store these in a bigram_to_index_list dict in memory, or perhaps in an out-of-core file or database table. A defaultdict(list) will prove convenient for the in-memory solution.
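
A minimal sketch of that indexing pass, assuming the in-memory defaultdict(list) variant and a pre-split tokens list (bigram_to_index_list is the name from the step above; everything else is illustrative):

    from collections import defaultdict

    # Assumed: the document is already split into a list of tokens.
    tokens = ("this string is test to check duplication "
              "test to check duplication this string").split()

    # Map each bi-gram to the list of positions where it starts.
    bigram_to_index_list = defaultdict(list)
    for i in range(len(tokens) - 1):
        bigram_to_index_list[(tokens[i], tokens[i + 1])].append(i)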

Now iterate over all entries where we have multiple index positions for a given bigram. Probe the original string to see if we can extend the match to a tri-gram or greater, and output such results.
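
Continuing the sketch above, a naive version of that extension step (it compares only the first two occurrences of each bi-gram, and it also reports sub-runs of longer matches, so real output would still need de-duplication):

    # For each bi-gram that occurs more than once, probe the original token
    # list to see how far the two occurrences keep matching.
    for bigram, positions in bigram_to_index_list.items():
        if len(positions) < 2:
            continue
        a, b = positions[0], positions[1]  # extend only the first pair here
        length = 2
        while b + length < len(tokens) and tokens[a + length] == tokens[b + length]:
            length += 1
        print(f"at {a},{b} - {length} tokens duplication")

On the question's example this prints the expected "at 0,11 - 2 tokens duplication" and "at 3,7 - 4 tokens duplication", plus sub-runs of the latter such as "at 4,8 - 3 tokens duplication".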

Answer by SudoKoach

@Izik Since I'm a new contributor, I can't add a comment! As suggested in @J_H's answer, the only alternative that drastically reduces the search time is a hashing technique. Here is a snippet coded in Java that works for a short token list. Maybe there's an equivalent of the HashMap class in Python.

    import java.util.ArrayList;
    import java.util.HashMap;
    import java.util.List;

    public class DuplicateLists {

        static String[] tokens = new String[]{"this", "string", "is", "test", "to",
                "check", "duplication", "test", "to", "check", "duplication", "this",
                "string", "this", "string", "is", "test", "to", "check", "duplication",
                "test", "to", "check"};
        static List<tp> tp_list = new ArrayList<>();
        static HashMap<String, Integer> token_map = new HashMap<>();

        // One entry per distinct token: the token text and every position it occurs at.
        static class tp {

            String t;
            List<Integer> poslist;

            tp(String tok, int pos) {
                this.t = tok;
                this.poslist = new ArrayList<>();
                this.poslist.add(pos);
            }
        }

        static void createDuplicateLists() {
            tp_list.add(new tp(tokens[0], 0));
            int i = 1;
            int j = 0;  // index of the most recently added entry in tp_list
            token_map.clear();
            token_map.put(tokens[0], j);
            while (i < tokens.length) {
                String tok = tokens[i];
                if (token_map.containsKey(tok)) {
                    // Token seen before: append this position to its list.
                    tp tkp = tp_list.get(token_map.get(tok));
                    tkp.poslist.add(i);
                } else {
                    // New token: create an entry and record where it lives in tp_list.
                    tp_list.add(new tp(tok, i));
                    j++;
                    token_map.put(tok, j);
                }
                i++;
            }
        }

        static void printLists() {
            for (tp entry : tp_list) {
                System.out.println(entry.t + ":" + entry.poslist);
            }
        }

        public static void main(String[] args) {
            createDuplicateLists();
            printLists();
        }
    }

    /*
    Printed Lists:

    this:[0, 11, 13]
    string:[1, 12, 14]
    is:[2, 15]
    test:[3, 7, 16, 20]
    to:[4, 8, 17, 21]
    check:[5, 9, 18, 22]
    duplication:[6, 10, 19]
    */
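
To the question about a Python equivalent: the built-in dict (or collections.defaultdict) plays the role of HashMap, so a sketch of the same createDuplicateLists logic might look like this (all names are illustrative):

    from collections import defaultdict

    tokens = ["this", "string", "is", "test", "to", "check", "duplication",
              "test", "to", "check", "duplication", "this", "string", "this",
              "string", "is", "test", "to", "check", "duplication", "test",
              "to", "check"]

    # token -> list of every position where it occurs (the tp/poslist role above).
    poslists = defaultdict(list)
    for i, tok in enumerate(tokens):
        poslists[tok].append(i)

    for tok, positions in poslists.items():
        print(f"{tok}:{positions}")

This prints the same lists as the Java snippet, in the same insertion order.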