Algorithm to find all duplicate sequences of tokens in a long string


Let's say I have a really long string consisting of 10^6 tokens (for simplicity, a token is a space-separated word, so the string is split into a list of tokens).

Now I need to find all duplicated sequences and the start location of each duplication in the string. For example:

(The brackets are not really in the string; they are only there to clarify the locations.)

this[0] string[1] is[2] test[3] to[4] check[5] duplication[6]
test[7] to[8] check[9] duplication[10] this[11] string[12]

==> at 0,11 - 2 tokens duplication
==> at 3,7 - 4 tokens duplication

I've tried to build a Python program with an algorithm based on a dictionary that keeps a list of indexes for each token and checks for token matches starting from those indexes. That was far too slow, even when I used NumPy instead of lists.

Then I tried to use a suffix tree, but all the implementations tend to work on letters rather than words. Converting the algorithm to use tokens instead of letters could work if I had many small strings; the problem is that I have one huge string, so it creates one very long tree.

All the answers on Stack Overflow and around the internet don't consider a single long string. Any ideas for the best CPU-performance algorithm? (RAM usage is less important.) Thanks


2 Answers

Answer by J_H

You wish to identify repeated bi-grams.

Optionally, construct a dictionary for converting each str token to an int.

Iterate over the document, generating a bi-gram for the current position, then advance to the next position. Store these in a bigram_to_index_list dict in memory, or perhaps in an out-of-core file or database table. A defaultdict(list) will prove convenient for the in-memory solution.
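
A minimal sketch of that indexing pass, assuming the in-memory defaultdict(list) variant and a pre-split tokens list (bigram_to_index_list is the name from the step above; everything else is illustrative):

    from collections import defaultdict

    # Assumed: the document is already split into a list of tokens.
    tokens = ("this string is test to check duplication "
              "test to check duplication this string").split()

    # Map each bi-gram to the list of positions where it starts.
    bigram_to_index_list = defaultdict(list)
    for i in range(len(tokens) - 1):
        bigram_to_index_list[(tokens[i], tokens[i + 1])].append(i)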

Now iterate over all entries where we have multiple index positions for a given bigram. Probe the original string to see if we can extend the match to a tri-gram or greater, and output such results.
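
Continuing the sketch above, a naive version of that extension step (it compares only the first two occurrences of each bi-gram, and it also reports sub-runs of longer matches, so real output would still need de-duplication):

    # For each bi-gram that occurs more than once, probe the original token
    # list to see how far the two occurrences keep matching.
    for bigram, positions in bigram_to_index_list.items():
        if len(positions) < 2:
            continue
        a, b = positions[0], positions[1]  # extend only the first pair here
        length = 2
        while b + length < len(tokens) and tokens[a + length] == tokens[b + length]:
            length += 1
        print(f"at {a},{b} - {length} tokens duplication")

On the question's example this prints the expected "at 0,11 - 2 tokens duplication" and "at 3,7 - 4 tokens duplication", plus sub-runs of the latter such as "at 4,8 - 3 tokens duplication".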

Answer by SudoKoach

@Izik Since I'm a new contributor, I can't add a comment! As suggested in @J_H's answer, the only alternative that drastically reduces the search time is a hashing technique. Here is a snippet coded in Java that works for a short token list. Maybe there's an equivalent of the HashMap class in Python.

    import java.util.ArrayList;
    import java.util.HashMap;
    import java.util.List;

    public class DuplicateLists {

        static String[] tokens = new String[]{"this", "string", "is", "test", "to",
                "check", "duplication", "test", "to", "check", "duplication", "this",
                "string", "this", "string", "is", "test", "to", "check", "duplication",
                "test", "to", "check"};
        static List<tp> tp_list = new ArrayList<>();
        static HashMap<String, Integer> token_map = new HashMap<>();

        // One entry per distinct token: the token text and every position it occurs at.
        static class tp {

            String t;
            List<Integer> poslist;

            tp(String tok, int pos) {
                this.t = tok;
                this.poslist = new ArrayList<>();
                this.poslist.add(pos);
            }
        }

        static void createDuplicateLists() {
            tp_list.add(new tp(tokens[0], 0));
            int i = 1;
            int j = 0;  // index of the most recently added entry in tp_list
            token_map.clear();
            token_map.put(tokens[0], j);
            while (i < tokens.length) {
                String tok = tokens[i];
                if (token_map.containsKey(tok)) {
                    // Token seen before: append this position to its list.
                    tp tkp = tp_list.get(token_map.get(tok));
                    tkp.poslist.add(i);
                } else {
                    // New token: create an entry and record where it lives in tp_list.
                    tp_list.add(new tp(tok, i));
                    j++;
                    token_map.put(tok, j);
                }
                i++;
            }
        }

        static void printLists() {
            for (tp entry : tp_list) {
                System.out.println(entry.t + ":" + entry.poslist);
            }
        }

        public static void main(String[] args) {
            createDuplicateLists();
            printLists();
        }
    }

    /*
    Printed Lists:

    this:[0, 11, 13]
    string:[1, 12, 14]
    is:[2, 15]
    test:[3, 7, 16, 20]
    to:[4, 8, 17, 21]
    check:[5, 9, 18, 22]
    duplication:[6, 10, 19]
    */
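
To the question about a Python equivalent: the built-in dict (or collections.defaultdict) plays the role of HashMap, so a sketch of the same createDuplicateLists logic might look like this (all names are illustrative):

    from collections import defaultdict

    tokens = ["this", "string", "is", "test", "to", "check", "duplication",
              "test", "to", "check", "duplication", "this", "string", "this",
              "string", "is", "test", "to", "check", "duplication", "test",
              "to", "check"]

    # token -> list of every position where it occurs (the tp/poslist role above).
    poslists = defaultdict(list)
    for i, tok in enumerate(tokens):
        poslists[tok].append(i)

    for tok, positions in poslists.items():
        print(f"{tok}:{positions}")

This prints the same lists as the Java snippet, in the same insertion order.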