I need to find every substring of a string that has at least a minimum length and occurs at least a minimum number of times. The caveat is that no returned substring may itself be a substring of another returned substring. The function below works but is very inefficient: 97.9% of the time is spent recalculating the suffix array (SA) and LCP array. I settled on this approach because, once I remove the most recently found substring from the string and recalculate the SA and LCP, I can guarantee that none of its substrings will be added later. Is there a more efficient way to do this that only requires calculating the SA and LCP once?
from typing import Dict, List, NamedTuple

import numpy as np
import pandas as pd
from pydivsufsort import divsufsort, kasai


class Substring(NamedTuple):
    length: int
    count: int


def find_unique_repeat_substrings(s: bytes, min_length: int = 20, min_repeats: int = 10) -> Dict[str, Substring]:
    string_dict = dict()
    K = len(s)
    while K >= min_length:
        # recompute the suffix array and LCP array for the current string
        sa = divsufsort(s)
        lcp = kasai(s, sa)
        K_loc = np.argmax(lcp)
        K = np.max(lcp)
        # calculate number of repeats
        loc = K_loc + 1
        while lcp[loc] == K:
            loc += 1
        cnt = loc - K_loc + 1
        longest_string = s[sa[K_loc]:sa[K_loc] + K]
        # add substring to dict
        if cnt >= min_repeats and K >= min_length:
            string_dict[longest_string.decode()] = Substring(length=K, count=cnt)
        # remove substring
        s = s.replace(longest_string, b"")  # Replacing with bytes
    return string_dict
s = "this string is repeated three times in this sentence. string string.".encode()
string_dict = find_unique_repeat_substrings(s,min_length = 4, min_repeats=2)
string_dict
{' string ': Substring(length=8, count=2), 'this': Substring(length=4, count=2)}
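For readers unfamiliar with the counting step, here is a minimal sketch of how the longest repeat and its count fall out of the SA and LCP. It uses naive pure-Python constructions instead of `pydivsufsort`, and the names `naive_suffix_array`, `naive_lcp` and `longest_repeat` are made up for illustration. It assumes the same LCP convention as the function above, i.e. `lcp[i]` compares the suffixes at `sa[i]` and `sa[i + 1]`, with a trailing 0.

```python
from typing import List, Tuple


def naive_suffix_array(s: bytes) -> List[int]:
    """Suffix start positions in lexicographic order (slow, for illustration only)."""
    return sorted(range(len(s)), key=lambda i: s[i:])


def naive_lcp(s: bytes, sa: List[int]) -> List[int]:
    """lcp[i] = common prefix length of suffixes sa[i] and sa[i + 1]; last entry is 0."""
    lcp = [0] * len(sa)
    for i in range(len(sa) - 1):
        a, b = s[sa[i]:], s[sa[i + 1]:]
        k = 0
        while k < len(a) and k < len(b) and a[k] == b[k]:
            k += 1
        lcp[i] = k
    return lcp


def longest_repeat(s: bytes) -> Tuple[bytes, int]:
    """Return the longest repeated substring and its number of occurrences."""
    sa = naive_suffix_array(s)
    lcp = naive_lcp(s, sa)
    if not lcp or max(lcp) == 0:
        return b"", 0  # nothing repeats
    k = max(lcp)
    start = lcp.index(k)  # first position holding the maximum, like np.argmax
    end = start
    # A run of r consecutive LCP entries equal to k means r + 1 adjacent
    # suffixes share the same prefix of length k.
    while end < len(lcp) and lcp[end] == k:
        end += 1
    return s[sa[start]:sa[start] + k], end - start + 1


print(longest_repeat(b"banana"))  # (b'ana', 2)
```

For `b"banana"` the suffix array is `[5, 3, 1, 0, 4, 2]` and the LCP array is `[1, 3, 0, 0, 2, 0]`, so the maximum of 3 at index 1 corresponds to `ana` occurring twice. The bounds check in the `while` loop and the early return are additions that the original function does not have.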
After the answer to @stef, I thought I had a better direction. I'm pretty sure this can be improved, but it's much faster than the original and is probably O(n log n).