Looking for a practical implementation of suffix arrays, I came across this paper. It outlines a O(n (log n * log n)) approach, where n is the length of the string. While there are faster algorithms available, IMO, none is suitable in a programming contest or interview setting. But I found the explanation of the algorithm really lacking. I quote:
The algorithm is mainly based on maintaining the order of the string’s suffixes sorted by their 2k long prefixes. We shall execute m = [log2 n] (ceil) steps, computing the order of the prefixes of length 2k at the kth step. It is used an m x n sized matrix. Let’s denote by Aik the subsequence of A of length 2k starting at position i. The position of Aik in the sorted array of Ajk subsequences (j = 1, n) is kept in P(k, i).
When passing from step k to step k + 1, all the pairs of subsequences Aik and Ai+2^kk are concatenated, therefore obtaining the substrings of length 2k+1. For establishing the new order relation among these, the information computed at the previous step must be used. For each index i it is kept a pair formed by P(k, i) and P(k, i+2^k). The fact that i + 2k may exceed the string’s bounds must not bother us, because we shall fill the string with the "$" character, about which we shall consider that it’s lexicographically smaller than any other character. After sorting, the pairs will be arranged conforming to the lexicographic order of the strings of length 2k+1. One last thing that must be remembered is that at a certain step k there may be several substrings Aik = Ajk, and these must be labeled identically (P(k, i) must be equal to P(k, j)).
The paper provides a C++ implementation, but it's very difficult to glean any insight from the code when the variables are named L, P, and nr, so, I made a simplified implementation in Python. Note that the variable names are to the best of my understanding, which is what I'm hoping to get more clarification on.
def suffix_array(text: str) -> list[int]:
@dataclasses.dataclass(order=True)
class Entry:
prefix: list[int] = dataclasses.field(default_factory=lambda: [0, 0])
start: int = dataclasses.field(default_factory=int, compare=False)
n = len(text)
m = math.ceil(math.log2(n)) + 2
pos = [ord(x) for x in text]
suffixes = [Entry() for _ in range(n)]
for step in range(1, m):
cnt = 2 ** (step - 1)
for i in range(n):
suffixes[i].prefix[0] = pos[i]
suffixes[i].prefix[1] = pos[j] if (j := i + cnt) < n else -1
suffixes[i].start = i
suffixes.sort()
for i in range(n):
if i > 0 and suffixes[i] == suffixes[i - 1]:
pos[suffixes[i].start] = pos[suffixes[i - 1].start]
else:
pos[suffixes[i].start] = i
return [i.start for i in suffixes]
Now, my questions:
- How do we know the algorithm works correctly? The paper provides no proof, and I don't understand how sorting the relative positions of the prefixes is also keeping the prefixes sorted.
- In plain English, what are the purposes of the variables
suffixes,posandEntry.prefixin the code above? These are namedL,P, andnr, respectively, in the C++ code. - At the first iteration (
step=1),suffixes[i].prefix(L[i].nrin C++) becomes the start and the end of a prefix of length 2 starting attext[i].suffixes(Lin C++) is sorted, and positions are stored inpos(Pin C++). At the next iteration (step=2),suffixes[i].prefixarray is populated based onpos[1], so, it is no longer the start and end of a prefix of length 4. What issuffixes[i].prefixthen? - On page 6, it is said "The suffix array will be found on the last row of matrix
P", but this appears to be incorrect. The suffix array is actually given by thestart(pin C++) values of the arraysuffixes(Lin C++).
OP here. A detailed explanation of this algorithm is given in the book Competitive Programming 3 by Steven Halim - Chapter 6.6.4 Suffix Array. A visualization of the algorithm is available here.
I got it now, here are my notes.
Basically, the algorithm works sort of like Radix Sort, whereas parts of the input are sorted first, then the parts after that. By sorting integers and not strings, the algorithm avoids the string comparisons that make the naive algorithm quadratic. At each iteration, the algorithm does the following:
Example: text="GATAGACA".
Iteration 1, suffixes of length 2 are considered. Before and after sorting:
Iteration 2, suffixes of length 4 are considered. Before and after sorting:
Iteration 3 is not shown, suffixes of length 8 are considered.