Practical implementation of suffix array

Question

Practical implementation of suffix array

141 Views Asked by Abhijit Sarkar At 10 July 2023 at 06:28

Looking for a practical implementation of suffix arrays, I came across this paper. It outlines a O(n (log n * log n)) approach, where n is the length of the string. While there are faster algorithms available, IMO, none is suitable in a programming contest or interview setting. But I found the explanation of the algorithm really lacking. I quote:

The algorithm is mainly based on maintaining the order of the string’s suffixes sorted by their 2^k long prefixes. We shall execute m = [log₂ n] (ceil) steps, computing the order of the prefixes of length 2^k at the k^th step. It is used an m x n sized matrix. Let’s denote by A_i^k the subsequence of A of length 2^k starting at position i. The position of A_i^k in the sorted array of A_j^k subsequences (j = 1, n) is kept in P_(k,
i).

When passing from step k to step k + 1, all the pairs of subsequences A_i^k and A_i+2^k^k are concatenated, therefore obtaining the substrings of length 2^k+1. For establishing the new order relation among these, the information computed at the previous step must be used. For each index i it is kept a pair formed by P_{(k, i)} and P_(k,
i+2^k). The fact that i + 2^k may exceed the string’s bounds must not bother us, because we shall fill the string with the "$" character, about which we shall consider that it’s lexicographically smaller than any other character. After sorting, the pairs will be arranged conforming to the lexicographic order of the strings of length 2^k+1. One last thing that must be remembered is that at a certain step k there may be several substrings A_i^k = A_j^k, and these must be labeled identically (P_{(k, i)} must be equal to P_(k,
j)).

The paper provides a C++ implementation, but it's very difficult to glean any insight from the code when the variables are named L, P, and nr, so, I made a simplified implementation in Python. Note that the variable names are to the best of my understanding, which is what I'm hoping to get more clarification on.

def suffix_array(text: str) -> list[int]:
    @dataclasses.dataclass(order=True)
    class Entry:
        prefix: list[int] = dataclasses.field(default_factory=lambda: [0, 0])
        start: int = dataclasses.field(default_factory=int, compare=False)

    n = len(text)
    m = math.ceil(math.log2(n)) + 2
    pos = [ord(x) for x in text]
    suffixes = [Entry() for _ in range(n)]
    for step in range(1, m):
        cnt = 2 ** (step - 1)
        for i in range(n):
            suffixes[i].prefix[0] = pos[i]
            suffixes[i].prefix[1] = pos[j] if (j := i + cnt) < n else -1
            suffixes[i].start = i
        suffixes.sort()
        for i in range(n):
            if i > 0 and suffixes[i] == suffixes[i - 1]:
                pos[suffixes[i].start] = pos[suffixes[i - 1].start]
            else:
                pos[suffixes[i].start] = i

    return [i.start for i in suffixes]

Now, my questions:

How do we know the algorithm works correctly? The paper provides no proof, and I don't understand how sorting the relative positions of the prefixes is also keeping the prefixes sorted.
In plain English, what are the purposes of the variables suffixes, pos and Entry.prefix in the code above? These are named L, P, and nr, respectively, in the C++ code.
At the first iteration (step=1), suffixes[i].prefix (L[i].nr in C++) becomes the start and the end of a prefix of length 2 starting at text[i]. suffixes (L in C++) is sorted, and positions are stored in pos (P in C++). At the next iteration (step=2), suffixes[i].prefix array is populated based on pos[1], so, it is no longer the start and end of a prefix of length 4. What is suffixes[i].prefix then?
On page 6, it is said "The suffix array will be found on the last row of matrix P", but this appears to be incorrect. The suffix array is actually given by the start (p in C++) values of the array suffixes (L in C++).

Original Q&A

There are 1 best solutions below

**Abhijit Sarkar** · Accepted Answer · 2023-07-10T22:52:31.093000

OP here. A detailed explanation of this algorithm is given in the book Competitive Programming 3 by Steven Halim - Chapter 6.6.4 Suffix Array. A visualization of the algorithm is available here.

I got it now, here are my notes.

Basically, the algorithm works sort of like Radix Sort, whereas parts of the input are sorted first, then the parts after that. By sorting integers and not strings, the algorithm avoids the string comparisons that make the naive algorithm quadratic. At each iteration, the algorithm does the following:

Update the ranking pair based on the ranks computed in the previous iteration.
Sort the suffixes based on the updated ranking pairs.
Recompute the ranks of the suffixes. To calculate the new ranks, we set the first one to have new rank r = 0. Then, we iterate from i = [1..n-1]. If the ranking pair of suffix 'i' is different from the ranking pair of the previous suffix in sorted order, we set the rank r = i. Otherwise, the rank stays at r.

Example: text="GATAGACA".

Iteration 1, suffixes of length 2 are considered. Before and after sorting:

+---+--------+--------------+     +---+--------+--------------+------+
| i | Suffix | ranking_pair |     | i | Suffix | ranking_pair | Rank |
+---+--------+--------------+     +---+--------+--------------+------+
| 0 | GA     | [71, 65]     |     | 7 | A      | [65, -1]     |    0 |
| 1 | AT     | [65, 84]     |     | 5 | AC     | [65, 67]     |    1 |
| 2 | TA     | [84, 65]     |     | 3 | AG     | [65, 71]     |    2 |
| 3 | AG     | [65, 71]     |     | 1 | AT     | [65, 84]     |    3 |
| 4 | GA     | [71, 65]     |     | 6 | CA     | [67, 65]     |    4 |
| 5 | AC     | [65, 67]     |     | 0 | GA     | [71, 65]     |    5 |
| 6 | CA     | [67, 65]     |     | 4 | GA     | [71, 65]     |    5 |
| 7 | A      | [65, -1]     |     | 2 | TA     | [84, 65]     |    7 |
+---+--------+--------------+     +---+--------+--------------+------+

Iteration 2, suffixes of length 4 are considered. Before and after sorting:

+---+--------+--------------+     +---+--------+--------------+------+
| i | Suffix | ranking_pair |     | i | Suffix | ranking_pair | Rank |
+---+--------+--------------+     +---+--------+--------------+------+
| 7 | A      | [0, 2]       |     | 7 | A      | [0, 2]       |    0 |
| 5 | AC     | [1, 3]       |     | 5 | AC     | [1, 3]       |    1 |
| 3 | AGAC   | [2, 4]       |     | 3 | AGAC   | [2, 4]       |    2 |
| 1 | ATAG   | [3, 5]       |     | 1 | ATAG   | [3, 5]       |    3 |
| 6 | CA     | [4, 5]       |     | 6 | CA     | [4, 5]       |    4 |
| 0 | GATA   | [5, 7]       |     | 4 | GACA   | [5, 6]       |    5 |
| 4 | GACA   | [5, -1]      |     | 0 | GATA   | [5, -1]      |    6 |
| 2 | TAGA   | [7, -1]      |     | 2 | TAGA   | [6, -1]      |    7 |
+---+--------+--------------+     +---+--------+--------------+------+

Iteration 3 is not shown, suffixes of length 8 are considered.

Practical implementation of suffix array

There are 1 best solutions below

Related Questions in ARRAYS

Related Questions in STRING

Related Questions in ALGORITHM

Related Questions in SUFFIX-TREE

Related Questions in SUFFIX-ARRAY

Trending Questions

Popular # Hahtags

Popular Questions