I have a dictionary of strings that I would like to perform string searches on in real time (web application with approx. 1500 total users).
Background: I have a data table that follows the structure
| C1 | C2 | C3 | C4 | C5 | C6 | C7-13 |
|---|---|---|---|---|---|---|
| String | String | String | String | String | String | Bool |
- ~5000 rows (n)
- Longest String per column: 150 characters
- ~10000 Unique words
- Alphabet: Printable ASCII characters
- Each character in the alphabet is equally likely to appear in each string position.
Search Sudo Algorithm:
- On app startup read in the table as a dict = {C1:[0,...n], C2[0,..., n], ...}. 'Data' argument in minimal ex.
- User types a key (multi-word accepted) that is used to search through the dict values in columns C1-C6.
- Return any rows that contains similar strings to the search.
Minimal Search Example:
from fuzzywuzzy import fuzz
def similarity_sort(key, data):
result_rows = []
temp = []
for index in range(len(data['column 1'])):
check_r = perform_check(key, "{} {} {} {} {} {}".format(data['column1'][index], ['column2'][index], data['column3'][index], data['column4'][index], data['column5'][index], data['column6'][index]))
temp.append([index, check_r])
for x in temp:
if x[1] > 20:
result_rows.append(x[0])
result_rows.sort()
return result_rows
def perform_check(term, row_value):
check_ratio = fuzz.token_set_ratio(term.lower(), row_value.lower())
return check_ratio
With Search Key: 'SQL'
Expected to match with column containing values such as:
SQL
MySQL
SQLdatabase
MySQLdatabase
My question is would a Generalized Suffix Tree be a good fit for my string search problem?. If not are there other data structures I should look into?