How should I structure code/files to avoid repeating time consuming calls to build a data structure?

55 Views Asked by At

I am working on a word puzzle game solver that needs to determine whether or not some candidate word is valid. I am storing all of the valid words in a trie data structure, after they have been loaded from a text file. Building this trie (adding around 370,000 words to the trie) takes around 5 seconds on my computer.

The main part of my program takes an image of a puzzle (a codeword puzzle, to be specific), interprets the puzzle and solves it using a depth-first search. The problem I have right now is that building the trie takes 5 seconds and when a user wants to solve a new image, the trie is rebuilt each time, even though it never changes. The rest of the process, from loading the puzzle image to having a finished result, takes around 1 second, so it is unacceptable to wait 5 seconds to build a trie every time.

I am currently getting around this by partitioning my main script into sections to isolate the trie building code. This is done using #%% in the Spyder IDE, like you can do in Matlab, or having a new cell in a Jupyter notebook.

Here is how I have things structured right now:

# codeword_solver_trie.py

class TrieNode:
    def __init__(self):
        self.children = {}
        self.is_word = False
        
        
class WordDictionary:
    def __init__(self):
        self.root = TrieNode()      

    def addWord(self, word):
        current_node = self.root
        for character in word:
            current_node = current_node.children.setdefault(character, TrieNode())
        current_node.is_word = True
        
    def search(self, word):
        def dfs(node, index):
            if index == len(word):
                return node.is_word
               
            if word[index] == ".":
                for child in node.children.values():
                    if dfs(child, index+1):
                        return True
                    
            if word[index] in node.children:
                return dfs(node.children[word[index]], index+1)
            
            return False
        return dfs(self.root, 0)

I have another file where the main processing occurs. It contains this function:

def build_word_dictionary_trie(words):
    print("Building trie...")
    word_trie = WordDictionary()
    for word in words:
        word_trie.addWord(word)
    return word_trie

The relevant part of the main script is as follows:

word_list_fpath = "data/words_alpha.txt"
all_words = load_all_words(word_list_fpath)

# Build the trie from the word list
word_trie = build_word_dictionary_trie(all_words)

#%% (this separates the execution of the code above from the code below)

# Load image from file and convert from BGR to RGB
original_img = cv2.imread(f"data/codeword_images/{img_num}.jpg")
# ...

How can I build this trie exactly once and have the user be able to re-run the processing an arbitrary number of times without resorting to this hack of sectioning the code with #%%?

0

There are 0 best solutions below