I have a text file containing more than 100k words, one per line. I want to implement a function that returns the list of words containing a given substring. For example, if the substring is "coat", it should return words like "coating", "raincoat", "raincoats", etc.
Is there an efficient algorithm for this, given that the list of words won't change? I was thinking of building a generalized suffix tree using Ukkonen's algorithm, but I can't find a proper implementation of it anywhere. Is there a better way to solve this apart from suffix trees?
EDIT: Each word is at most 100 characters long and contains only lowercase letters.
If you can afford the extra storage, you can generate all suffixes of all words and keep them in a single sorted list (to perform binary searches) or in a hash table. This multiplies the space by the average word length.
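For illustration, here is a minimal sketch of the sorted-list variant in Python. The names `build_index`, `find_words`, and `words.txt` are placeholders, not anything from the question:

```python
import bisect

def build_index(words):
    """Sorted list of (suffix, word) pairs, one entry per suffix of every word."""
    index = []
    for word in words:
        for i in range(len(word)):
            index.append((word[i:], word))
    index.sort()
    return index

def find_words(index, substring):
    """All words containing `substring`, found by binary search on the suffixes."""
    # All suffixes starting with `substring` are contiguous in the sorted list,
    # beginning at the first entry >= (substring,).
    i = bisect.bisect_left(index, (substring,))
    matches = set()
    while i < len(index) and index[i][0].startswith(substring):
        matches.add(index[i][1])
        i += 1
    return matches

with open("words.txt") as f:
    words = [line.strip() for line in f if line.strip()]

index = build_index(words)
print(sorted(find_words(index, "coat")))  # e.g. ['coating', 'raincoat', 'raincoats']
```

Each query then costs O(log(total suffixes)) string comparisons plus the size of the output, at the price of storing every suffix.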
If this average exceeds 4, it can be preferable to store 32-bit pointers that index the suffixes in the original strings (assumed null-terminated) instead of copying them. It is also possible to use 3-byte integers in a packed format.
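A sketch of the pointer-based variant, again in Python and with illustrative names: the words go into one `'\0'`-separated buffer, and the index stores only a sorted array of 32-bit offsets into it. The comparison trick relies on `'\0'` sorting below every lowercase letter, mirroring the null-terminated convention above, and the `key=` parameter of `bisect.bisect_left` requires Python 3.10+:

```python
import bisect
from array import array

def build_offset_index(words):
    """One '\\0'-separated buffer plus one 32-bit offset per suffix, sorted lexicographically."""
    buf = "\0".join(words) + "\0"
    offsets = []
    start = 0
    for word in words:
        offsets.extend(start + i for i in range(len(word)))
        start += len(word) + 1                 # +1 skips the '\0' separator
    # Sort offsets by the suffix they point to, terminated at the next '\0'.
    offsets.sort(key=lambda o: buf[o:buf.index("\0", o)])
    return buf, array("I", offsets)            # 'I' = unsigned int, 4 bytes on typical platforms

def find_words_packed(buf, offsets, substring):
    """Binary search on the offsets; recover each matching word from the buffer."""
    key = lambda o: buf[o:o + len(substring)]
    i = bisect.bisect_left(offsets, substring, key=key)    # key= needs Python 3.10+
    matches = set()
    while i < len(offsets) and key(offsets[i]) == substring:
        o = offsets[i]
        begin = buf.rfind("\0", 0, o) + 1      # start of the word containing this suffix
        end = buf.index("\0", o)               # end of that word
        matches.add(buf[begin:end])
        i += 1
    return matches

# Reusing the `words` list from the sketch above:
# buf, offsets = build_offset_index(words)
# print(sorted(find_words_packed(buf, offsets, "coat")))
```

Since each word is at most 100 characters, every suffix the sort key materializes is short-lived and bounded, while the persistent index is just 4 bytes per suffix.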