I have already seen this answer to a similar question: https://stackoverflow.com/a/44311921/5881884
Where the ahocorasick algorithm is used to show if each word in a list exists in a string or not with O(n). But I want to get the frequency of each word in a list in a string.
For example if
my_string = "some text yes text text some"
my_list = ["some", "text", "yes", "not"]
I would want the result:
[2, 3, 1, 0]
I did not find an exact example for this in the documentation, any idea how to accomplish this?
Other O(n) solutions than using ahocorasick would also be appreciated.
Implementation:
Here's an Aho-Corasick frequency counter:
(For your example, you'd call
ac_frequency(my_list, my_string)
to get the list of counts)For medium-to-large inputs this will be substantially faster than other methods.
Notes:
For real data, this method will potentially yield different results than the other solutions posted, because Aho-Corasick looks for all occurrences of the target words, including substrings.
If you want to find full-words only, you can call
searcher.add_word
with space/punctuation-padded versions of the original string: