number of occurrences of list of words in a string with O(n)

701 Views Asked by At

I have already seen this answer to a similar question: https://stackoverflow.com/a/44311921/5881884

Where the ahocorasick algorithm is used to show if each word in a list exists in a string or not with O(n). But I want to get the frequency of each word in a list in a string.

For example if

my_string = "some text yes text text some"
my_list = ["some", "text", "yes", "not"]

I would want the result:

[2, 3, 1, 0]

I did not find an exact example for this in the documentation, any idea how to accomplish this?

Other O(n) solutions than using ahocorasick would also be appreciated.

4

There are 4 best solutions below

2
On BEST ANSWER

Implementation:

Here's an Aho-Corasick frequency counter:

import ahocorasick

def ac_frequency(needles, haystack):
    frequencies = [0] * len(needles)
    # Make a searcher
    searcher = ahocorasick.Automaton()
    for i, needle in enumerate(needles):
        searcher.add_word(needle, i)
    searcher.make_automaton()
    # Add up all frequencies
    for _, i in searcher.iter(haystack):
        frequencies[i] += 1
    return frequencies

(For your example, you'd call ac_frequency(my_list, my_string) to get the list of counts)

For medium-to-large inputs this will be substantially faster than other methods.

Notes:

For real data, this method will potentially yield different results than the other solutions posted, because Aho-Corasick looks for all occurrences of the target words, including substrings.

If you want to find full-words only, you can call searcher.add_word with space/punctuation-padded versions of the original string:

    ...
    padding_start = [" ", "\n", "\t"]
    padding_end = [" ", ".", ";", ",", "-", "–", "—", "?", "!", "\n"]
    for i, needle in enumerate(needles):
        for s, e in [(s,e) for s in padding_start for e in padding_end]:
            searcher.add_word(s + needle + e, i)
    searcher.make_automaton()
    # Add up all frequencies
    for _, i in searcher.iter(" " + haystack + " "):
    ...
1
On

You can use list comprehensions to count the number of times the specific list occurs in my_string:

[my_string.split().count(i) for i in my_list]
[2, 3, 1, 0]
0
On

You can use a dictionary to count the occurrences of the words you care about:

counts = dict.fromkeys(my_list, 0) # initialize the counting dict with all counts at zero

for word in my_string.split():
    if word in counts:     # this test filters out any unwanted words
        counts[word] += 1  # increment the count

The counts dict will hold the count of each word. If you really do need a list of counts in the same order as the original list of keywords (and the dict won't do), you can add a final step after the loop has finished:

results = [counts[word] for word in my_list]
4
On

The Counter in the collections module may be of use to you:

from collections import Counter

my_string = "some text yes text text some"
my_list = ["some", "text", "yes", "not"]

counter = Counter(my_string.split(' '))
[counter.get(item, 0) for item in my_list]

# out: [2, 3, 1, 0]