Extract acronyms and Māori (non-English) words from a dataframe column and put them in adjacent columns

Regular expressions seem like a steep learning curve for me. I have a dataframe that contains text (up to 300,000 rows). The text, contained in the outcome column of a dummy file named foo_df.csv, is a mixture of English words, acronyms and Māori words. foo_df.csv looks like this:

    outcome
0   I want to go to DHB
1   Self Determination and Self-Management Rangatiratanga
2   mental health wellness and AOD counselling
3   Kai on my table
4   Fishing
5   Support with Oranga Tamariki Advocacy
6   Housing pathway with WINZ
7   Deal with personal matters
8   Referral to Owaraika Health services

The result I desire is in the form of the table below, which has Abbreviation and Māori_word columns:

    outcome                                                 Abbreviation     Māori_word             
0   I want to go to DHB                                     DHB      
1   Self Determination and Self-Management Rangatiratanga                    Rangatiratanga
2   mental health wellness and AOD counselling              AOD              
3   Kai on my table                                                          Kai
4   Fishing                                                                  
5   Support with Oranga Tamariki Advocacy                                    Oranga Tamariki
6   Housing pathway with WINZ                               WINZ             
7   Deal with personal matters                                               
8   Referral to Owaraika Health services                                     Owaraika

The approach I am using is to extract the acronyms with a regular expression and extract the Māori words with the nltk module.

I have been able to extract the acronyms with this code:

# match runs of 2-8 capitals, optionally separated by '.' or '&' (e.g. DHB, A.O.D)
pattern = r'(\b[A-Z](?:[\.&]?[A-Z]){1,7}\b)'
foo_df['Abbreviation'] = foo_df.outcome.str.extract(pattern)
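
(Note that str.extract returns only the first match per row. If an outcome can contain several acronyms, a sketch with str.findall, reusing the same pattern, collects all of them:

# findall returns every match in the row as a list; join for a flat column
foo_df['Abbreviation'] = foo_df.outcome.str.findall(pattern).str.join(', ')

Rows with no acronym end up as an empty string after the join.)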

I have been able to extract non-English words from a sentence using the code below:

import nltk
nltk.download('words')

# set of known English words, for fast membership tests
words = set(nltk.corpus.words.words())

sent = "Self Determination and Self-Management Rangatiratanga"
# keep tokens that are not recognised English words or are not purely alphabetic
" ".join(w for w in nltk.wordpunct_tokenize(sent)
         if w.lower() not in words or not w.isalpha())

However, I got the error TypeError: expected string or bytes-like object when I tried to apply the above code over the dataframe. My attempt is below:

def no_english(text):
  words = set(nltk.corpus.words.words())
  " ".join(w for w in nltk.wordpunct_tokenize(text['outcome']) \
         if not w.lower() in words or not w.isalpha())

foo_df['Māori_word'] = foo_df.apply(no_english, axis = 1)
print(foo_df)
   

Any help in Python 3 will be appreciated. Thanks.

1 Answer

mozway:

You can't magically tell whether a word is English, Māori, or an abbreviation with a simple short regex. In fact, it is quite likely that some words fall into more than one category, so the task itself is not binary (or ternary, in this case).

What you want to do is natural language processing; there are several libraries for language detection in Python. What you'll get is a probability that the input is in a given language. This is usually run on full texts, but you could apply it to single words.
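
For instance, a minimal sketch with the langdetect package (one such library, assumed installed via pip install langdetect; note that single words are often misdetected, and Māori is not necessarily among a given library's supported languages, so treat this as the pattern rather than a ready solution):

from langdetect import detect, DetectorFactory, LangDetectException

DetectorFactory.seed = 0  # make detection deterministic across runs

def detect_language(text):
    # return the most likely language code, or None if detection fails
    try:
        return detect(text)
    except LangDetectException:  # raised for empty or undetectable input
        return None

print(detect_language("Self Determination and Self-Management"))  # likely 'en'
print(detect_language("Rangatiratanga"))  # single words are unreliable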

Another approach is to use Māori and abbreviation dictionaries (i.e. exhaustive or curated lists of words) and craft a function that reports whether a word belongs to one of them, assuming English otherwise, as in the sketch below.
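
A minimal sketch of that dictionary approach, reusing the asker's acronym pattern and a deliberately tiny, illustrative Māori word list (a real application would load a proper Māori lexicon instead):

import pandas as pd

# illustrative only; a real application would use an exhaustive Māori lexicon
maori_words = {"rangatiratanga", "kai", "oranga", "tamariki", "owaraika"}

acronym_pattern = r'\b[A-Z](?:[\.&]?[A-Z]){1,7}\b'

def maori_in(text):
    # guard against NaN rows, the usual cause of
    # "TypeError: expected string or bytes-like object"
    if not isinstance(text, str):
        return ""
    return " ".join(w for w in text.split() if w.lower() in maori_words)

foo_df = pd.DataFrame({"outcome": [
    "I want to go to DHB",
    "Self Determination and Self-Management Rangatiratanga",
    "Support with Oranga Tamariki Advocacy",
]})

foo_df["Abbreviation"] = foo_df.outcome.str.findall(acronym_pattern).str.join(" ")
foo_df["Māori_word"] = foo_df.outcome.apply(maori_in)
print(foo_df)

Because maori_in returns a value for every row (including non-string rows), applying it over the full dataframe avoids the TypeError from the question.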