Regular expressions have a steep learning curve for me. I have a dataframe that contains text (up to 300,000 rows). The text, held in the outcome column of a dummy file named foo_df.csv, is a mixture of English words, acronyms and Māori words. foo_df.csv looks like this:
outcome
0 I want to go to DHB
1 Self Determination and Self-Management Rangatiratanga
2 mental health wellness and AOD counselling
3 Kai on my table
4 Fishing
5 Support with Oranga Tamariki Advocacy
6 Housing pathway with WINZ
7 Deal with personal matters
8 Referral to Owaraika Health services
The result I want is the table below, with added Abbreviation and Māori_word columns:
  | outcome                                               | Abbreviation | Māori_word
0 | I want to go to DHB                                   | DHB          |
1 | Self Determination and Self-Management Rangatiratanga |              | Rangatiratanga
2 | mental health wellness and AOD counselling            | AOD          |
3 | Kai on my table                                       |              | Kai
4 | Fishing                                               |              |
5 | Support with Oranga Tamariki Advocacy                 |              | Oranga Tamariki
6 | Housing pathway with WINZ                             | WINZ         |
7 | Deal with personal matters                            |              |
8 | Referral to Owaraika Health services                  |              | Owaraika
My approach is to extract the acronyms with a regular expression and the Māori words with the nltk module.
I have been able to extract the acronyms with this code:
pattern = r'(\b[A-Z](?:[\.&]?[A-Z]){1,7}\b)'
foo_df['Abbreviation'] = foo_df.outcome.str.extract(pattern)
I have also been able to extract the non-English words from a single sentence using the code below:
import nltk
nltk.download('words')
from nltk.corpus import words
words = set(nltk.corpus.words.words())
sent = "Self Determination and Self-Management Rangatiratanga"
" ".join(w for w in nltk.wordpunct_tokenize(sent) \
if not w.lower() in words or not w.isalpha())
However, I got the error TypeError: expected string or bytes-like object when I tried to apply the above code to every row of the dataframe. My attempt is below:
def no_english(text):
    words = set(nltk.corpus.words.words())
    " ".join(w for w in nltk.wordpunct_tokenize(text['outcome']) \
             if not w.lower() in words or not w.isalpha())
foo_df['Māori_word'] = foo_df.apply(no_english, axis = 1)
print(foo_df)
Any help in Python 3 will be appreciated. Thanks.
You can't magically tell whether a word is English, Māori or an abbreviation with a short, simple regex. In fact, some words are quite likely to belong to more than one category, so the task itself is not a clean binary (or, in this case, ternary) classification.
What you want to do is natural language processing. There are Python libraries for language detection; what you get from them is a probability that the input is in a given language. This is usually run on full texts, but you could apply it to single words.
Another approach is to use Māori and abbreviation dictionaries (exhaustive or hand-picked lists of words), write a function that checks whether a word appears in one of them, and assume English otherwise.
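The dictionary approach could be sketched roughly as follows. This is only a minimal illustration: the names MAORI_WORDS, ABBREVIATIONS, TOKEN_RE and classify are hypothetical, and the two word lists are tiny samples taken from the question's rows, so a real run would need much fuller lists. The guard for non-string input assumes that the TypeError in the question comes from non-string cells such as NaN, which is a common cause but not confirmed here.

```python
import re

# Hypothetical sample lists for illustration only; real lists would need to
# be far more complete (for example, loaded from a Māori dictionary file).
MAORI_WORDS = {"rangatiratanga", "kai", "oranga", "tamariki", "owaraika"}
ABBREVIATIONS = {"DHB", "AOD", "WINZ"}

# Runs of Unicode letters (no digits, underscores or punctuation)
TOKEN_RE = re.compile(r"[^\W\d_]+")


def classify(text):
    """Return (abbreviations, maori_words) found in text, each space-joined."""
    # Non-string cells (e.g. NaN from an empty CSV field) give empty results
    if not isinstance(text, str):
        return "", ""
    tokens = TOKEN_RE.findall(text)
    # Abbreviations are matched case-sensitively, Māori words case-insensitively
    abbreviations = [t for t in tokens if t in ABBREVIATIONS]
    maori = [t for t in tokens if t.lower() in MAORI_WORDS]
    return " ".join(abbreviations), " ".join(maori)


for sentence in ["I want to go to DHB", "Support with Oranga Tamariki Advocacy"]:
    print(sentence, "->", classify(sentence))
```

Anything found in neither list is treated as English and dropped. On a dataframe, the function could be applied to each value of the outcome column, for instance with foo_df['outcome'].apply(classify), and the resulting pairs split into the two new columns.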