How to detect incorrect spellings in a text file using Python?


I am working on an exercise where I have to find the misspelled entries in a text dataset using Python. The blogs I have checked all show how to autocorrect misspellings, but I don't want to autocorrect anything; I just want to separate the misspelled entries from the rest of the dataset.

Sample Dataset:

1. Kurtas for women
2. parti wear dresses
3. denim jeans
4. overcot

Expected Output:

1. parti wear dresses
2. overcot

Timeless On BEST ANSWER

By using pyspellchecker, you can check at each line whether any of its words are unknown and, if so, write that line to a new file. Optionally, you can also add custom words (such as Kurtas) to the dictionary with load_words so that they are not flagged as "misspelled".

# pip install pyspellchecker
from spellchecker import SpellChecker

sp = SpellChecker()  # language="en" by default

# add custom words if needed
sp.word_frequency.load_words(["Kurtas"])

# parenthesized context managers need Python 3.10+
with (
    open("file.txt", "r") as in_f,
    open("newf.txt", "w") as out_f
):
    for line in in_f:
        # unknown() returns the set of words missing from the dictionary
        if sp.unknown(line.split()):
            out_f.write(line)

Output (newf.txt):

parti wear dresses
overcot
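
One thing to watch with line.split() is that punctuation stays attached to the tokens, so something like "dresses," may be treated as unknown. Below is a minimal sketch of a stricter tokenizer, assuming a plain set of known words; KNOWN_WORDS, tokenize and misspelled_lines are hypothetical names used only for illustration, and the tiny vocabulary stands in for a real dictionary such as the one pyspellchecker loads.

```python
import re

# Hypothetical stand-in vocabulary; in practice this would come from a real
# dictionary (for example pyspellchecker's or nltk's word list).
KNOWN_WORDS = {"kurtas", "for", "women", "wear", "dresses", "denim", "jeans"}


def tokenize(line):
    """Return the lower-cased alphabetic words in a line, skipping digits and punctuation."""
    return re.findall(r"[a-z']+", line.lower())


def misspelled_lines(lines, known_words):
    """Keep only the lines that contain at least one word missing from known_words."""
    flagged = []
    for line in lines:
        if any(word not in known_words for word in tokenize(line)):
            flagged.append(line.strip())
    return flagged


dataset = [
    "1. Kurtas for women",
    "2. parti wear dresses",
    "3. denim jeans",
    "4. overcot",
]
result = misspelled_lines(dataset, KNOWN_WORDS)
print(result)
```

With this toy vocabulary, only the lines containing "parti" and "overcot" are kept; with a real dictionary the result depends on which words it knows.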
Ori Yarden PhD On

We can use nltk's words corpus, which contains a list of 236736 words, and compare against it in lower-case:

import nltk
nltk.download('words')
from nltk.corpus import words
correct_words = set(words.words())  # a set makes membership tests fast

some_words = ['Bike', 'happy', 'woman', 'parti']
incorrect_spelling = []
correct_spelling = []
for _word in some_words:
    if _word.lower() not in correct_words:
        incorrect_spelling.append(_word.lower())
    else:
        correct_spelling.append(_word.lower())

print(f'correct: {correct_spelling}')
print(f'incorrect: {incorrect_spelling}')

Outputs:

correct: ['bike', 'happy', 'woman']
incorrect: ['parti']
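
Since a word list may mix capitalized and lower-case entries, it can help to normalize both sides before comparing. The following is a rough sketch of that idea, not part of nltk itself; build_vocabulary, split_by_spelling and sample_word_list are hypothetical names, and the short list stands in for the output of words.words().

```python
def build_vocabulary(word_list):
    """Lower-case every entry and store it in a set for fast lookups."""
    return {word.lower() for word in word_list}


def split_by_spelling(candidates, vocabulary):
    """Return (correct, incorrect) lists of lower-cased candidate words."""
    correct, incorrect = [], []
    for word in candidates:
        key = word.lower()
        if key in vocabulary:
            correct.append(key)
        else:
            incorrect.append(key)
    return correct, incorrect


# Hypothetical stand-in for words.words(); note the mixed capitalization.
sample_word_list = ["Bike", "happy", "woman", "Aaron"]
vocabulary = build_vocabulary(sample_word_list)

correct, incorrect = split_by_spelling(["Bike", "happy", "woman", "parti"], vocabulary)
print(f"correct: {correct}")
print(f"incorrect: {incorrect}")
```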
lucs100 On

Use the pyspellchecker library and have it correct each line. Then compare each original line to the corrected line. If the lines are not equal, a correction has been made.

from spellchecker import SpellChecker

spell = SpellChecker()

def isLineCorrect(line):
    # line is a list of words; correct each one and compare with the original
    correctedLine = []
    for word in line:
        correctedLine.append(spell.correction(word))
    return (line == correctedLine)

>>> isLineCorrect(["kurtas", "for", "women"])
True

>>> isLineCorrect(["parti", "wear", "dresses"])
False

You can split a sentence into words using .split() on a string:

sample = "kurtas for women"
sampleList = sample.split()
#sampleList is now ["kurtas", "for", "women"]
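
Putting the two pieces together, the lines that fail the check can be collected in one pass. This is only a sketch: incorrect_lines and fake_correction are hypothetical names, and fake_correction is a toy stand-in for spell.correction so the example runs without any dictionary installed.

```python
def is_line_correct(words, correct_word):
    """True when correcting each word leaves it unchanged."""
    return all(correct_word(word) == word for word in words)


def incorrect_lines(lines, correct_word):
    """Return the lines in which at least one word would be changed by the corrector."""
    return [line for line in lines if not is_line_correct(line.split(), correct_word)]


# Hypothetical corrector: maps two known typos to their fixes and leaves
# every other word untouched. A real run would pass spell.correction instead.
FIXES = {"parti": "party", "overcot": "overcoat"}


def fake_correction(word):
    return FIXES.get(word, word)


lines = ["kurtas for women", "parti wear dresses", "denim jeans", "overcot"]
flagged = incorrect_lines(lines, fake_correction)
print(flagged)
```

Passing the corrector in as an argument keeps the filtering logic independent of whichever spell-checking library is used.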