Read the contents of several .txt files into Python


I have two folders, each containing words in various .txt files. One folder is named 'good' and the other is named 'bad'. I want to write a Python script that imports all of the data into a dataframe with an 'Id' column, a 'word' column and a 'label' column. The label column will be either 'good' or 'bad', depending on the folder the file came from.

I have written the following Python script, but I seem to be having issues with the file encodings. I installed the 'chardet' library to detect each file's encoding, but I still get this error:

UnicodeDecodeError: 'cp949' codec can't decode byte 0xb7 in position 1400: illegal multibyte sequence
import os
import chardet
import pandas as pd

good_path = "myfolder/good"
bad_path = "myfolder/bad"


ids = []
words = []
labels = []


# Detect each file's encoding first, then reopen and read it with that encoding
for filename in os.listdir(good_path):
    with open(os.path.join(good_path, filename), "rb") as f:
        result = chardet.detect(f.read())
        encoding = result["encoding"]
    with open(os.path.join(good_path, filename), "r", encoding=encoding) as f:
        word_content = f.read()
        ids.append(filename)
        words.append(word_content)
        labels.append("good")


for filename in os.listdir(bad_path):
    with open(os.path.join(bad_path, filename), "rb") as f:
        result = chardet.detect(f.read())
        encoding = result["encoding"]
    with open(os.path.join(bad_path, filename), "r", encoding=encoding) as f:
        word_content = f.read()
        ids.append(filename)
        words.append(word_content)
        labels.append("bad")

# Create a dataframe from the lists
df = pd.DataFrame({"Id": ids, "words": words, "label": labels})


There are 2 best solutions below

Answer from chrisfang:

You can try setting the encoding to utf-8 directly; UTF-8 is fully supported in Python 3.


for filename in os.listdir(good_path):
    with open(os.path.join(good_path, filename), "r", encoding="utf-8") as f:
        word_content = f.read()
        ids.append(filename)
        words.append(word_content)
        labels.append("good")


for filename in os.listdir(bad_path):
    with open(os.path.join(bad_path, filename), "r", encoding="utf-8") as f:
        word_content = f.read()
        ids.append(filename)
        words.append(word_content)
        labels.append("bad")


Answer from highclef:

Thank you all. I was able to filter out the text files with the wrong character encoding and exclude them using a try-except block that catches the UnicodeDecodeError exception.
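
For reference, a minimal sketch of that approach, reusing the folder paths and column names from the question (skipping undecodable files outright is an assumption on my part):


import os
import pandas as pd

ids = []
words = []
labels = []

for folder, label in [("myfolder/good", "good"), ("myfolder/bad", "bad")]:
    for filename in os.listdir(folder):
        try:
            with open(os.path.join(folder, filename), "r", encoding="utf-8") as f:
                word_content = f.read()
        except UnicodeDecodeError:
            # Skip files whose bytes cannot be decoded as UTF-8
            continue
        ids.append(filename)
        words.append(word_content)
        labels.append(label)

df = pd.DataFrame({"Id": ids, "words": words, "label": labels})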