Python-docx confuses upper and lowercase in extracted text portions

I have run into a problem preserving the upper case of text extracted from a .docx document with the python-docx package.

I am iterating over the paragraphs of a .docx document in order to extract the text using python-docx:

from docx import Document

doc = Document("source.docx")  # placeholder path to the .docx file
for paragraph in doc.paragraphs:
    raw_text = paragraph.text

That is quite straightforward. But when I compare the .docx source with the extracted text in the variable raw_text, I often (though not always) find that characters shown in uppercase in the former have become lowercase in the latter, as in the following case:

(source) ПОРЯ́ДОК, дка и дку, м. 1. ед. Состояние благоустроенности, гармонии; правильное расположение, надлежащий вид чего-л.

(raw_text) Поря́док, дка и дку, м. 1. ед. Состояние благоустроенности, гармонии; правильное расположение, надлежащий вид чего-л. Порядок скучен вездѣ, и немножко труден.

I just can't figure out where the problem lies, and I would be most grateful if anyone could explain this strange effect.

Eugen On

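I've finally found the source of the problem. Some characters that look uppercase in the document are in fact stored as lowercase and only displayed in capitals through the "All caps" font formatting, so paragraph.text returns them as stored. My solution is to check the font properties of each run in python-docx: iterate over paragraph.runs and test run.font.all_caps.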
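
A minimal sketch of that check, for illustration only. The helper names displayed_text and extract_displayed_text are made up for this example, and the sketch assumes the all-caps formatting is applied directly to the run. run.font.all_caps can also be None, meaning the value is inherited from a style, and that case is not resolved here. Text that lives outside ordinary runs (for example inside hyperlinks) is not covered either.

```python
def displayed_text(paragraph):
    """Rebuild a paragraph's text, upper-casing runs formatted as all caps."""
    parts = []
    for run in paragraph.runs:
        text = run.text
        # all_caps is True, False, or None (None = inherit from the style).
        # Only an explicit True is handled in this sketch.
        if run.font.all_caps:
            text = text.upper()
        parts.append(text)
    return "".join(parts)


def extract_displayed_text(path):
    """Return the displayed text of every paragraph in the .docx file at path."""
    # python-docx is imported here so displayed_text can be used on its own.
    from docx import Document

    doc = Document(path)
    return [displayed_text(paragraph) for paragraph in doc.paragraphs]
```

Calling extract_displayed_text("source.docx") (a placeholder path) in place of reading paragraph.text should then give text whose capitalization matches what Word shows for directly formatted runs.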