Python-docx confuses upper and lowercase in extracted text portions

92 Views Asked by Eugen At 10 August 2023 at 16:17

I am having trouble preserving uppercase in text extracted from a .docx document with the python-docx package.

I am iterating over the paragraphs of a .docx document in order to extract the text using python-docx:

from docx import Document
doc = Document("source.docx")  # hypothetical path to the .docx file
for paragraph in doc.paragraphs:
    raw_text = paragraph.text

That is straightforward. But when I compare the .docx source with the extracted text in the variable raw_text, I often (though not always) find that uppercase characters in the former have become lowercase in the latter, as in the following case:

(source) ПОРЯ́ДОК, дка и дку, м. 1. ед. Состояние благоустроенности, гармонии; правильное расположение, надлежащий вид чего-л.

(raw_text) Поря́док, дка и дку, м. 1. ед. Состояние благоустроенности, гармонии; правильное расположение, надлежащий вид чего-л. Порядок скучен вездѣ, и немножко труден.

I just can't figure out where the problem lies, and I would be most grateful if anyone could explain this strange effect.


There is 1 best solution below

Eugen On 29 August 2023 at 20:29

I've finally found the source of the problem. Some characters that look uppercase in Word are in fact stored in lowercase and merely formatted as all capitals, so paragraph.text returns them in their stored case. My solution is to check the font properties of each run in paragraph.runs via run.font.all_caps.
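
The check described above could look roughly like the following. This is only a minimal sketch, not a complete solution: the helper names (run_display_text, paragraph_display_text, extract_display_text) are made up for illustration, and it handles only runs where all_caps is set explicitly to True. In python-docx, font.all_caps can also be None, meaning the value is inherited from a style, and that case is not resolved here.

```python
def run_display_text(run):
    """Return a run's text roughly as Word displays it (hypothetical helper)."""
    # font.all_caps is tri-state: True, False, or None (inherited from a style).
    # Assumption: only an explicit True is treated as "all capitals" here;
    # inherited formatting is left untouched.
    if run.font.all_caps:
        return run.text.upper()
    return run.text


def paragraph_display_text(paragraph):
    """Join the display text of every run in the paragraph (hypothetical helper)."""
    return "".join(run_display_text(run) for run in paragraph.runs)


def extract_display_text(path):
    """Return the display text of each paragraph in the document at `path`."""
    # Imported here so the two helpers above do not depend on python-docx.
    from docx import Document

    return [paragraph_display_text(p) for p in Document(path).paragraphs]
```

Calling extract_display_text("source.docx") (the file name is a placeholder) would then return one string per paragraph. Note that paragraph.runs may not cover every piece of text that paragraph.text returns, for example text nested inside hyperlinks, so the two results are worth comparing on a real document.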
