Error when trying to extract text from Word using Python?


I'm currently trying to write a function in Python that extracts text from .docx files, using the python-docx library. The function works, at least for files I create from Python: when I create a .docx file with python-docx and then run my function on it, it returns the text.

However, for .docx files (Word documents) that I have modified or created myself, it cannot open the file at the given path and raises PackageNotFoundError. Searching online, I found the suggestion to check whether my file is a zip file. I did this with zipfile, and in fact my saved Word documents are not zip files. What's going on? Here is my Python code for verification:

from zipfile import is_zipfile
import docx

test_path = "test.docx"  # placeholder; the actual path was not shown

doc = docx.Document()
doc.add_paragraph("Hello")
doc.save(test_path)

print(is_zipfile(test_path))
# output: True

If I then open the file at test_path, type a number, and save it:

print(is_zipfile(test_path))
# output: False

Are modern .docx documents no longer zip files? Or what is wrong here?

Everything I find when googling says that Word documents (.docx files) are zip files. I think this is why the library raises the error and cannot open the file. I appreciate everyone trying to help. Thanks


There is 1 best solution below


If you want more control over the final document, or if you want to change an existing document, you need to open one with a filename:

from docx import Document

document = Document('existing-document-file.docx')
document.save('new-file-name.docx')
  • You can open any Word 2007 or later file this way (.doc files from Word 2003 and earlier won’t work). While you might not be able to manipulate all the contents yet, whatever is already in there will load and save just fine. The feature set is still being built out, so you can’t add or change things like headers or footnotes yet, but if the document has them python-docx is polite enough to leave them alone and smart enough to save them without actually understanding what they are.
  • If you use the same filename to open and save the file, python-docx will obediently overwrite the original file without a peep. You’ll want to make sure that’s what you intend.
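If opening an existing file still raises PackageNotFoundError, it may help to look at what kind of container the saved file actually is before handing it to python-docx. The following is a minimal sketch, not a confirmed diagnosis of the file in the question. The helper names sniff_container and extract_text are hypothetical. It assumes that a file beginning with the OLE2 signature is either a legacy .doc saved under a .docx name or a password-protected document, both of which are stored in an OLE container rather than a zip archive.

```python
from zipfile import is_zipfile

# Leading bytes of the two container formats checked here.
ZIP_MAGIC = b"PK"                              # zip-based .docx (Office Open XML)
OLE_MAGIC = bytes.fromhex("D0CF11E0A1B11AE1")  # OLE2 compound file


def sniff_container(path):
    """Return 'zip', 'ole', 'empty', or 'unknown' for the file at path.

    Hypothetical helper: it only inspects the first bytes and, for zip
    candidates, confirms the result with zipfile.is_zipfile.
    """
    with open(path, "rb") as fh:
        head = fh.read(8)
    if not head:
        return "empty"
    if head.startswith(OLE_MAGIC):
        return "ole"
    if head.startswith(ZIP_MAGIC) and is_zipfile(path):
        return "zip"
    return "unknown"


def extract_text(path):
    """Return the paragraph text of a zip-based .docx, one paragraph per line.

    Raises ValueError with the detected container type when the file is
    not a zip archive, instead of letting python-docx fail on it.
    """
    kind = sniff_container(path)
    if kind != "zip":
        raise ValueError(f"{path} is not a zip-based .docx (detected: {kind})")
    import docx  # python-docx; imported lazily so sniff_container needs only the stdlib
    return "\n".join(p.text for p in docx.Document(path).paragraphs)
```

Calling sniff_container(test_path) on the edited file should show whether it is an OLE container, an empty file, or something else. If it reports 'ole', re-saving the document from Word as an unprotected "Word Document (*.docx)" is one thing to try. This is a suggestion under the assumptions above, not a guaranteed fix.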