How can I extract text in bullet points from a Word file (using the python docx library)?


I am writing a Python program that should print the text in a Word file (let's call it the lecture file) that is formatted with Word's bullet points function. I tried this using the docx library, but never quite managed. I also tried to figure out a way using ChatGPT, but even then I couldn't get it to work. (I am using version 2302 of Word.)

I then tried writing bullet points to a Word file (let's call it the writing file) to see if that would help. Writing text to a bulleted list in a Word file proved to be rather easy. However, I noticed that these bullet points did not seem to have a character selected in Word's bullet point options. When I then tried to print the bulleted text from that file, it worked like a charm.

In my own lecture file, however, a character was selected (the coloured circle) in the bullet point options, and the same code that printed the bulleted text from the writing file did not work for the lecture file. So I think it has something to do with that.

Also, when I print the style of the text in the lecture file (using print(paragraph.style.name)), it says 'No Spacing' for every paragraph. When I do the same for the writing file, it reports different styles, such as the 'List Bullet' I was expecting (and which I need in order to correctly identify which text is in bullet points).

First of all, I read the text from my Word file using the following code:

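from docx import Document

doc = Document(file)  # file: path to the .docx lecture file
for para in doc.paragraphs:
    # print each paragraph's style name next to its text
    print(para.style.name, para.text)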

I then tried different ways of identifying whether text was in bullet points, such as if para.style.name == 'List Bullet': or if para.style.name.startswith('List'):. I also tried matching specific bullet characters, but all to no avail. That makes sense, because in the lecture file para.style.name always reported the 'No Spacing' style, regardless of whether the text was bulleted, even though the bullet points were created by clicking the bullet points option in Word.

Since the bullet points in the writing file were identified as 'List Bullet' by para.style.name, I don't understand why that is not the case for the lecture file. I also don't know how to format the bullet points in my lecture file so that my code can recognize them and print that text. Or, the other way around, I don't know how to write my code so that it recognizes bullet points in Word and prints that text.

The Word file can be found here (https://docs.google.com/document/d/1640UBvbHW0QP1SUii2mK72H8-Zs3hR_b/edit?usp=share_link&ouid=102372228670097852864&rtpof=true&sd=true)
