How to access a text box in a Word document with Python?

I am trying to access all the text in a Word document and then modify it, as part of a broader project. So far I have been able to access most of the paragraphs, tables, etc., but I cannot figure out how to modify the contents of text boxes.

So far I have been using the python-docx library for this task, along with the underlying lxml package to handle pictures and shapes. The snippet below handles paragraphs and the pictures inside them. As an example, it replaces every word in the file with a same-length string of 'a's, and it works fine for most text. How can I access the text boxes?

import io

from docx import Document
from docx.oxml.ns import qn

doc = Document('input.docx')  # placeholder path

for paragraph in doc.paragraphs:
    for run in paragraph.runs:
        if run.text:
            # Replace each word with a same-length string of 'a's,
            # keeping the spaces between words.
            new_text = ' '.join('a' * len(word) for word in run.text.split(' '))
            run.text = new_text
            print(new_text)
            continue

        # If the run contains an inline image, re-add it in a new run
        inlines = run._r.xpath('./w:drawing/wp:inline')
        if not inlines:
            continue
        inline = inlines[0]
        blips = inline.xpath('.//a:blip')
        if not blips:
            continue

        # The blip only stores a relationship id; the image bytes
        # live in the related image part.
        r_id = blips[0].get(qn('r:embed'))
        if r_id is None:
            continue
        img_bytes = doc.part.related_parts[r_id].blob
        new_run = paragraph.add_run()
        new_run.add_picture(io.BytesIO(img_bytes),
                            width=inline.extent.cx, height=inline.extent.cy)

        # Remove the original run
        paragraph._p.remove(run._r)

There is 1 best solution below

Answered by wolfrevo

Text boxes are usually embedded in mc:AlternateContent and <w:drawing> elements (among others).

You could use xpath('.//wps:txbx//w:txbxContent') or similar to get them and then access the text within.
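As a rough illustration of that XPath, here is a minimal sketch that runs it with the standard library's xml.etree.ElementTree against a trimmed, made-up fragment. The names NS, SAMPLE and textbox_texts are hypothetical, not part of any library. With plain lxml the same expression should work through element.xpath(expr, namespaces=NS). As far as I know, python-docx's own xpath() wrapper uses a fixed prefix map that does not include wps, so on its elements a query such as './/w:txbxContent' may be the easier route.

```python
import xml.etree.ElementTree as ET

# Prefix-to-URI map for the prefixes used in the XPath expression.
NS = {
    "w": "http://schemas.openxmlformats.org/wordprocessingml/2006/main",
    "wps": "http://schemas.microsoft.com/office/word/2010/wordprocessingShape",
}

# Trimmed, hypothetical stand-in for one paragraph of word/document.xml.
SAMPLE = f"""
<w:p xmlns:w="{NS['w']}" xmlns:wps="{NS['wps']}">
  <w:r>
    <wps:wsp>
      <wps:txbx>
        <w:txbxContent>
          <w:p><w:r><w:t>Content of TextBox</w:t></w:r></w:p>
        </w:txbxContent>
      </wps:txbx>
    </wps:wsp>
  </w:r>
</w:p>
"""


def textbox_texts(root):
    """Return the concatenated text of each text box found under root."""
    texts = []
    for content in root.findall(".//wps:txbx//w:txbxContent", NS):
        # Join every w:t element inside this text box.
        texts.append("".join(t.text or "" for t in content.findall(".//w:t", NS)))
    return texts


print(textbox_texts(ET.fromstring(SAMPLE)))
```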

The following example omits several elements and shows the position of wps:txbx:

<w:p w:rsidR="00CC64EA" w:rsidRDefault="00CC64EA" w:rsidP="00CC64EA">
  <w:r>
    <mc:AlternateContent>
      <mc:Choice Requires="wps">
        <w:drawing>
          <wp:anchor distT="45720" distB="45720" distL="114300" distR="114300"
            simplePos="0" relativeHeight="251659264" behindDoc="0" locked="0"
            layoutInCell="1" allowOverlap="1">
            <a:graphic
               xmlns:a="http://schemas.openxmlformats.org/drawingml/2006/main">
              <a:graphicData
                 uri="http://schemas.microsoft.com/office/word/2010/wordprocessingShape">
                <wps:wsp>
                  <wps:txbx>
                    <w:txbxContent>
                      <w:p w:rsidR="00CC64EA"
                        w:rsidRDefault="00CC64EA">
                        <w:r>
                          <w:t>Content of TextBox</w:t>
                        </w:r>
                      </w:p>
                    </w:txbxContent>
                  </wps:txbx>
                </wps:wsp>
              </a:graphicData>
            </a:graphic>
          </wp:anchor>
        </w:drawing>
      </mc:Choice>
    </mc:AlternateContent>
  </w:r>
</w:p>
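
Building on the structure above, the following sketch shows one way the text inside text boxes could be modified. It is a simplified example rather than a definitive implementation: mask_text, mask_textboxes and SAMPLE are hypothetical names, and the fragment is heavily trimmed. It assumes that Word typically also writes an mc:Fallback copy of the text box (as VML) next to the mc:Choice branch, so the function walks every w:txbxContent element to keep both copies in sync.

```python
import xml.etree.ElementTree as ET

W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
MC = "http://schemas.openxmlformats.org/markup-compatibility/2006"
WPS = "http://schemas.microsoft.com/office/word/2010/wordprocessingShape"
V = "urn:schemas-microsoft-com:vml"

# Trimmed, hypothetical fragment: body text plus one text box that is
# stored twice (mc:Choice and the assumed mc:Fallback VML copy).
SAMPLE = f"""
<w:p xmlns:w="{W}" xmlns:mc="{MC}" xmlns:wps="{WPS}" xmlns:v="{V}">
  <w:r><w:t>Body text</w:t></w:r>
  <w:r>
    <mc:AlternateContent>
      <mc:Choice Requires="wps">
        <wps:wsp><wps:txbx><w:txbxContent>
          <w:p><w:r><w:t>Content of TextBox</w:t></w:r></w:p>
        </w:txbxContent></wps:txbx></wps:wsp>
      </mc:Choice>
      <mc:Fallback>
        <w:pict><v:shape><v:textbox><w:txbxContent>
          <w:p><w:r><w:t>Content of TextBox</w:t></w:r></w:p>
        </w:txbxContent></v:textbox></v:shape></w:pict>
      </mc:Fallback>
    </mc:AlternateContent>
  </w:r>
</w:p>
"""


def mask_text(text):
    """Replace each space-separated word with a same-length run of 'a'."""
    return " ".join("a" * len(word) for word in text.split(" "))


def mask_textboxes(root):
    """Mask the text of every w:t element that sits inside a text box."""
    for content in root.iter(f"{{{W}}}txbxContent"):
        for t in content.iter(f"{{{W}}}t"):
            if t.text:
                t.text = mask_text(t.text)


root = ET.fromstring(SAMPLE)
mask_textboxes(root)
# Text outside the text box is untouched; both text box copies are masked.
print([t.text for t in root.iter(f"{{{W}}}t")])
```

Because the function only relies on iter() with {namespace}tag names, it should also accept the lxml-based tree that python-docx exposes, for example mask_textboxes(doc.element) followed by doc.save(...). Text boxes in headers and footers live in separate parts and would need the same treatment there.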