I am trying to access all the text within a Word document and then modify it, as part of a broader project. So far I have been able to access most of the paragraphs, tables, etc. Now I cannot figure out how to modify the contents of text boxes.
So far I have been using the python-docx library for this. I also used the underlying lxml package to handle pictures and shapes. Below is a code snippet that handles paragraphs and the pictures inside them; as an example, it changes every word in the file to a string of 'a' characters, and it works fine for most text. How can I access the text boxes?
    import io

    # `doc` is a docx.Document that has already been opened.
    for paragraph in doc.paragraphs:
        for run in paragraph.runs:
            if run.text:
                # Replace each word with the same number of 'a' characters.
                words = run.text.split(' ')
                new_text = ' '.join('a' * len(word) for word in words)
                run.text = new_text
                print(new_text)
                continue
            # Inline pictures live in <w:drawing><wp:inline>; other drawings are skipped.
            inlines = run._r.xpath('./w:drawing/wp:inline')
            embed_ids = inlines[0].xpath('.//a:blip/@r:embed') if inlines else []
            if not embed_ids:
                continue
            inline = inlines[0]
            # The blip only references the image; its bytes are in the related part.
            img_bytes = doc.part.related_parts[embed_ids[0]].blob
            new_run = paragraph.add_run()
            new_run.add_picture(io.BytesIO(img_bytes),
                                width=inline.extent.cx, height=inline.extent.cy)
            # Remove the original run
            paragraph._p.remove(run._r)
Text boxes are usually embedded in mc:AlternateContent and <w:drawing> elements (among others). You could use xpath('.//wps:txbx//w:txbxContent') or similar to get them and then access the text within. The following example omits several elements and shows the position of wps:txbx:
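A rough Python sketch of that lookup, under a few assumptions: it uses the standard library's ElementTree on a hand-written, heavily trimmed fragment (not a real document), and it matches elements by their full namespace names with iter() rather than by an xpath prefix, because the wps prefix may not be registered in python-docx's own xpath helper. The names mask_word, mask_textboxes and SAMPLE are made up for illustration.

```python
import xml.etree.ElementTree as ET

# Namespace URIs for WordprocessingML and Word 2010 drawing shapes.
W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
WPS = "http://schemas.microsoft.com/office/word/2010/wordprocessingShape"


def mask_word(text):
    """Replace every non-space character with 'a', keeping the spaces."""
    return "".join(" " if ch == " " else "a" for ch in text)


def mask_textboxes(root):
    """Mask all w:t text under wps:txbx//w:txbxContent; return how many were changed."""
    changed = 0
    for txbx in root.iter(f"{{{WPS}}}txbx"):
        for content in txbx.iter(f"{{{W}}}txbxContent"):
            for t in content.iter(f"{{{W}}}t"):
                if t.text:
                    t.text = mask_word(t.text)
                    changed += 1
    return changed


# Hypothetical fragment for illustration only; real documents contain many more elements.
SAMPLE = f"""
<w:p xmlns:w="{W}" xmlns:wps="{WPS}">
  <w:r><w:t>outside</w:t></w:r>
  <w:r><w:drawing><wps:wsp><wps:txbx><w:txbxContent>
    <w:p><w:r><w:t>Hello box</w:t></w:r></w:p>
  </w:txbxContent></wps:txbx></wps:wsp></w:drawing></w:r>
</w:p>
"""

root = ET.fromstring(SAMPLE)
count = mask_textboxes(root)
texts = [t.text for t in root.iter(f"{{{W}}}t")]
print(count, texts)
```

With python-docx, doc.element.body is an lxml element and should accept the same iter() calls, so passing it to mask_textboxes is the intended use, though that has not been verified here against a real .docx file. Text boxes wrapped in mc:AlternateContent typically also carry a fallback copy of the text that this sketch does not touch.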