I have to batch-edit around 800 Word documents, specifically to update the dates in the footers. Some footers have the date typed as normal text, but others have the date inside a content control box. My code isn't working on these boxes, and the API documentation for this module doesn't explain how to handle content controls. I tried treating them as inline shapes, but that's not working either. In Word, I can right-click the box and, under "edit field", change the data type from date to text. I can't possibly do that for 800+ documents, so I'm looking for a Pythonic way to deal with this!
import os
import re

from docx import Document

# Dates with a four-character year (pattern) or a two-character year (pattern2)
pattern = r'\b\w{2}-\w{3}-\w{4}\b'
pattern2 = r'\b\w{2}-\w{3}-\w{2}\b'

# word_docs_folder and filename_to_data (filename -> new date) are assumed
# to be defined earlier (not shown)
for filename in os.listdir(word_docs_folder):
    if filename.endswith('.docx'):
        doc_path = os.path.join(word_docs_folder, filename)
        if filename in filename_to_data:
            new_date = str(filename_to_data[filename])
            doc = Document(doc_path)
            for section in doc.sections:
                footer = section.footer
                for paragraph in footer.paragraphs:
                    for run in paragraph.runs:
                        if re.search(pattern, run.text):
                            run.text = re.sub(pattern, new_date, run.text)
                for paragraph in section.first_page_footer.paragraphs:
                    for run in paragraph.runs:
                        if re.search(pattern2, run.text):
                            run.text = re.sub(pattern2, new_date, run.text)
            # Attempt at the content controls: this fails, because inline
            # shapes are pictures and expose no text
            for cc in doc.inline_shapes:
                content = cc.text
                if re.search(pattern2, content):
                    content = re.sub(pattern2, new_date, content)
                    cc.text = content
I'm basically matching the date pattern and replacing it with dates from an Excel sheet. This works on all the pages that don't contain those godforsaken boxes. Ideally, I would like to delete the box but keep the text.
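
For reference, here is a rough sketch of the direction I think this might have to go, under some assumptions I have not confirmed. It assumes the boxes are stored in the footer XML as structured document tags (`w:sdt` elements whose visible text sits under `w:sdtContent`), which would explain why `paragraph.runs` never sees them. If they are really date fields, the tag names would differ. The function name `replace_dates_in_controls` and the sample footer XML are made up for illustration, and the sketch uses only the standard library so it can be run on its own.

```python
import re
import xml.etree.ElementTree as ET

W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
SDT, SDT_CONTENT, TEXT = f"{{{W}}}sdt", f"{{{W}}}sdtContent", f"{{{W}}}t"

# Same shape as pattern/pattern2 above: four- or two-character year.
DATE_RE = re.compile(r"\b\w{2}-\w{3}-(?:\w{4}|\w{2})\b")


def replace_dates_in_controls(root, new_date, date_re=DATE_RE):
    """Rewrite dates inside each w:sdt under root, then unwrap the control.

    Hypothetical helper. Returns the number of controls that were unwrapped.
    """
    unwrapped = 0
    # Snapshot the controls first because the tree is modified while looping.
    for sdt in list(root.iter(SDT)):
        content = sdt.find(SDT_CONTENT)
        # Rebuild the child -> parent map each time, since unwrapping moves nodes.
        parents = {child: parent for parent in root.iter() for child in parent}
        parent = parents.get(sdt)
        if content is None or parent is None:
            continue

        # A date may be split over several w:t nodes, so match on the joined text.
        texts = list(content.iter(TEXT))
        joined = "".join(t.text or "" for t in texts)
        if texts and date_re.search(joined):
            texts[0].text = date_re.sub(new_date, joined)
            for extra in texts[1:]:
                extra.text = ""

        # Move the control's children up into its parent, then drop the wrapper.
        position = list(parent).index(sdt)
        for offset, child in enumerate(list(content)):
            parent.insert(position + offset, child)
        parent.remove(sdt)
        unwrapped += 1
    return unwrapped


# Made-up footer fragment with one date control whose text is split over two runs.
FOOTER_XML = f"""
<w:ftr xmlns:w="{W}">
  <w:p>
    <w:r><w:t>Revised: </w:t></w:r>
    <w:sdt>
      <w:sdtPr><w:date/></w:sdtPr>
      <w:sdtContent>
        <w:r><w:t>01-Jan-</w:t></w:r>
        <w:r><w:t>2020</w:t></w:r>
      </w:sdtContent>
    </w:sdt>
  </w:p>
</w:ftr>
"""
footer = ET.fromstring(FOOTER_XML)
count = replace_dates_in_controls(footer, "05-Mar-2024")
print(count, "".join(t.text or "" for t in footer.iter(TEXT)))
```

If the module's footer object exposes its underlying XML element (I believe `section.footer._element` does in python-docx, but that is a private attribute, so treat it as an assumption), the same function could presumably be called on it for each section before saving the document. Note that collapsing the joined text into the first `w:t` drops any formatting differences between the runs inside the box.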