Extract text from a Word document and store it in an Excel file using Python


I have a Word document that is a standard template used for our wider meetings. There are two columns in the document: the first column holds the headers and the second column holds the corresponding details. I have attached a screenshot to show the structure of the Word document.

I would now like to extract the text from both columns using Python and store it in a DataFrame. The resulting DataFrame should look like the following:

Title     In Force?   Date          Who attended the event? 
Test      Yes         03/10/1999    X, Y

How can I achieve this?

Bushmaster On BEST ANSWER

Here is the parser from abdulsaboor's answer:

import pandas as pd

def get_table_from_docx(document):
    """Return one DataFrame per table found in a python-docx Document."""
    tables = []
    for table in document.tables:
        # Pre-fill a rows x columns grid with empty strings.
        df = [['' for _ in range(len(table.columns))] for _ in range(len(table.rows))]
        for i, row in enumerate(table.rows):
            for j, cell in enumerate(row.cells):
                if cell.text:
                    df[i][j] = cell.text
        tables.append(pd.DataFrame(df))
    return tables  # a list of DataFrames, one per table

Then:

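from docx import Document  # provided by the python-docx package

document = Document("meeting.docx")  # placeholder path; use your own file
df = get_table_from_docx(document)[0]
df = df.set_index(0).T  # first column becomes the header, then transpose to a single row
df["Who attended the event?"] = df["Who attended the event?"].str.replace("\n", ", ")  # bullets arrive as "\n"; replace them with commas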

Out:

0 Title In Force?    Date          Who attended the event?
1 Test  Yes          03/10/1999    X, Y
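
If you want to check the reshaping step without a Word file at hand, here is a minimal sketch that builds the parsed table by hand. The `raw` frame is a hypothetical stand-in for what `get_table_from_docx` would return for the template in the question; the cell values are assumed from the expected output there.

```python
import pandas as pd

# Hypothetical stand-in for one parsed two-column table:
# column 0 holds the headers, column 1 holds the details.
raw = pd.DataFrame([
    ["Title", "Test"],
    ["In Force?", "Yes"],
    ["Date", "03/10/1999"],
    ["Who attended the event?", "X\nY"],  # bulleted cell, items separated by "\n"
])

# Use the header column as the index, then transpose so headers become columns.
wide = raw.set_index(0).T

# Join the bulleted items with commas (plain string replace, no regex).
wide["Who attended the event?"] = wide["Who attended the event?"].str.replace(
    "\n", ", ", regex=False
)

print(wide.columns.tolist())
print(wide.iloc[0].tolist())
```

The transpose leaves a single row labelled 1, because the values came from column 1 of the original table.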

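Note: If the document contains multiple tables that share the same headers, you can use this:

df_list = get_table_from_docx(document)
frames = []
for table_df in df_list:
    frames.append(table_df.set_index(0).T)  # one row per table
# Concatenate once instead of growing a DataFrame inside the loop.
final_df = pd.concat(frames, ignore_index=True)
# Clean the attendee column once, after concatenation.
final_df["Who attended the event?"] = final_df["Who attended the event?"].str.replace("\n", ", ")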
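
The question title also asks about storing the result in an Excel file. A minimal sketch of that last step, assuming an Excel writer engine such as openpyxl is installed alongside pandas. The helper name `save_to_excel` and the sample frame are made up for illustration and are not part of the original answer.

```python
import pandas as pd

def save_to_excel(df, target):
    """Write df to target: a path such as 'meetings.xlsx' or a binary buffer."""
    # index=False leaves out the row labels, which carry no meaning here.
    df.to_excel(target, index=False)

# Stand-in for final_df, using the columns from the expected output.
final_df = pd.DataFrame(
    [["Test", "Yes", "03/10/1999", "X, Y"]],
    columns=["Title", "In Force?", "Date", "Who attended the event?"],
)
```

Calling `save_to_excel(final_df, "meetings.xlsx")` (the file name is a placeholder) should then produce a workbook with one header row and one row per table. The Date column is written as text, because python-docx returns cell contents as strings.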