Why does OpenXML sometimes read tables as single cells?


I am writing a program that reads tables from Word documents and stores them in an Excel file. Sometimes a table in a document is read as having only 1 row and 1 column. In Word, these tables look and behave the same as any others. Each document contains multiple tables, and in the affected documents only some of the tables are read this way.

My code is below.

using (WordprocessingDocument doc = WordprocessingDocument.Open(workOrder, false))
{
    List<System.Data.DataTable> tableList = new List<System.Data.DataTable>();
    string customer = doc.MainDocumentPart.Document.Body.Elements<Table>().ElementAt(0).Elements<TableRow>().ElementAt(16).Elements<TableCell>().ToArray()[1].InnerText;

    //Go through each workorder and find the tables that contain the needed info
    foreach (Table table in doc.MainDocumentPart.Document.Body.Elements<Table>())
    {
        TableRow row = table.Elements<TableRow>().ElementAt(0);
        //Sometimes there is an invisible row that is empty so check the next row if that is the case
        try
        {
            if (row.Elements<TableCell>().ToArray()[0].InnerText == "" && row.Elements<TableCell>().ToArray()[1].InnerText == "" && table.Elements<TableRow>().Count() > 1)
            {
                row = table.Elements<TableRow>().ElementAt(1);
            }
            List<Table> chkTables = new List<Table>();
            //Check the row to see if this is the right table
            foreach (TableCell cell in row.Elements<TableCell>())
            {
                if (((cell.InnerText.ToLower().Contains("phase") && row.Elements<TableCell>().Count() >= 4) || row.Elements<TableCell>().Count() == 5) && !chkTables.Contains(table))
                {
                    tableList.Add(ReadWordTable(workOrder, table));
                    chkTables.Add(table);
                }
            }
        }
        catch
        {
            //Any exception is swallowed here, so a table that throws is skipped silently
        }
    }
}

Most of the time this works, and the tables are read normally. For some documents, though, stepping through the code shows that an entire table is read as a single cell. In some documents, Open XML reads several tables as one table that is 1x1.


There is 1 best solution below

rockemsockem On

Thanks to @user246821, who provided me the link to the SDK Productivity Tool. Using that, I was able to determine that the issue was that the tables were nested inside another table tag, as shown in the screenshot (image: Word document XML elements as displayed in the Open XML SDK 2.5 Productivity Tool).
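To make the nesting concrete, here is a minimal sketch that uses only System.Xml.Linq, not the Open XML SDK. The XML in SampleBody is a hand-written, hypothetical fragment (it is not taken from the asker's documents), and the class and method names are made up for illustration. It shows why enumerating only the direct children of the body finds one table with one row, while the real table sits inside that table's single cell.

```csharp
using System;
using System.Linq;
using System.Xml.Linq;

public static class NestedTableDemo
{
    // WordprocessingML main namespace (the "w:" prefix in document.xml).
    private static readonly XNamespace W =
        "http://schemas.openxmlformats.org/wordprocessingml/2006/main";

    // Hypothetical body: an outer 1x1 table whose only cell holds a nested 1x2 table.
    public const string SampleBody = @"
<w:body xmlns:w='http://schemas.openxmlformats.org/wordprocessingml/2006/main'>
  <w:tbl>
    <w:tr>
      <w:tc>
        <w:tbl>
          <w:tr>
            <w:tc><w:p><w:r><w:t>Phase</w:t></w:r></w:p></w:tc>
            <w:tc><w:p><w:r><w:t>Hours</w:t></w:r></w:p></w:tc>
          </w:tr>
        </w:tbl>
        <w:p/>
      </w:tc>
    </w:tr>
  </w:tbl>
</w:body>";

    // Tables that are direct children of the body, comparable to Body.Elements<Table>().
    public static int CountTopLevelTables(XElement body) =>
        body.Elements(W + "tbl").Count();

    // Tables at any depth, including those nested inside cells.
    public static int CountAllTables(XElement body) =>
        body.Descendants(W + "tbl").Count();

    public static void Main()
    {
        XElement body = XElement.Parse(SampleBody);
        XElement outer = body.Elements(W + "tbl").First();

        Console.WriteLine(CountTopLevelTables(body));          // 1
        Console.WriteLine(outer.Elements(W + "tr").Count());   // 1 (the outer table has a single row)
        Console.WriteLine(CountAllTables(body));               // 2
    }
}
```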

I modified my code to run recursively: for each cell, it checks whether the cell contains any table elements and, if so, processes those tables the same way. Below is the code as a separate method.

public static void ListTables(List<System.Data.DataTable> tableList, string workOrder, IEnumerable<Table> tables)
{
    //Note: this dictionary is local to each call, so callers never see its contents
    Dictionary<string, string> failedToRead = new Dictionary<string, string>();

    //Go through each workorder and find the tables that contain the needed info
    foreach (Table table in tables)
    {
        TableRow row = table.Elements<TableRow>().ElementAt(0);
        //Sometimes there is an invisible row that is empty so check the next row if that is the case
        try
        {
            if (row.Elements<TableCell>().ToArray()[0].InnerText == "" && row.Elements<TableCell>().ToArray()[1].InnerText == "" && table.Elements<TableRow>().Count() > 1)
            {
                row = table.Elements<TableRow>().ElementAt(1);
            }
            List<Table> chkTables = new List<Table>();
            //Check the row to see if this is the right table
            foreach (TableCell cell in row.Elements<TableCell>())
            {
                if (((cell.InnerText.ToLower().Contains("phase") && row.Elements<TableCell>().Count() >= 4) || row.Elements<TableCell>().Count() == 5) && !chkTables.Contains(table))
                {
                    tableList.Add(ReadWordTable(workOrder, table));
                    chkTables.Add(table);
                }
                //Recurse into any tables nested inside this cell
                if (cell.Elements<Table>().Count() > 0)
                {
                    ListTables(tableList, workOrder, cell.Elements<Table>());
                }
            }
        }
        catch (Exception ex)
        {
            //Use the indexer rather than Add: Add throws if a second table fails for the same workOrder key
            failedToRead[workOrder] = ex.ToString();
        }
    }
}
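
As a possible alternative to the recursion above, the Open XML SDK's Descendants<Table>() walks the whole element tree, so it returns nested tables along with top-level ones. The sketch below is not the answer's method: TableFinder, FindTables, and the minCells parameter are hypothetical names, and the filter is a simplified stand-in for the "phase" / cell-count check in the original code. It assumes the DocumentFormat.OpenXml package is referenced.

```csharp
using System.Collections.Generic;
using System.Linq;
using DocumentFormat.OpenXml.Wordprocessing;

public static class TableFinder
{
    // Hypothetical helper: returns tables at any nesting depth whose first row
    // has at least minCells direct cells. Wrapper tables with a single cell are
    // filtered out when minCells is greater than 1.
    public static List<Table> FindTables(Body body, int minCells)
    {
        return body.Descendants<Table>()
                   .Where(t =>
                   {
                       // Elements<T>() looks at direct children only, so cells of
                       // nested tables are not counted toward the outer row.
                       TableRow first = t.Elements<TableRow>().FirstOrDefault();
                       return first != null
                           && first.Elements<TableCell>().Count() >= minCells;
                   })
                   .ToList();
    }
}
```

With an open document, this could be called as TableFinder.FindTables(doc.MainDocumentPart.Document.Body, 4), and each returned table passed to ReadWordTable as before. Unlike the recursive version, it does not skip a nested table when an earlier check on its parent throws.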