pd.read_html mangles a table header, and adds the entire content of an inner table in the top cell

This is a minimal example of a webpage that pandas is mangling.

<table>
    <tbody>
        <tr>
            <th>
                <div>
                    <span>Header</span>
                </div>
            </th>
        </tr>
        <tr>
            <td>
                <table>
                    <tbody>
                        <tr>
                            <td>Qu2</td>
                            <td>23-09-13</td> 
                        </tr>
                        <tr>
                           
                            <td>Br</td>
                            <td>R72</td>
                        </tr>
                    </tbody>
                </table>
            </td>
        </tr>
    </tbody>
</table>

This is what the HTML looks like when rendered: [image]

However, pd.read_html (with every flavor: 'bs4', 'lxml', ...) gives this: [image]

As you can see, the entire data is also scrunched up in the first cell.

Is there a workaround? Is this working as intended (WAI)?

Night Train On

You have a nested table, as you can see from the nested <table> tags in your HTML. For the provided HTML, pd.read_html() returns a list with 2 DataFrames. Simply select the 2nd DataFrame, which is the inner table.

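from io import StringIO
import pandas as pd
dfs = pd.read_html(StringIO("<HTML STRING>"))  # list with one DataFrame per <table>; recent pandas expects a file-like object rather than a raw string
df = dfs[1]  # the second DataFrame is the inner table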

This will give you the DataFrame you are looking for.
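
If you would rather not depend on the position of the inner table in the list, one possible workaround is to pull out only the tables that contain no nested table before building a DataFrame. The following is a minimal sketch of that idea using Python's standard html.parser instead of pd.read_html. LeafTableParser and the html variable are hypothetical names made up for this illustration, and the HTML is a condensed copy of the example from the question.

```python
from html.parser import HTMLParser


class LeafTableParser(HTMLParser):
    """Collect the rows of every <table> that has no nested <table> (illustrative helper)."""

    def __init__(self):
        super().__init__()
        self.tables = []   # finished leaf tables, each a list of rows
        self._stack = []   # one frame per currently open <table>
        self._cell = None  # text pieces of the cell being read, or None

    def handle_starttag(self, tag, attrs):
        if tag == "table":
            if self._stack:
                # The enclosing table contains a nested table, so it is not a leaf.
                self._stack[-1]["leaf"] = False
            self._stack.append({"rows": [], "leaf": True})
            self._cell = None  # stop collecting text for the enclosing cell
        elif tag == "tr" and self._stack:
            self._stack[-1]["rows"].append([])
        elif tag in ("td", "th") and self._stack:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None and self._stack:
            # Join the text pieces and collapse runs of whitespace.
            text = " ".join("".join(self._cell).split())
            rows = self._stack[-1]["rows"]
            if rows:
                rows[-1].append(text)
            self._cell = None
        elif tag == "table" and self._stack:
            frame = self._stack.pop()
            if frame["leaf"]:
                self.tables.append(frame["rows"])


# Condensed version of the HTML from the question (assumed well-formed).
html = """
<table><tbody>
  <tr><th><div><span>Header</span></div></th></tr>
  <tr><td>
    <table><tbody>
      <tr><td>Qu2</td><td>23-09-13</td></tr>
      <tr><td>Br</td><td>R72</td></tr>
    </tbody></table>
  </td></tr>
</tbody></table>
"""

parser = LeafTableParser()
parser.feed(html)
inner_rows = parser.tables[0]
print(inner_rows)
```

The resulting list of rows can then be handed to pd.DataFrame(inner_rows) if you need a DataFrame. This sketch ignores colspan, rowspan and header detection, so it is only suitable for simple tables like the one above.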