pd.read_html mangles a table header, and adds the entire content of an inner table in the top cell

26 Views Asked by vish At 16 February 2024 at 13:04

This is a minimal example of a webpage that pandas is mangling.

<table>
    <tbody>
        <tr>
            <th>
                <div>
                    <span>Header</span>
                </div>
            </th>
        </tr>
        <tr>
            <td>
                <table>
                    <tbody>
                        <tr>
                            <td>Qu2</td>
                            <td>23-09-13</td> 
                        </tr>
                        <tr>
                            <td>Br</td>
                            <td>R72</td>
                        </tr>
                    </tbody>
                </table>
            </td>
        </tr>
    </tbody>
</table>

This is what the HTML looks like.

However, pd.read_html (with every flavor I tried: 'bs4', 'lxml', ...) gives this:

As you can see, the entire content of the inner table is also crammed into the first cell.

Is there a workaround? Is this working as intended (WAI)?


Night Train On 19 February 2024 at 09:33

You have a nested table, as you can see from the nested <table> tags in your HTML. For the provided HTML, pd.read_html() returns a list of two DataFrames. Simply select the second DataFrame, which corresponds to the inner table.

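from io import StringIO
import pandas as pd
tables = pd.read_html(StringIO("<HTML STRING>"))  # a list of DataFrames; literal HTML is wrapped in StringIO (raw strings are deprecated since pandas 2.1)
df = tables[1]  # the second DataFrame is the inner table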

This will give you the DataFrame you are looking for.
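For reference, here is a minimal self-contained sketch of the same idea using the HTML from the question (with the missing closing </tr> added). The variable names `HTML`, `tables`, and `inner` are only illustrative, and the sketch assumes a parser backend such as lxml (or bs4 with html5lib) is installed.

```python
from io import StringIO

import pandas as pd

# The nested-table HTML from the question, made well-formed.
HTML = """
<table>
  <tbody>
    <tr><th><div><span>Header</span></div></th></tr>
    <tr>
      <td>
        <table>
          <tbody>
            <tr><td>Qu2</td><td>23-09-13</td></tr>
            <tr><td>Br</td><td>R72</td></tr>
          </tbody>
        </table>
      </td>
    </tr>
  </tbody>
</table>
"""

# read_html returns one DataFrame per <table> element it finds,
# in document order: the outer table first, then the nested one.
tables = pd.read_html(StringIO(HTML))

# The outer table's frame is the "mangled" one, because its cell text
# includes everything inside the nested table. The inner table is clean.
inner = tables[1]
print(inner)
```

If a page has several levels of nesting, it may be easier to print the shape of each entry in `tables` first and then pick the index you need.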
