How Can I Extract a Value from HTML?

I am building automation to process the email alerts we receive. The final step is to extract the username involved in the alert, and from my research it looked like this should be fairly easy to pull out of the original email's HTML. Below is a snippet of the HTML I am trying to extract the value from.

HTML Snippet:

<tr>
    <td style="border:solid #DBDCDC 1.0pt;padding:3.75pt 3.75pt 3.75pt 3.75pt">
        <p class="MsoNormal" align="right" style="text-align:right">
        <span style="font-size:9.0pt;color:black">source_username
        </span>
        <span style="font-size:9.0pt">
            <o:p></o:p>
        </span>
        </p>
    </td>
    <td width="100%" style="width:100.0%;border:solid #DBDCDC 1.0pt;border-left:none;background:#FAFAFA;padding:3.75pt 3.75pt 3.75pt 3.75pt;max-width:100%">
        <p class="MsoNormal">
        <span style="font-size:9.0pt;color:black">ServicePrincipal_64e90aaf-abe7-4fa8-b0f7-a56db5a780bc
        </span>
        <span style="font-size:9.0pt">
            <o:p></o:p>
        </span>
        </p>
    </td>
</tr>

I have made two attempts, one with Parsel and one with BeautifulSoup; neither worked.

Parsel attempt:

from parsel import Selector

# html holds the HTML body of the email
sel = Selector(text=html)

# Find the td tag that contains the 'source_username' string
source_tag = sel.xpath('//td[contains(.//text(), "source_username")]')[0]
print(source_tag)
# Extract the value from the tag
source_username = source_tag.xpath('./following-sibling::td[1]//text()').get().strip()

print(source_username)

BeautifulSoup attempt:

from bs4 import BeautifulSoup

# html holds the HTML body of the email
soup = BeautifulSoup(html, 'html.parser')

# Find the tag that contains the source_username
source_tag = soup.find('td', string='source_username')
print(source_tag)

# Extract the value from the tag
source_username = source_tag.find_next_sibling('td').text.strip()

print(source_username)

There are 2 best solutions below

MrXQ On

If you have access to the HTML, you can add a data attribute holding the desired value, something like this:


<span data-username="source_username" style="font-size:9.0pt;color:black">source_username</span>

Then read the value using JavaScript like this:

// Get the span element
const spanElement = document.querySelector('span[data-username]');

// Get the data-username attribute value
const username = spanElement.getAttribute('data-username');
console.log(username);


If you don't have access to the HTML, try something like this:

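// Find the cell whose text is the 'source_username' label
const labelCell = Array.from(document.querySelectorAll('td'))
    .find((td) => td.textContent.trim() === 'source_username');

// The value sits in the next cell of the same row
const username = labelCell.nextElementSibling.textContent.trim();

// Log the username to the console
console.log(username);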
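
The question concerns Python automation rather than a browser page, so a DOM lookup like the one above needs a Python counterpart. Below is a minimal sketch that uses only the standard library's `html.parser`. It assumes the label and the value sit in adjacent `<td>` cells of the same row, as in the snippet from the question. `RowValueParser` and `value_for` are hypothetical names chosen for this example, the sample HTML is a trimmed copy of the question's snippet with the inline styles removed, and nested tables are not handled.

```python
from html.parser import HTMLParser


class RowValueParser(HTMLParser):
    """Collect the text of each <td> cell, grouped by table row."""

    def __init__(self):
        super().__init__()
        self.rows = []      # one list of cell texts per <tr>
        self._cell = None   # text fragments of the <td> being read, if any

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self.rows.append([])
        elif tag == "td":
            self._cell = []

    def handle_data(self, data):
        # Only keep text that appears inside a <td>
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag == "td" and self._cell is not None:
            # Collapse runs of whitespace and trim the ends
            text = " ".join("".join(self._cell).split())
            if not self.rows:
                self.rows.append([])
            self.rows[-1].append(text)
            self._cell = None


def value_for(html, label):
    """Return the text of the cell that follows the cell equal to `label`."""
    parser = RowValueParser()
    parser.feed(html)
    parser.close()
    for cells in parser.rows:
        for i, text in enumerate(cells[:-1]):
            if text == label:
                return cells[i + 1]
    return None


# Trimmed copy of the question's snippet (inline styles removed)
html = """
<tr>
  <td><p class="MsoNormal"><span>source_username
  </span><span><o:p></o:p></span></p></td>
  <td><p class="MsoNormal"><span>ServicePrincipal_64e90aaf-abe7-4fa8-b0f7-a56db5a780bc
  </span><span><o:p></o:p></span></p></td>
</tr>
"""

print(value_for(html, "source_username"))
```

Matching on the whitespace-normalized cell text, rather than on a raw text node, keeps the lookup from being thrown off by the newlines and indentation that the email's markup places around each value.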
DRA On

You can do something like this:


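def contains_text(tag):
    # Match only <td> cells whose text includes the label
    return tag.name == 'td' and 'source_username' in tag.get_text(strip=True)

# find() accepts a function and returns the first tag for which it returns True
source_tag = soup.find(contains_text)
source_username = source_tag.find_next_sibling('td').text.strip()
print(source_username)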
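
As far as I can tell, the original `soup.find('td', string='source_username')` call returns `None` because `string=` is compared against the tag's `.string`, and `.string` is `None` for a tag with more than one child, which is the case for these `<td>` cells. The span text also carries a trailing newline and indentation, so an exact match would fail there too. A self-contained sketch of the approach above follows. `extract_value` is a hypothetical helper name, and the sample HTML is a trimmed copy of the question's snippet with the inline styles removed.

```python
from bs4 import BeautifulSoup

# Trimmed copy of the question's snippet (inline styles removed)
html = """
<tr>
  <td><p class="MsoNormal"><span>source_username
  </span><span><o:p></o:p></span></p></td>
  <td><p class="MsoNormal"><span>ServicePrincipal_64e90aaf-abe7-4fa8-b0f7-a56db5a780bc
  </span><span><o:p></o:p></span></p></td>
</tr>
"""


def extract_value(html, label):
    """Return the text of the <td> that follows the <td> whose text is `label`."""
    soup = BeautifulSoup(html, "html.parser")
    for cell in soup.find_all("td"):
        # get_text(strip=True) trims the whitespace around each text fragment
        if cell.get_text(strip=True) == label:
            value_cell = cell.find_next_sibling("td")
            if value_cell is not None:
                return value_cell.get_text(strip=True)
    return None


print(extract_value(html, "source_username"))
```

Comparing the whole cell text for equality, instead of using `in`, avoids matching an outer cell that merely contains the label somewhere inside it. Returning `None` when the label is missing lets the calling automation decide how to handle an alert without a username.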