Extract/search for wanted string/text within html tag using Python


I have an HTML file that consists of 400 HTML tags, and I want to extract some specific text from certain tags. It is a local file, not an online web page. For now I am testing with a single HTML file to confirm the logic; in the real requirement I will run it on a batch of more than 50 HTML files.

What I want to extract is any text that sits between these tags:

<div class="th-choice-list-name headerViewModeElementLoc th-choice-list-description-value">Text I wanted</div>

This tag may appear more than once in the HTML file.

I tried to extract the text from the file using this code:

import re

count = 0
# Opening tag whose inner text should be extracted
text = '<div class="th-choice-list-name headerViewModeElementLoc th-choice-list-description-value">'

# file_path is the path of the local HTML file
with open(file_path, 'r', encoding="utf8") as fp:
    lines = fp.readlines()

for line in lines:
    # Only look at lines that contain the opening tag
    if line.find(text) != -1:
        count = count + 1
        result = re.search('<div class="th-choice-list-name headerViewModeElementLoc th-choice-list-description-value">(.*)</div>', line)
        print(result.group(1))
        print(count)

My problems are:

  1. It only identifies the first occurrence found by line.find(...), not the later occurrences of the same tag.
  2. It can't extract the exact text I want because the closing </div> tag is repeated in the input file, so the match covers the whole stretch of HTML that starts at the first <div class="th-choice-list-name headerViewModeElementLoc ... and ends at the last </div> on the line.
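
Both symptoms can be reproduced with a short sketch. The line below is a shortened, hypothetical stand-in for the real input (the class name is abbreviated and the surrounding markup is invented for illustration). It shows that re.search returns only one match per line and that a greedy (.*) runs to the last </div>, while re.findall with a non-greedy (.*?) returns each text separately. This only holds as long as the target div contains no nested div.

```python
import re

# Shortened, hypothetical stand-in for one long line of the input file
line = (
    '<div class="x-btn"><div class="th-choice-list-name">Text I wanted 1</div>'
    '<em><button>Menu</button></em></div>'
    '<div class="th-choice-list-name">Text I wanted 2</div><div class="msg"></div>'
)

open_tag = '<div class="th-choice-list-name">'

# Greedy: (.*) extends to the LAST </div> on the line,
# and re.search stops after the first match it finds
greedy = re.search(re.escape(open_tag) + '(.*)</div>', line)

# Non-greedy: (.*?) stops at the FIRST </div> after each opening tag,
# and re.findall returns every non-overlapping match on the line
lazy = re.findall(re.escape(open_tag) + '(.*?)</div>', line)

print(greedy.group(1))  # far more than the wanted text
print(lazy)             # one entry per matching tag
```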

This is a simplified version of the HTML file used as input (the text I want is in bold):

  </div><div id="th-templateEditor-section17-header" class="th-section" componentid="17"><div id="th-17-button_submenu" class="x-btn button_submenu inline_div x-btn-default-small"><div class="th-choice-list-name headerViewModeElementLoc th-choice-list-description-value">**Text I wanted 1**</div><em id="th-17-button_submenu-btnWrap" class=""><button id="th-17-button_submenu-btnEl" type="button" hidefocus="true" role="button" autocomplete="off" title="Menu" class="x-btn-center" aria-label="Menu"><span id="th-17-button_submenu-btnInnerEl" class="x-btn-inner" style="">&nbsp;</span><span id="th-17-button_submenu-btnIconEl" class="x-btn-icon  x-hide-display">&nbsp;</span></button></em></div><div class="th-choice-list-name headerViewModeElementLoc th-choice-list-description-value">**Text I wanted 2**</div><div id="th-templateEditor-section17-header-invalid-message" class="invalidElementMessage">

There is 1 best solution below

Answered by tax evader

As suggested by @Barmar, I think you should use a third-party library such as BeautifulSoup to parse the HTML and find the tags that match given criteria with find_all(). It makes parsing and searching simpler than using regex.

from bs4 import BeautifulSoup

html = '''
<div id="th-templateEditor-section17-header" class="th-section" componentid="17"><div id="th-17-button_submenu" class="x-btn button_submenu inline_div x-btn-default-small"><div class="th-choice-list-name headerViewModeElementLoc th-choice-list-description-value">**Text I wanted 1**</div><em id="th-17-button_submenu-btnWrap" class=""><button id="th-17-button_submenu-btnEl" type="button" hidefocus="true" role="button" autocomplete="off" title="Menu" class="x-btn-center" aria-label="Menu"><span id="th-17-button_submenu-btnInnerEl" class="x-btn-inner" style="">&nbsp;</span><span id="th-17-button_submenu-btnIconEl" class="x-btn-icon  x-hide-display">&nbsp;</span></button></em></div><div class="th-choice-list-name headerViewModeElementLoc th-choice-list-description-value">**Text I wanted 2**</div><div id="th-templateEditor-section17-header-invalid-message" class="invalidElementMessage"></div>
'''

search_classes = 'th-choice-list-name headerViewModeElementLoc th-choice-list-description-value'.split(' ')

parsed_html = BeautifulSoup(html, "html.parser")
divs = parsed_html.find_all('div', {'class': search_classes})

for div in divs:
    print(div.text)
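
One caveat: when find_all() receives a list of class names, it matches a div that has any one of those classes, not necessarily all three. If other divs in the real files share one of the class names, a CSS selector passed to select() is a stricter option, since it requires every listed class. Below is a minimal sketch under that assumption, extended to the batch case from the question. The helper names extract_texts and extract_from_folder are made up for illustration and are not part of BeautifulSoup.

```python
from pathlib import Path
from bs4 import BeautifulSoup

# CSS selector: a div carrying ALL three classes (their order does not matter)
SELECTOR = ('div.th-choice-list-name'
            '.headerViewModeElementLoc'
            '.th-choice-list-description-value')

def extract_texts(html):
    """Return the text of every div that matches SELECTOR."""
    soup = BeautifulSoup(html, "html.parser")
    return [div.get_text(strip=True) for div in soup.select(SELECTOR)]

def extract_from_folder(folder):
    """Map each *.html file name in the folder to its extracted texts."""
    results = {}
    for path in sorted(Path(folder).glob("*.html")):
        results[path.name] = extract_texts(path.read_text(encoding="utf8"))
    return results

# Small inline check: the second div has only one of the three classes
sample = (
    '<div class="th-choice-list-name headerViewModeElementLoc '
    'th-choice-list-description-value">Text I wanted 1</div>'
    '<div class="th-choice-list-name">other text</div>'
)
print(extract_texts(sample))
```

For the batch of files, calling extract_from_folder("html_files") would return one list of texts per file, where "html_files" is a placeholder for whatever folder holds the local HTML files.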