Web scraping a page: can't find text content regardless of lowercase or uppercase

Question

Asked by Arturo on 18 August 2023 at 02:20

I've been trying to scrape a web page, but when I try to filter the results case-insensitively (so that a match doesn't depend on uppercase or lowercase), I can't get it to work.

import requests
from bs4 import BeautifulSoup
URL = "https://www.pemex.com/procura/procedimientos-de-contratacion/concursosabiertos/Paginas/Pemex-Transformación-Industrial.aspx"
page = requests.get(URL)
soup = BeautifulSoup(page.content, "html.parser")
results = soup.find(id="MSOZoneCell_WebPartWPQ4")


texto_licitacion = results.find_all("td", string=lambda text: "Bienes" in text.lower())

And I get these results:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "C:\Users\acast\AppData\Roaming\Python\Python311\site-packages\bs4\element.py", line 2030, in find_all
    return self._find_all(name, attrs, string, limit, generator,
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\acast\AppData\Roaming\Python\Python311\site-packages\bs4\element.py", line 841, in _find_all
    found = strainer.search(i)
            ^^^^^^^^^^^^^^^^^^
  File "C:\Users\acast\AppData\Roaming\Python\Python311\site-packages\bs4\element.py", line 2320, in search
    found = self.search_tag(markup)
            ^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\acast\AppData\Roaming\Python\Python311\site-packages\bs4\element.py", line 2291, in search_tag
    if found and self.string and not self._matches(found.string, self.string):
                                     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\acast\AppData\Roaming\Python\Python311\site-packages\bs4\element.py", line 2352, in _matches
    return match_against(markup)
           ^^^^^^^^^^^^^^^^^^^^^
  File "<stdin>", line 2, in <lambda>
AttributeError: 'NoneType' object has no attribute 'lower'

I already tried the same approach on another web page and it worked correctly, but on this one it fails.

There are 2 best solutions below

**Barmar** · Answer 1 · 2023-08-18T02:44:22.917000

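Some of the <td> elements have no single text string (for example, empty cells or cells with several child nodes), so BeautifulSoup passes None as text to your lambda. Check for that in your filter.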

You also need to check for bienes, since text.lower() can't have an uppercase B.

texto_licitacion = results.find_all("td", string=lambda text: text and "bienes" in text.lower())
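The filter logic can be checked on its own, without fetching the page. The sketch below applies the same guard to a handful of made-up sample values (the strings and the helper name `contains_bienes` are illustrative assumptions, not taken from the Pemex page), including the None that triggered the AttributeError:

```python
def contains_bienes(text):
    """Return True when text is a string containing "bienes", ignoring case."""
    # BeautifulSoup may pass None instead of a string, so rule that out
    # before calling .lower() on it.
    return text is not None and "bienes" in text.lower()


# Hypothetical cell contents, including a cell with no single string (None).
samples = ["Adquisición de Bienes", "BIENES muebles", "Servicios", None]

matches = [s for s in samples if contains_bienes(s)]
print(matches)  # ['Adquisición de Bienes', 'BIENES muebles']
```

A named function like this can be passed as `string=contains_bienes` in place of the lambda.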

**Howsikan** · Answer 2 · 2023-08-18T02:58:26.810000

As written, your lambda can never return True: the phrase you're looking for, "Bienes", has a capital "B", but you compare it against text.lower(), which contains only lowercase letters. Search for "bienes" instead. Note that find_all() calls the function you pass once per candidate element and keeps the elements for which it returns True.

If you want to find all the <td> elements that contain the word "bienes" (lowercase), the correct function call would be:

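def bienes_in_tag(tag):
    # Match <td> tags whose text contains "bienes", ignoring case.
    # get_text() always returns a string, so there is no None to guard against.
    return tag.name == "td" and "bienes" in tag.get_text().lower()


# Pass the function as the first argument so it is called with each Tag.
# (As a second positional argument it would be used as a CSS class filter.)
texto_licitacion = results.find_all(bienes_in_tag)
for tag in texto_licitacion:
    print(tag)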

Here, I defined a function that returns a boolean. BeautifulSoup accepts such functions as filters, so you can use one to find all the tags whose text contains 'bienes'.
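To see how a function filter behaves on a small document, here is a sketch that uses inline HTML rather than the Pemex page (the HTML snippet and the helper name `td_contains_bienes` are made up for illustration). A function passed as the first argument to find_all() receives each Tag, so it can check the tag name itself and use get_text(), which also covers cells whose text is split across nested tags:

```python
from bs4 import BeautifulSoup

# Hypothetical table: a plain cell, a cell with nested tags, an
# unrelated cell, and an empty cell.
html = (
    "<table><tr>"
    "<td>Adquisición de Bienes</td>"
    "<td><b>BIENES</b> y <i>servicios</i></td>"
    "<td>Obras</td>"
    "<td></td>"
    "</tr></table>"
)
soup = BeautifulSoup(html, "html.parser")

# .string is None for the nested cell and the empty cell, which is
# why a string= filter has to guard against None.
strings = [td.string for td in soup.find_all("td")]
print(strings)


def td_contains_bienes(tag):
    # Called once per Tag; get_text() joins all nested text.
    return tag.name == "td" and "bienes" in tag.get_text().lower()


texts = [td.get_text() for td in soup.find_all(td_contains_bienes)]
print(texts)
```

One difference from the string= approach: because get_text() includes text from descendants, the cell with nested tags is matched here, whereas a string= filter would skip it.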
