Web scraping a page: can't match text content regardless of lowercase or uppercase


I've been trying to scrape a web page, but when I try to filter the results case-insensitively (so that the match doesn't depend on uppercase or lowercase), I can't get it to work.

import requests
from bs4 import BeautifulSoup
URL = "https://www.pemex.com/procura/procedimientos-de-contratacion/concursosabiertos/Paginas/Pemex-Transformación-Industrial.aspx"
page = requests.get(URL)
soup = BeautifulSoup(page.content, "html.parser")
results = soup.find(id="MSOZoneCell_WebPartWPQ4")


texto_licitacion = results.find_all("td", string=lambda text: "Bienes" in text.lower())

And I get this error:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "C:\Users\acast\AppData\Roaming\Python\Python311\site-packages\bs4\element.py", line 2030, in find_all
    return self._find_all(name, attrs, string, limit, generator,
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\acast\AppData\Roaming\Python\Python311\site-packages\bs4\element.py", line 841, in _find_all
    found = strainer.search(i)
            ^^^^^^^^^^^^^^^^^^
  File "C:\Users\acast\AppData\Roaming\Python\Python311\site-packages\bs4\element.py", line 2320, in search
    found = self.search_tag(markup)
            ^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\acast\AppData\Roaming\Python\Python311\site-packages\bs4\element.py", line 2291, in search_tag
    if found and self.string and not self._matches(found.string, self.string):
                                     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\acast\AppData\Roaming\Python\Python311\site-packages\bs4\element.py", line 2352, in _matches
    return match_against(markup)
           ^^^^^^^^^^^^^^^^^^^^^
  File "<stdin>", line 2, in <lambda>
AttributeError: 'NoneType' object has no attribute 'lower'

I already tried the same approach on another web page and it worked correctly, but on this one it fails.


There are 2 best solutions below

Barmar On

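find_all() passes each tag's .string to your filter, and .string is None for <td> elements that have no text or that contain more than one child node. So text is sometimes None. Check for that in your filter.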

You also need to search for the lowercase "bienes", since text.lower() can never contain an uppercase B.

texto_licitacion = results.find_all("td", string=lambda text: text and "bienes" in text.lower())
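The filter can also be checked on its own, without fetching the page. This is only a minimal sketch: contains_bienes and the sample values are made up for illustration, standing in for the values that .string might return on the real page.

```python
def contains_bienes(text):
    """Case-insensitive check that tolerates text being None."""
    # Guard first: calling .lower() on None raises AttributeError.
    return text is not None and "bienes" in text.lower()


# Hypothetical stand-ins for tag.string values; None mimics a <td>
# that has no single text child.
samples = ["Adquisición de Bienes", "BIENES muebles", "Servicios", None]

matches = [s for s in samples if contains_bienes(s)]
print(matches)
```

A named function like this can be passed the same way as the lambda, as string=contains_bienes.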

Howsikan On

Apart from the error, your lambda function can never return True: text.lower() contains no uppercase letters, so the phrase you're looking for, "Bienes", with its capital letter "B", is never found in it. Even without the exception, find_all() would therefore return no matches.

If you want to find all the <td> elements that contain the word "bienes" in any letter case, you can pass a function to find_all():

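def bienes_in_tag(tag):
    # get_text() returns "" (never None) for a tag without text
    return tag.name == "td" and "bienes" in tag.get_text().lower()


# A function passed as the only filter is called once for every tag
texto_licitacion = results.find_all(bienes_in_tag)
for tag in texto_licitacion:
    print(tag)

Here, I defined a function that takes a tag and returns a boolean. BeautifulSoup calls it for every tag and returns the ones for which it returns True, in this case the <td> tags whose text contains 'bienes'.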