NoneType error when trying to access .text attribute of an existent <a> element


I am using BeautifulSoup to scrape the first wikitable on the Wikipedia page "List of military engagements during the Russian invasion of Ukraine" to get the names of all 57 battles. I have attached an image of the table's HTML for reference: HTML of the wikitable.

To grab all the <a> elements in the first column and get just the text (the battle names), I did the following:

import requests
from bs4 import BeautifulSoup

url = 'https://en.wikipedia.org/wiki/List_of_military_engagements_during_the_Russian_invasion_of_Ukraine'
page = requests.get(url)
soup = BeautifulSoup(page.text, 'lxml')
table = soup.find('table')
rows = table.find_all('tr')

battlenames = []
for row in rows:
    # Find the first <td> element within the row
    td_element = row.find('td')
    if td_element:
        # Find the first <a> element within the <td> element
        battlename = td_element.find('a')
        cleanname = battlename.text
        battlenames.append(cleanname)

for name in battlenames:
    print(name)

I ran this in both Spyder and Jupyter Notebook and got the following error:

AttributeError                            Traceback (most recent call last)
Cell In[6], line 18
     15     if td_element:
     16         # Find the first <a> element within the <td> element
     17         battlename = td_element.find('a')
---> 18         cleanname = battlename.text
     19         battlenames.append(cleanname)
     21 for name in battlenames:

AttributeError: 'NoneType' object has no attribute 'text'

This surprised me because the first <td> element of every row (<tr>) contains an <a> element with the battle name. I.e., there are no empty boxes in the table's first column that would cause a NoneType error. What could be the issue?

HedgeHog On BEST ANSWER

EDIT

To be more precise, following a comment from @Ouroboros1: the issue is exactly that some of the td elements your loop picks up do not contain an a.

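The table contains one "sub" tr for "Battles of Voznesensk", presumably because the battle-name cell of the row above spans both rows. In that sub-row, the first td is the one holding "9 March 2022" in the "Start date" column, so row.find('td') returns the date cell, and this td happens to contain no a link.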
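To see this effect without hitting the network, here is a minimal sketch on a made-up table. The HTML below is a hypothetical, simplified stand-in for the wikitable (the battle names and the rowspan layout are assumptions, not copied from the real page), and it uses the built-in html.parser instead of lxml to keep the example dependency-light.

```python
from bs4 import BeautifulSoup

# Hypothetical, simplified stand-in for the wikitable (not the real page HTML).
html = """
<table>
  <tr><th>Battle</th><th>Start date</th></tr>
  <tr><td><a href="/wiki/A">Battle A</a></td><td>24 February 2022</td></tr>
  <tr><td rowspan="2"><a href="/wiki/B">Battles of B</a></td><td>2 March 2022</td></tr>
  <tr><td>9 March 2022</td></tr>
</table>
"""

soup = BeautifulSoup(html, "html.parser")

# For each row, record the text of its first <td> and whether it holds a link.
first_cells = []
for row in soup.table.find_all("tr"):
    td = row.find("td")
    if td is None:
        continue  # the header row only has <th> cells
    link = td.find("a")
    first_cells.append((td.get_text(strip=True), link is not None))

for text, has_link in first_cells:
    print(f"{text!r}: has link = {has_link}")
```

The last row's first td is the date cell, so td.find('a') returns None there, which is what triggers the AttributeError in the original loop.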

So you also have to check that there is an a before calling .text:

if td_element:
    # Find the first <a> element within the <td> element
    battlename = td_element.find('a')
    # Check here that an <a> was actually found
    if battlename:
        cleanname = battlename.text
        battlenames.append(cleanname)

You could also change your selection strategy and use CSS selectors to select only the tr elements whose first td contains an a:

soup.table.select('tr:has(td:first-of-type a)')

or even select all a elements in the first td of each tr directly:

soup.table.select('tr td:first-of-type a')

Example with CSS selectors

import requests
from bs4 import BeautifulSoup

url = 'https://en.wikipedia.org/wiki/List_of_military_engagements_during_the_Russian_invasion_of_Ukraine'
page = requests.get(url)
soup = BeautifulSoup(page.text, 'lxml')

# Option A: rows whose first <td> contains a link
for row in soup.table.select('tr:has(td:first-of-type a)'):
    print(row.td.a.text)

# Option B: every link inside the first <td> of each row
for a in soup.table.select('tr td:first-of-type a'):
    print(a.text)
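The two options are not fully equivalent when a first cell holds more than one link. The sketch below illustrates this on a hypothetical sample table (the HTML, including the footnote link, is made up for illustration and is not the real page). The third variant, with a child combinator, is an assumption that the name link is a direct child of the td, which may not hold on the real page.

```python
from bs4 import BeautifulSoup

# Hypothetical sample: the first name cell also carries a footnote link.
html = """
<table>
  <tr><th>Battle</th><th>Start date</th></tr>
  <tr><td><a href="/wiki/A">Battle A</a><sup><a href="#n1">[1]</a></sup></td><td>24 February 2022</td></tr>
  <tr><td rowspan="2"><a href="/wiki/B">Battles of B</a></td><td>2 March 2022</td></tr>
  <tr><td>9 March 2022</td></tr>
</table>
"""

soup = BeautifulSoup(html, "html.parser")

# Option A: one result per matching row (first link of the first <td>).
option_a = [row.td.a.get_text()
            for row in soup.table.select("tr:has(td:first-of-type a)")]

# Option B: every link in the first <td>, so the footnote link is included.
option_b = [a.get_text()
            for a in soup.table.select("tr td:first-of-type a")]

# Variant of B (assumption): only links that are direct children of the <td>.
option_b_direct = [a.get_text()
                   for a in soup.table.select("tr td:first-of-type > a")]

print(option_a)
print(option_b)
print(option_b_direct)
```

If you want exactly one name per battle, Option A (or the explicit None check) is the safer choice; Option B is convenient only when each first cell is known to contain a single link.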