AttributeError: 'NoneType' object has no attribute 'text' trying to scrape data in Fbref

86 Views Asked by At

I am doing a project that involves scraping some data from FBref, and im using a code i found on git that does exactly what i want. This is the code:

def get_tables(url):
    res = requests.get(url)
    comm = re.compile("<!--|-->")
    soup = BeautifulSoup(comm.sub("",res.text),'lxml')
    all_tables = soup.findAll("tbody")
    team_table = all_tables[0]
    player_table = all_tables[1]
    return player_table, team_table

def get_frame(features, player_table):
    pre_df_player = dict()
    features_wanted_player = features
    rows_player = player_table.find_all('tr')
    for row in rows_player:
        if(row.find('th',{"scope":"row"}) != None):
    
            for f in features_wanted_player:
                cell = row.find("td",{"data-stat": f})
                a = cell.text.strip().encode()
                text=a.decode("utf-8")
                if(text == ''):
                    text = '0'
                if((f!='player')&(f!='nationality')&(f!='position')&(f!='squad')&(f!='age')&(f!='birth_year')):
                    text = float(text.replace(',',''))
                if f in pre_df_player:
                    pre_df_player[f].append(text)
                else:
                    pre_df_player[f] = [text]
    df_player = pd.DataFrame.from_dict(pre_df_player)
    return df_player

def frame_for_category(category,top,end,features):
    url = (top + category + end)
    player_table, team_table = get_tables(url)
    df_player = get_frame(features, player_table)
    return df_player


def get_outfield_data(top, end):
    df1 = frame_for_category('stats',top,end,stats)
...
    return df

df_2018 = get_outfield_data('https://fbref.com/en/comps/Big5/2017-2018/','/players/2017-2018-Big-5-European-Leagues-Stats')
df_2018["player"] = df_2018["player"] + ', 2017-18'
...
df.head()

the problem comes in the 'get_frame' function, as you can see in here:


Cell In\[19\], line 2
1 player_table, team_table = get_tables('https://fbref.com/en/comps/Big5/2017-2018/stats/players/2017-2018-Big-5-European-Leagues-Stats')
\----\> 2 df_player = get_frame(stats, player_table)
3 pintf(df_player)

Cell In\[18\], line 21, in get_frame(features, player_table)
19 for f in features_wanted_player:
20     cell = row.find("td",{"data-stat": f})
\---\> 21     a = cell.text.strip().encode()
22     text=a.decode("utf-8")
23     if(text == ''):

AttributeError: 'NoneType' object has no attribute 'text'

I believe it has to be some part of the html data that is failing in the proccess, but i cant seem to find what it is.

I resumed the code, but you can find it in this git page: https://github.com/victorballesteros8/fbref-futbol/tree/main

I tried using the code in an IDE and debugging it to see what was causing the problem, but with that much data being scraped, it is difficult to find it.

1

There are 1 best solutions below

0
Luca On

This is meant to be more a comment than an answer, but I do not have enough reputation to comment (yet!).

You could try to wrap the cell throwing the error in a statement as

if cell is None:
    print(row)
else:
   # do the rest

If you can scrape any data, thus if the problem is related to few (maybe even one) specific entry of the table. You can then inspect the guilty lines to have a more precise idea of the problem, and maybe think about a better fix than just skipping them. If you cannot scrape any data, maybe the data has a format that is different from the one that the program expects... the Github record you used is poorly documented and last committed 1 and a half year ago, thus things might have changed.

Hope this helps!