I want to scrape the "key-statistics" tab from Yahoo Finance. The HTML page contains multiple tables, which I scraped using Beautiful Soup. Each table contains only 2 columns, and I managed to scrape them using both the HTML tags "table", "td" and "tr" and Pandas' "read_html" function.
The tables are concatenated into a single dataframe using this code:
import requests
import pandas as pd
from bs4 import BeautifulSoup

response = requests.get(url, headers={'user-agent': 'custom'})
soup = BeautifulSoup(response.content, 'html.parser')

key_stats = pd.DataFrame(columns=["indicator", "value"])
tables = pd.read_html(str(soup))  # parse every <table> on the page
for table in tables:
    table.columns = ['indicator', 'value']
    key_stats = pd.concat([key_stats, table], axis=0)
key_stats = key_stats.set_index("indicator")
The code works perfectly when using a small list of stocks; however, when running the same code on a large list (5,665 stocks), the following error occurs:
ValueError: Length mismatch: Expected axis has 1 elements, new values have 2 elements
This error appears randomly on certain stocks, and it implies that the scraped tables contain only 1 column, which is not true.
The most confusing part is that the code works fine when re-executed with the same stocks that generated the error.
I cannot figure out what is causing this issue. Could anyone help me with that?
As commented, without seeing the HTML data, all we know is that you get inconsistencies in the data being scraped from Yahoo Finance's "key-statistics" tab for many stocks. That means you need to implement some strategies to be more robust in the face of those inconsistencies:
Before concatenating each table to the key_stats DataFrame, validate that it indeed has the expected two columns; that can be done by checking the shape of the DataFrame. You can also wrap the parsing in a try/except block that catches the ValueError, which lets you log or investigate the problematic stocks without interrupting the entire scraping process.
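For instance, here is a minimal sketch of that approach (it reuses soup from the question's code, and the print() calls are placeholders for whatever logging you prefer):

import pandas as pd

key_stats = pd.DataFrame(columns=["indicator", "value"])
try:
    tables = pd.read_html(str(soup))  # soup comes from the question's code
    for table in tables:
        # Validate the expected two-column shape before renaming
        if table.shape[1] != 2:
            print(f"Skipping a table with {table.shape[1]} column(s)")
            continue
        table.columns = ['indicator', 'value']
        key_stats = pd.concat([key_stats, table], axis=0)
    key_stats = key_stats.set_index("indicator")
except ValueError as e:
    # read_html raises ValueError when no tables are found at all
    # (e.g. on an error page), so log the stock and move on
    print(f"Could not parse tables for this stock: {e}")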
That way, your code checks that each table has exactly two columns before attempting to rename its columns and concatenate it. It also catches any ValueError that might occur during the process, allowing you to log or handle it as needed, so you are better equipped to deal with variability in the data being scraped.
As for the 503 errors, Yahoo Finance's servers are likely imposing limitations to prevent extensive scraping activity. A 503 Service Unavailable error typically indicates that the server is temporarily unable to handle the request due to overload or maintenance, which is common in web scraping scenarios involving a large number of requests, such as your case with 5,665 stocks.
To mitigate that, you can consider:
- adding delays between requests, for example with the time.sleep() function in Python: although this will increase the total scraping time, it reduces the likelihood of being blocked by the server.
- breaking the list down into smaller batches ("batch processing"): that not only mitigates server load but also provides natural checkpoints for your scraping process, allowing for easier recovery and troubleshooting.
- rotating user agents and using proxies: this can help mimic the behavior of multiple distinct users accessing the website from different locations, further reducing the likelihood of being blocked.
- using an official API where available: some websites provide APIs with higher request limits for registered or authenticated users. Check whether Yahoo Finance offers a formal API that could serve your purposes, potentially providing a more reliable and server-friendly way to access the data.
- implementing logic to detect 503 responses and retrying the request after a delay: exponential backoff strategies can be particularly effective, gradually increasing the delay after each failed attempt up to a certain threshold. See "Python Requests: Retry Failed Requests (2024)" for more. A sketch combining several of these ideas follows this list.
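As a rough illustration of how the delays, batching, and retry-with-backoff ideas could fit together, here is a sketch; the batch size, delay, retry parameters, the tickers list, and the URL pattern are illustrative assumptions, not tested values:

import time
import requests
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry

# Session that automatically retries on 503 responses, waiting
# exponentially longer between attempts (exponential backoff)
session = requests.Session()
retries = Retry(total=5, backoff_factor=0.5, status_forcelist=[503])
session.mount("https://", HTTPAdapter(max_retries=retries))

BATCH_SIZE = 50  # illustrative batch size
DELAY = 2        # illustrative pause (seconds) between requests

tickers = [...]  # your full list of stock symbols

for start in range(0, len(tickers), BATCH_SIZE):
    batch = tickers[start:start + BATCH_SIZE]
    for ticker in batch:
        # Assumed URL pattern for the key-statistics tab
        url = f"https://finance.yahoo.com/quote/{ticker}/key-statistics"
        response = session.get(url, headers={'user-agent': 'custom'})
        # ... parse response.content with BeautifulSoup as before ...
        time.sleep(DELAY)  # pause between requests to reduce server load
    # Each finished batch is a natural checkpoint, e.g. to save partial results
    print(f"Finished batch {start // BATCH_SIZE + 1}")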