Problem with scraping a large list of stock data from Yahoo Finance


I want to scrape the "key-statistics" tab from Yahoo Finance. The HTML page contains multiple tables that I scraped with Beautiful Soup. Each table has only 2 columns, and I managed to scrape them both via the HTML tags ("table", "td" and "tr") and via Pandas' "read_html" function.

The tables are concatenated into a single DataFrame using this code:

import requests
import pandas as pd
from bs4 import BeautifulSoup

# url points at the stock's key-statistics page
response = requests.get(url, headers={'user-agent': 'custom'})
soup = BeautifulSoup(response.content, 'html.parser')
key_stats = pd.DataFrame(columns=["indicator", "value"])
tables = pd.read_html(str(soup))

for table in tables:
    table.columns = ['indicator', 'value']
    key_stats = pd.concat([key_stats, table], axis=0)

key_stats = key_stats.set_index("indicator")

The code works perfectly with a small list of stocks; however, when running the same code on a large list (5,665 stocks), the following error occurs.

 ValueError: Length mismatch: Expected axis has 1 elements, new values have 2 elements

This error appears randomly on certain stocks, and it implies that the scraped tables contain only 1 column, which is not true.

The most confusing part is that the code works fine when re-executed on the same stocks that generated the error.

I cannot figure out what is causing this issue; could anyone help me with that?

Best answer (VonC):

As commented, without seeing the HTML data, all we know is that you are getting inconsistencies in the data scraped from Yahoo Finance's "key-statistics" tab for many stocks. That means you need to implement some strategies to make your code more robust in the face of those inconsistencies:

  • Before concatenating the tables into your key_stats DataFrame, validate that each table indeed has the expected two columns. That can be done by checking the shape of the DataFrame.
  • Implement a try-except block to catch and handle the ValueError. That will allow you to log or investigate the problematic stocks without interrupting the entire scraping process.
  • Enhance your scraping logic to handle cases where the page layout might differ or where certain expected tables are absent.

For instance:

import requests
from bs4 import BeautifulSoup
import pandas as pd

url = "your_yahoo_finance_url"
response = requests.get(url, headers={'user-agent': 'custom'})
soup = BeautifulSoup(response.content, 'html.parser')
key_stats = pd.DataFrame(columns=["indicator", "value"])
tables = pd.read_html(str(soup))

for table in tables:
    try:
        # Validate the table structure
        if table.shape[1] == 2:
            table.columns = ['indicator', 'value']
            key_stats = pd.concat([key_stats, table], axis=0)
        else:
            print("Skipped a table with unexpected format.")
    except ValueError as e:
        print(f"Error processing a table: {e}")

key_stats = key_stats.set_index("indicator")

That way, your code checks that each table has exactly two columns before attempting to rename the columns and concatenate it.
It also catches any ValueError that might occur during the process, allowing you to log or handle it as needed, which gives you a better handle on the variability of the scraped data.


Follow-up from the OP:

The problem actually was a 503 error from the server. I have a list of 5,800 stocks, so I guess scraping all this data is being blocked by Yahoo's servers.

So Yahoo Finance's servers are most likely imposing limitations to prevent extensive scraping activity. A 503 Service Unavailable error typically indicates that the server is temporarily unable to handle the request, due to overload or maintenance, which is common in web-scraping scenarios involving a large number of requests, such as your case with 5,800 stocks.

To mitigate that, you can consider:

  • introducing delays ("throttling") between consecutive requests to reduce the load on Yahoo Finance's servers. That can be achieved with the time.sleep() function in Python. Although this will increase the total scraping time, it reduces the likelihood of being blocked by the server.
import time

# Example delay of 1 second between requests
time.sleep(1)
  • breaking the list down into smaller batches ("batch processing"): that not only mitigates server load but also provides natural checkpoints for your scraping process, allowing for easier recovery and troubleshooting (see the sketch after this list).

  • rotating user-agents and using proxies: that can help mimic the behavior of multiple, distinct users accessing the website from different locations, further reducing the likelihood of being blocked (also illustrated in the sketch below).
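
For illustration, here is a minimal sketch of batch processing combined with a rotating user-agent. The user-agent strings, the batch size, the pause durations and the helper names (scrape_one, scrape_in_batches) are placeholders to adapt to your setup, not part of the original code:

import random
import time

import pandas as pd
import requests
from bs4 import BeautifulSoup

# Hypothetical pool of user-agent strings to rotate through (placeholders)
USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64)",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7)",
    "Mozilla/5.0 (X11; Linux x86_64)",
]

def scrape_one(url):
    """Scrape the two-column tables of a single key-statistics page."""
    headers = {'user-agent': random.choice(USER_AGENTS)}  # rotate user-agents
    # a proxies={'https': ...} argument could also be passed here if you use proxies
    response = requests.get(url, headers=headers)
    response.raise_for_status()
    soup = BeautifulSoup(response.content, 'html.parser')
    frames = []
    for table in pd.read_html(str(soup)):
        if table.shape[1] == 2:  # keep only the expected two-column tables
            table.columns = ['indicator', 'value']
            frames.append(table)
    return pd.concat(frames, axis=0) if frames else pd.DataFrame(columns=['indicator', 'value'])

def scrape_in_batches(stock_urls, batch_size=100, pause=60):
    """Process the URL list in batches, pausing between batches."""
    results = []
    for i in range(0, len(stock_urls), batch_size):
        batch = stock_urls[i:i + batch_size]
        for url in batch:
            try:
                results.append(scrape_one(url))
            except Exception as e:
                print(f"Skipping {url}: {e}")
            time.sleep(1)   # throttle individual requests
        print(f"Finished batch {i // batch_size + 1}, pausing...")
        time.sleep(pause)   # natural checkpoint between batches
    return pd.concat(results, axis=0) if results else pd.DataFrame()

Saving each batch's result to disk before the pause gives you a recovery point if the run is interrupted partway through the 5,800 stocks.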

Some websites provide APIs with higher request limits for registered or authenticated users. Check if Yahoo Finance offers a formal API that could be used for your purposes, potentially providing a more reliable and server-friendly way to access the data.

Do implement logic to detect 503 responses and retry the request after a delay. Exponential backoff strategies can be particularly effective, gradually increasing the delay after each failed attempt up to a certain threshold. See "Python Requests: Retry Failed Requests (2024)" for more.
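
As a sketch of that idea, requests can delegate the retry-and-backoff logic to urllib3 through a Session; the retry count, backoff factor and status codes below are illustrative values, not a prescription:

import requests
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry

# Retry transient errors automatically, with exponential backoff between attempts
retry_strategy = Retry(
    total=5,
    backoff_factor=2,
    status_forcelist=[429, 500, 502, 503, 504],
)
session = requests.Session()
session.mount("https://", HTTPAdapter(max_retries=retry_strategy))
session.mount("http://", HTTPAdapter(max_retries=retry_strategy))

# Then use session.get() instead of requests.get() in the scraping loop:
# response = session.get(stock_url, headers={'user-agent': 'custom'})

A fuller loop that combines throttling with an explicit 503 check and retry per request is sketched below: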

import requests
from bs4 import BeautifulSoup
import pandas as pd
import time

def fetch_stock_data(stock_list, max_retries=3):
    key_stats_combined = pd.DataFrame(columns=["indicator", "value"])
    for stock_url in stock_list:
        try:
            # Retry on 503 with an increasing delay (simple exponential backoff)
            for attempt in range(max_retries):
                response = requests.get(stock_url, headers={'user-agent': 'custom'})
                if response.status_code != 503:
                    break
                wait = 5 * (2 ** attempt)  # 5s, 10s, 20s, ...
                print(f"503 Error for {stock_url}, retrying in {wait}s...")
                time.sleep(wait)
            else:
                print(f"Giving up on {stock_url} after {max_retries} attempts")
                continue
            soup = BeautifulSoup(response.content, 'html.parser')
            tables = pd.read_html(str(soup))
            for table in tables:
                if table.shape[1] == 2:  # keep only the expected two-column tables
                    table.columns = ['indicator', 'value']
                    key_stats_combined = pd.concat([key_stats_combined, table], axis=0)
        except Exception as e:
            print(f"Error processing stock {stock_url}: {e}")
        time.sleep(1)  # Throttle requests to avoid hitting server limits
    key_stats_combined = key_stats_combined.set_index("indicator")
    return key_stats_combined

# Assume 'stock_list' is a list of URLs to scrape
# stock_list = ["url1", "url2", ...]
# key_stats = fetch_stock_data(stock_list)