I have a list of insurers in Spain, organized into 24 rubrics, on a website. See the following:
Insurers, Spain, the full list: https://www.unespa.es/en/directory
It is divided into 24 pages: https://www.unespa.es/en/directory/#A to https://www.unespa.es/en/directory/#Z
The idea, what is aimed for: I want to fetch the data from the pages with BS4 and requests, and finally save it into a DataFrame. The task of scraping the list from the website using BeautifulSoup (BS4) and requests in Python seems appropriate; I think we need to take the following steps:
a. First, import the necessary libraries: BeautifulSoup, requests, and pandas.
b. Use the requests library to get the HTML content of each of the pages of interest, i.e. the A to Z pages.
c. Parse the HTML content with BeautifulSoup.
d. Extract the relevant information (the insurers' names) from the parsed HTML.
e. Store the extracted data in a pandas DataFrame.
But this does not work, not even for the iteration from A to Z:
import requests
from bs4 import BeautifulSoup
import pandas as pd
# Function to scrape insurers from a given URL
def scrape_insurers(url):
    response = requests.get(url)
    if response.status_code == 200:
        soup = BeautifulSoup(response.content, 'html.parser')
        # Extracting insurer names
        insurers = [insurer.text.strip() for insurer in soup.find_all('h3')]
        return insurers
    else:
        print("Failed to retrieve data from", url)
        return []
# Define the base URL
base_url = "https://www.unespa.es/en/directory/"
# List to store all insurers
all_insurers = []
# Loop through each page (A to Z)
for char in range(65, 91):  # ASCII codes for A to Z
    page_url = f"{base_url}#{chr(char)}"
    insurers = scrape_insurers(page_url)
    all_insurers.extend(insurers)
# Convert the list of insurers to a pandas DataFrame
df = pd.DataFrame({'Insurer': all_insurers})
# Display the DataFrame
print(df.head())
# Save DataFrame to a CSV file
df.to_csv('insurers_spain.csv', index=False)
It fails with the following output:
Failed to retrieve data from https://www.unespa.es/en/directory/#A
Failed to retrieve data from https://www.unespa.es/en/directory/#B
Failed to retrieve data from https://www.unespa.es/en/directory/#C
Failed to retrieve data from https://www.unespa.es/en/directory/#D
Failed to retrieve data from https://www.unespa.es/en/directory/#E
and so forth.
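Before simplifying, one hypothesis worth noting (an assumption on my part, not yet verified against the site): the #A part of the URL is a fragment, which the browser resolves locally and which requests never sends to the server, so all 26 URLs hit the exact same page; and the non-200 status suggests the server may reject requests that lack a browser-like User-Agent header. A minimal probe, where the User-Agent string is just an example value:

import requests

# The fragment (#A) is stripped client-side, so every letter is
# effectively the same request; probe the bare URL once.
url = "https://www.unespa.es/en/directory"

# Hypothetical browser-like header; any common User-Agent string would do.
headers = {"User-Agent": "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36"}

plain = requests.get(url)
with_ua = requests.get(url, headers=headers)

print("without User-Agent:", plain.status_code)
print("with User-Agent:   ", with_ua.status_code)
print("response size:     ", len(with_ua.text))

If the second request returns 200 while the first does not, the missing header is the culprit rather than the scraping logic.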
Well, I think it is easier to reduce the complexity first. It is better to take one single URL and test what results we get back with our request. After that is done, I can evaluate the response; then I can use the BeautifulSoup library to check for specific fields. In other words, I should avoid doing three things (each of which can obviously go terribly wrong) in one step.
So I do it like this for the first character, A:
import requests
from bs4 import BeautifulSoup
# Function to scrape insurers from a given URL
def scrape_insurers(url):
    response = requests.get(url)
    if response.status_code == 200:
        soup = BeautifulSoup(response.content, 'html.parser')
        # Extracting insurer names
        insurers = [insurer.text.strip() for insurer in soup.find_all('h3')]
        return insurers
    else:
        print("Failed to retrieve data from", url)
        return []
# Define the base URL
base_url = "https://www.unespa.es/en/directory/#"
# Define the character we want to fetch data for
char = 'A'
# Construct the URL for the specified character
url = base_url + char
# Fetch and print data for the specified character
insurers_char = scrape_insurers(url)
print(f"Insurers for character '{char}':")
print(insurers_char)
But see the output here:
Failed to retrieve data from https://www.unespa.es/en/directory/#A
Insurers for character 'A':
[]
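To see what the server actually returned rather than just the status code, I could dump the raw response and count a few candidate tags; a sketch building on the User-Agent assumption above (the right tag or class for the insurer names still has to be confirmed in the browser dev tools):

import requests
from bs4 import BeautifulSoup

url = "https://www.unespa.es/en/directory"
# Same hypothetical browser-like header as in the probe above.
headers = {"User-Agent": "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36"}

response = requests.get(url, headers=headers)
print("status:", response.status_code)
print(response.text[:500])  # first 500 characters of the raw HTML

soup = BeautifulSoup(response.text, 'html.parser')
# Count a few candidate tags to see where the names might live.
for tag in ('h2', 'h3', 'li', 'a'):
    print(tag, len(soup.find_all(tag)))

If the directory turns out to be one page with 24 alphabetical sections, a single fetch plus grouping by section would replace the A-to-Z loop entirely; that depends on the markup, which I have not confirmed yet.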