403 Forbidden Error when Downloading XML.GZ File using Polite Package in R

I am trying to download a file from a URL using the polite package in R. Here is the code I am using:

library(polite)

# URL of the file to download
eprice_xml_products_1 <- "https://www.eprice.it/sitemap/https/Sitemap_Elettrodomestici_1.xml.gz"

# Create a polite session
session <- bow(eprice_xml_products_1)

# Download the file using rip function
file_path <- rip(session, destfile = "xml_1.gz")

print(file_path)

I have also tried this pipeline:

    bow(eprice_xml_products_1) %>%
      nod("https://www.eprice.it/sitemap/https/Sitemap_Elettrodomestici_1.xml.gz") %>%
      rip()

But I get this error:


    trying URL 'https://www.eprice.it/sitemap/https/Sitemap_Elettrodomestici_1.xml.gz'
    Error in fun(url = "https://www.eprice.it/sitemap/https/Sitemap_Elettrodomestici_1.xml.gz",  : 
      cannot open URL 'https://www.eprice.it/sitemap/https/Sitemap_Elettrodomestici_1.xml.gz'
    In addition: Warning messages:
    1: In fun(url = "https://www.eprice.it/sitemap/https/Sitemap_Elettrodomestici_1.xml.gz",  :
      downloaded length 0 != reported length 334
    2: In fun(url = "https://www.eprice.it/sitemap/https/Sitemap_Elettrodomestici_1.xml.gz",  :
      cannot open URL 'https://www.eprice.it/sitemap/https/Sitemap_Elettrodomestici_1.xml.gz': HTTP status was '403 Forbidden'

If I just open the link in my browser, the download starts immediately.

What am I missing?

Best answer, by Till:

The site rejects requests for this URL when the User-Agent request header does not belong to a regular browser (Firefox, Chrome, ...). Your browser sends such a header, which is why the same link works there while the request from R gets a 403. To make this work, you can change your user agent to that of a browser. Below is an example that works with utils::download.file(). A similar strategy might be available for polite.

# Set the user agent to a browser's (here: Firefox 115 on macOS)
options(HTTPUserAgent = "Mozilla/5.0 (Macintosh; Intel Mac OS X 10.15; rv:109.0) Gecko/20100101 Firefox/115.0")
download.file("https://www.eprice.it/sitemap/https/Sitemap_Elettrodomestici_1.xml.gz", "Sitemap_Elettrodomestici_1.xml.gz")

# Load XML from file
library(xml2)
read_xml("Sitemap_Elettrodomestici_1.xml.gz")
#> {xml_document}
#> <urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
#>  [1] <url>\n  <loc>https://www.eprice.it/3%2De%2D4%2DPorte%2DHAIER%2DFrigorif ...
#>  [2] <url>\n  <loc>https://www.eprice.it/3%2De%2D4%2DPorte%2DHAIER%2DFrigorif ...
#>  [3] <url>\n  <loc>https://www.eprice.it/3%2De%2D4%2DPorte%2DHAIER%2DFrigorif ...
#>  [4] <url>\n  <loc>https://www.eprice.it/3%2De%2D4%2DPorte%2DHAIER%2DHaier%2D ...
#>  [5] <url>\n  <loc>https://www.eprice.it/3%2De%2D4%2DPorte%2DMIDEA%2DFrigorif ...
#>  [6] <url>\n  <loc>https://www.eprice.it/Accessori%2DFrigoriferi%2DELECTROLUX ...
#>  [7] <url>\n  <loc>https://www.eprice.it/accessori%2DIMPERIA/d%2D1597166</loc ...
#>  [8] <url>\n  <loc>https://www.eprice.it/accessori%2DIMPERIA/d%2D2489361</loc ...
#>  [9] <url>\n  <loc>https://www.eprice.it/accessori%2Dincasso%2DDe%20Longhi/d% ...
#> [10] <url>\n  <loc>https://www.eprice.it/accessori%2Dincasso%2DELECTROLUX/d%2 ...
#> [11] <url>\n  <loc>https://www.eprice.it/accessori%2Dincasso%2DELECTROLUX/d%2 ...
#> [12] <url>\n  <loc>https://www.eprice.it/accessori%2Dincasso%2DELUX%20INC/d%2 ...
#> [13] <url>\n  <loc>https://www.eprice.it/accessori%2DKENWOOD/d%2D5551714</loc ...
#> [14] <url>\n  <loc>https://www.eprice.it/accessori%2DKENWOOD/d%2D7625838</loc ...
#> [15] <url>\n  <loc>https://www.eprice.it/accessori%2DKitchenAid/d%2D50118434< ...
#> [16] <url>\n  <loc>https://www.eprice.it/Accessori%2Dmacchine%2Dcaffe%2DBIA%2 ...
#> [17] <url>\n  <loc>https://www.eprice.it/Accessori%2Dmacchine%2Dcaffe%2DBIA%2 ...
#> [18] <url>\n  <loc>https://www.eprice.it/Accessori%2Dmacchine%2Dcaffe%2DBIA%2 ...
#> [19] <url>\n  <loc>https://www.eprice.it/accessori%2Dmacchine%2Dcaffe%2DDE%20 ...
#> [20] <url>\n  <loc>https://www.eprice.it/accessori%2Dmacchine%2Dcaffe%2DDE%20 ...
#> ...
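
For polite itself, bow() accepts a user_agent argument, so the same idea can probably be applied without leaving the package. What follows is a minimal sketch that I have not run against this site. rip_as_browser() is a hypothetical helper name (not part of polite), and the sketch assumes that rip() sends the user agent stored in the session and that robots.txt still permits the path for that agent.

```r
# Browser-like user agent string, copied from the download.file() example.
browser_ua <- "Mozilla/5.0 (Macintosh; Intel Mac OS X 10.15; rv:109.0) Gecko/20100101 Firefox/115.0"

# Hypothetical helper (not part of polite): introduce the session with a
# browser user agent via bow(), then download the file with rip().
rip_as_browser <- function(url, destfile, user_agent = browser_ua) {
  session <- polite::bow(url, user_agent = user_agent)
  polite::rip(session, destfile = destfile)
}

# Usage (needs network access and the polite package installed):
# file_path <- rip_as_browser(
#   "https://www.eprice.it/sitemap/https/Sitemap_Elettrodomestici_1.xml.gz",
#   destfile = "Sitemap_Elettrodomestici_1.xml.gz"
# )
# print(file_path)
```

As in the question, rip() should return the path of the saved file. As far as I know it writes into tempdir() unless a path argument is supplied, so use the returned path rather than assuming the file is in the working directory.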