Why am I getting a 403 error when connecting to a URL that works?


I am trying to pull the quarter end dates for a company from the SEC government website. For some reason I keep getting a connection error. The code is working for my friend who is in the US, but not for me in Canada. I tried using a VPN, but was still getting the same error. Here is the code and the error that I was getting.

When I put the URL into Google, it brings me to the page with all the information, so I am not sure why I can't pull it into R.

library(derivmkts)
library(quantmod)
library(jsonlite)
library(tidyverse)

url <- "https://data.sec.gov/submissions/CIK0000320193.json"
df <- fromJSON(url, flatten = TRUE)

Error in open.connection(con, "rb") : 
  cannot open the connection to 'https://data.sec.gov/submissions/CIK0000320193.json'
In addition: Warning message:
In open.connection(con, "rb") :
  cannot open URL 'https://data.sec.gov/submissions/CIK0000320193.json': HTTP status was '403 Forbidden'

I am not expecting a 403 error when connecting to this URL.


There is 1 best solution below

margusl On

The SEC asks you to declare a user agent in the request headers: https://www.sec.gov/os/accessing-edgar-data

Apparently the user agent given as an example on that page is also accepted, though you really should provide your own contact details there.

Here is the request with httr2, which still uses jsonlite for parsing JSON responses:

library(httr2)

resp <- request("https://data.sec.gov/submissions/CIK0000320193.json") |>
  req_user_agent("Sample Company Name AdminContact@<sample company domain>.com") |>
  # set verbosity level for debugging, 1: show headers
  req_perform(verbosity = 1)
#> -> GET /submissions/CIK0000320193.json HTTP/1.1
#> -> Host: data.sec.gov
#> -> User-Agent: Sample Company Name AdminContact@<sample company domain>.com
#> -> Accept: */*
#> -> Accept-Encoding: deflate, gzip
#> -> 
#> <- HTTP/1.1 200 OK
#> <- Content-Type: application/json
#> <- x-amzn-RequestId: c634dcbe-68aa-4777-9f18-4edfae752eb4
#> <- Access-Control-Allow-Origin: *
#> <- x-amz-apigw-id: IvJu4HiHIAMFidw=
#> <- X-Amzn-Trace-Id: Root=1-64c2bcc5-5db9315369e664da512cb6b5
#> <- Vary: Accept-Encoding
#> <- Content-Encoding: gzip
#> <- Expires: Thu, 27 Jul 2023 18:51:49 GMT
#> <- Cache-Control: max-age=0, no-cache, no-store
#> <- Pragma: no-cache
#> <- Date: Thu, 27 Jul 2023 18:51:49 GMT
#> <- Content-Length: 28594
#> <- Connection: keep-alive
#> <- Strict-Transport-Security: max-age=31536000 ; preload
#> <- Set-Cookie: ak_bmsc=E9...

resp
#> <httr2_response>
#> GET https://data.sec.gov/submissions/CIK0000320193.json
#> Status: 200 OK
#> Content-Type: application/json
#> Body: In memory (157568 bytes)

# first few keys / values from JSON:
resp_body_json(resp, simplifyVector = TRUE, flatten = TRUE) |>
  head(n = 10) |>
  str()
#> List of 10
#>  $ cik                              : chr "320193"
#>  $ entityType                       : chr "operating"
#>  $ sic                              : chr "3571"
#>  $ sicDescription                   : chr "Electronic Computers"
#>  $ insiderTransactionForOwnerExists : int 0
#>  $ insiderTransactionForIssuerExists: int 1
#>  $ name                             : chr "Apple Inc."
#>  $ tickers                          : chr "AAPL"
#>  $ exchanges                        : chr "Nasdaq"
#>  $ ein                              : chr "942404110"

Created on 2023-07-27 with reprex v2.0.2
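
If you would rather keep using jsonlite::fromJSON() as in the question, one possible approach is to set R's HTTPUserAgent option, which base R's url() connections use as their user agent, and pass such a connection to fromJSON(). This is a minimal sketch and has not been tested against the SEC endpoint: fetch_json_with_agent is a hypothetical helper name, and the contact string in the commented usage is a placeholder that you should replace with your own details.

```r
library(jsonlite)

# Hypothetical helper: read JSON through a base R url() connection while a
# custom user agent is in effect, then restore the previous option value.
fetch_json_with_agent <- function(json_url, agent) {
  old <- options(HTTPUserAgent = agent)
  on.exit(options(old), add = TRUE)
  # fromJSON() accepts a connection object as well as a URL string
  fromJSON(base::url(json_url), flatten = TRUE)
}

# Usage (placeholder contact string, replace with your own name and email):
# df <- fetch_json_with_agent(
#   "https://data.sec.gov/submissions/CIK0000320193.json",
#   "Sample Company Name AdminContact@example.com"
# )
```

Restoring the option on exit keeps the custom agent from leaking into unrelated downloads in the same session.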

I'm in the EU. I can open that JSON URL in the browser without any issues, but the default jsonlite and httr2 user agents are blocked. Using my browser's user agent with httr2 works only when I also set the Accept-Language header. They seem to check for some odd pattern in the user agent when the request is not coming from a browser,
e.g. "foo_bar" - NOK / "foo.bar" - OK
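
The browser-agent variant mentioned above could be set up roughly like this. It is only a sketch: the user agent string and the Accept-Language value are illustrative assumptions, not the exact values I used, so substitute whatever your own browser sends.

```r
library(httr2)

# Illustrative browser-style user agent (an assumption, copy your browser's own)
browser_agent <- "Mozilla/5.0 (X11; Linux x86_64; rv:109.0) Gecko/20100101 Firefox/115.0"

req <- request("https://data.sec.gov/submissions/CIK0000320193.json") |>
  req_user_agent(browser_agent) |>
  # a browser-style agent was only accepted together with Accept-Language
  req_headers(`Accept-Language` = "en-US,en;q=0.5")

# Nothing is sent until the request is performed:
# resp <- req_perform(req)
# resp_body_json(resp, simplifyVector = TRUE, flatten = TRUE)
```

Declaring your own contact details with req_user_agent(), as the SEC asks, is still the better option; this only mirrors what a browser sends.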