I have currently managed to scrape all filings for a specific ticker, e.g. 'AAPL', and every type of filing with its link is stored in a large dictionary. I would like to keep only those links where 'type' == '10-K' and download all of those files as HTML. I have tried looping over the dictionary and appending to a list, but I am still getting all the types.
from urllib.request import urlopen
import certifi
import json

response = urlopen("https://financialmodelingprep.com/api/v3/sec_filings/AMZN?page=0&apikey=aa478b6f376879bc58349bd2a6f9d5eb", cafile=certifi.where())
data = response.read().decode("utf-8")
print(json.loads(data))

list = []
for p_id in data:
    if p_id['type'] == '10-K':
        list.append(p_id['finalLink'])
print(list)
#print(get_jsonparsed_data(url))
The result of this code is shown below: every filing type is output, when only the 10-K filings are needed:
{'symbol': 'AMZN', 'fillingDate': '2014-01-31 00:00:00', 'acceptedDate': '2014-01-30 21:52:38', 'cik': '0001018724', 'type': '10-K', 'link': 'https://www.sec.gov/Archives/edgar/data/1018724/000101872414000006/0001018724-14-000006-index.htm', 'finalLink': 'https://www.sec.gov/Archives/edgar/data/1018724/000101872414000006/amzn-20131231x10k.htm'}, {'symbol': 'AMZN', 'fillingDate': '2014-01-31 00:00:00', 'acceptedDate': '2014-01-30 21:49:36', 'cik': '0001018724', 'type': 'SC 13G/A', 'link': 'https://www.sec.gov/Archives/edgar/data/1018724/000119312514029210/0001193125-14-029210-index.htm', 'finalLink': 'https://www.sec.gov/Archives/edgar/data/1018724/000119312514029210/d659830dsc13ga.htm'}, {'symbol': 'AMZN', 'fillingDate': '2014-01-30 00:00:00', 'acceptedDate': '2014-01-30 16:20:30', 'cik': '0001018724', 'type': '8-K', 'link':
If the links get appended to the list, I would ideally like to download all of them at once and save them in a folder. I have previously used the sec_edgar_downloader package, but it downloads all the 10-K files into their respective yearly folders.
Instead of filtering the list of all SEC filings on the client side in your Python code, you can actually filter them directly on the server side. Considering your final objective is to download thousands of 10-K filings filed over many years, and not just Apple's 10-Ks, you're saving yourself a lot of time by filtering on the server side.
Just FYI, there are other 10-K form variants, e.g. 10-KT, 10KSB, 10KT405, 10KSB40 and 10-K405. I'm not sure whether you are aware of them and want to ignore them, or whether you also want to download those variants.
Let's run through a full-fledged 10-K filing downloader implementation. Our application will be structured into two components:
1. Generate the list of 10-K URLs
The Query API is a search interface allowing us to search and find SEC filings across the entire EDGAR database by any filing meta data parameter. For example, we can find all 10-K filings filed by Apple using a ticker and form type search (formType:"10-K" AND ticker:AAPL), or build more complex search expressions using boolean and bracket operators. The Query API returns the meta data of SEC filings matching the search query, including the URLs to the filings themselves.
The response of the Query API represents a dictionary (short: dict) with two keys: total and filings. The value of total is a dict itself and tells us, among other things, how many filings in total match our search query. The value of filings is a list of dicts, where each dict represents all meta data of a matching filing. The URL of a 10-K filing is the value of the linkToFilingDetails key in each filing dict, for example: https://www.sec.gov/Archives/edgar/data/1318605/000119312514069681/d668062d10k.htm

In order for us to generate a complete list of 10-K URLs, we simply iterate over all filing dicts, read the linkToFilingDetails value and write the URL to a local file. The URL downloader appends a new URL to the log file filing_urls.txt on each processing iteration. In case you accidentally shut down your application, you can start off from the most recently processed year without having to download already processed URLs again. After running the code sketched below, filing_urls.txt should fill up with one filing URL per line.
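A sketch of that URL-collection loop, again assuming the sec-api package and a placeholder API key (the year-by-year, Lucene-style date-range pagination shown here is only one way to walk the full result set):

from sec_api import QueryApi

query_api = QueryApi(api_key="YOUR_API_KEY")  # placeholder key

def collect_10k_urls(start_year=2010, end_year=2014):
    # Walk the result set year by year so a restart can resume at the last processed year.
    for year in range(start_year, end_year + 1):
        offset = 0
        while True:
            query = {
                "query": {"query_string": {
                    "query": f'formType:"10-K" AND filedAt:[{year}-01-01 TO {year}-12-31]'
                }},
                "from": str(offset),  # paginate in steps of 200
                "size": "200",
                "sort": [{"filedAt": {"order": "desc"}}],
            }
            filings = query_api.get_filings(query)["filings"]
            if not filings:
                break
            # Append each filing URL to the log file, one per line.
            with open("filing_urls.txt", "a") as log:
                for filing in filings:
                    log.write(filing["linkToFilingDetails"] + "\n")
            offset += 200

collect_10k_urls()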
2. Download all 10-Ks from SEC EDGAR
The second component of our filing downloader loads all 10-K URLs from our log file filing_urls.txt into memory, and downloads 20 filings in parallel into the folder filings. All filings are downloaded into the same folder. We use the Render API interface of the SEC-API Python package to download a filing by providing its URL. The Render API allows us to download up to 40 SEC filings per second in parallel. However, we don't utilize the full bandwidth of the API because otherwise it's very likely we end up with memory overflow exceptions (considering some filings are 400+ MB large).
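In isolation, downloading a single filing through the Render API looks roughly like this (a sketch with a placeholder API key, reusing the example URL from above):

from sec_api import RenderApi

render_api = RenderApi(api_key="YOUR_API_KEY")  # placeholder key

url = "https://www.sec.gov/Archives/edgar/data/1318605/000119312514069681/d668062d10k.htm"
filing_html = render_api.get_filing(url)  # returns the filing content as a string
print(len(filing_html))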
The download_filing function downloads the filing from the URL, generates a file name using the last two parts of the URL and saves the downloaded file to the filings folder.

The download_all_filings function is the heart and soul of our application. Here, Python's built-in multiprocessing.Pool allows us to apply a function to a list of values multiple times in parallel. This way we can apply the download_filing function to the values of the URLs list in parallel. For example, setting number_of_processes to 4 results in 4 download_filing functions running in parallel, where each function processes one URL. Once a download is completed, multiprocessing.Pool gets the next URL from the URLs list and calls download_filing with the new URL.

Finally, run download_all_filings() to start downloading all 10-K filings. Your filings folder should then fill up with the downloaded 10-K filings; a sketch of the downloader follows below.
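Putting the pieces together, a sketch of download_filing and download_all_filings along the lines described above (placeholder API key; file naming and error handling kept deliberately simple):

import os
from multiprocessing import Pool
from sec_api import RenderApi

render_api = RenderApi(api_key="YOUR_API_KEY")  # placeholder key

def download_filing(url):
    # Build a file name from the last two parts of the URL,
    # e.g. .../000119312514069681/d668062d10k.htm -> 000119312514069681-d668062d10k.htm
    parts = url.split("/")
    file_name = parts[-2] + "-" + parts[-1]
    content = render_api.get_filing(url)
    os.makedirs("filings", exist_ok=True)
    with open(os.path.join("filings", file_name), "w", encoding="utf-8") as f:
        f.write(content)

def download_all_filings():
    # Load all URLs collected in step 1 and download them in parallel.
    with open("filing_urls.txt") as f:
        urls = [line.strip() for line in f if line.strip()]
    number_of_processes = 20  # 20 downloads in parallel, well below the API limit
    with Pool(number_of_processes) as pool:
        pool.map(download_filing, urls)

if __name__ == "__main__":
    download_all_filings()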