Here is my working Python script for downloading FASTA sequences from UniProt (with real appreciation to the community).
'''
UniProt FASTA downloader: read accession IDs from a text file,
show the download progress for each sequence being downloaded,
and make a list of inaccessible sequences.
'''
import functools
import pathlib
import shutil
import requests
from tqdm.auto import tqdm
# Part I: read the file with IDs and make a list of URLs to download the respective sequences
with open('errtest.txt', 'r') as infile:
    lines = infile.readlines()
    listfile_name = infile.name

file_name = listfile_name.split('.', 1)[0]
downloaded = 0  # sequences downloaded
URL_list = []
for line in lines:
    access_id = line.strip()
    url_part1 = 'https://rest.uniprot.org/uniprotkb/'
    url_part2 = '.fasta'
    URL = url_part1 + access_id + url_part2
    URL_list.append(URL)
not_found = []
for url in URL_list:
    r = requests.get(url, stream=True, allow_redirects=True)
    file_size = int(r.headers.get('Content-Length', 0))
    if r.status_code != 200:
        Apart = url.removeprefix('https://rest.uniprot.org/uniprotkb/')
        short_id = Apart.removesuffix('.fasta')
        not_found.append(short_id)
        print(short_id, '-- not found')
    elif r.status_code == 200:
        path = pathlib.Path(file_name + 'seqs.fa').expanduser().resolve()
        path.parent.mkdir(parents=True, exist_ok=True)
        desc = "(Unknown total file size)" if file_size == 0 else ""
        r.raw.read = functools.partial(r.raw.read, decode_content=True)  # decompress if needed
        with tqdm.wrapattr(r.raw, "read", total=file_size, desc=desc) as r_raw:
            with path.open("ab") as f:
                shutil.copyfileobj(r_raw, f)
        downloaded += 1
print('Sequences with these accession ids were not found:\n', not_found)
print(downloaded, 'sequences downloaded')
These are the contents of the errtest.txt file (some deliberately wrong IDs mixed with correct ones):
wrong1
D3VN13
B9W4V6
wrong2
A0A8S0XZH6
wrong3
This is the typical output:
wrong1 -- not found
0%| | 0/477 [00:00<?, ?it/s]
100%|██████████| 477/477 [00:00<00:00, 239kB/s]
0%| | 0/473 [00:00<?, ?it/s]
100%|██████████| 473/473 [00:00<00:00, 42.4kB/s]
wrong2 -- not found
0%| | 0/534 [00:00<?, ?it/s]
100%|██████████| 534/534 [00:00<00:00, 268kB/s]
wrong3 -- not found
Sequences with these accession ids were not found:
['wrong1', 'wrong2', 'wrong3']
3 sequences downloaded
So far, so good. Next, I want a single progress bar for all the downloads. In this test file there are only 3 valid IDs and 3 wrong IDs (which happens sometimes), so three progress bars shown one after another are fine. In reality, though, the list file will contain thousands of IDs, and therefore thousands of URLs and sequence downloads, so a single progress bar showing the overall download progress would be ideal.
I think you could compute the total size before starting the download loop and then use a single progress bar, something like this:
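Here is a rough sketch of that idea (not tested against the live service): a first pass of HEAD requests sums up the advertised file sizes, and a second pass streams the downloads while one byte-count bar advances chunk by chunk. I am assuming the UniProt endpoints return a Content-Length for HEAD requests; entries without a size just contribute 0 to the total, and if responses arrive compressed the bar total will only be approximate, because the decompressed bytes are what get counted. The chunk size and output file name are arbitrary choices.

'''
Sketch only: one overall progress bar for all downloads,
assuming HEAD requests report a Content-Length header.
'''
import pathlib
import requests
from tqdm.auto import tqdm

with open('errtest.txt', 'r') as infile:
    ids = [line.strip() for line in infile if line.strip()]
    listfile_name = infile.name
file_name = listfile_name.split('.', 1)[0]

urls = ['https://rest.uniprot.org/uniprotkb/' + acc + '.fasta' for acc in ids]

# First pass: ask the server for each file size without downloading the body.
total_bytes = 0
for url in urls:
    head = requests.head(url, allow_redirects=True)
    if head.status_code == 200:
        total_bytes += int(head.headers.get('Content-Length', 0))

not_found = []
downloaded = 0
path = pathlib.Path(file_name + 'seqs.fa').expanduser().resolve()
path.parent.mkdir(parents=True, exist_ok=True)

# Second pass: one bar measured in bytes, advanced by each chunk written.
with tqdm(total=total_bytes, unit='B', unit_scale=True, desc='Downloading') as bar:
    with path.open('ab') as f:
        for url in urls:
            r = requests.get(url, stream=True, allow_redirects=True)
            if r.status_code != 200:
                short_id = url.removeprefix('https://rest.uniprot.org/uniprotkb/').removesuffix('.fasta')
                not_found.append(short_id)
                continue
            for chunk in r.iter_content(chunk_size=8192):
                f.write(chunk)
                bar.update(len(chunk))
            downloaded += 1

print('Sequences with these accession ids were not found:\n', not_found)
print(downloaded, 'sequences downloaded')

This costs one extra HEAD request per ID. If that first pass is too slow for thousands of IDs, a simpler variant is tqdm(total=len(urls)) with bar.update(1) after each sequence, i.e. progress by sequence count rather than by bytes.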