I am using the code below to get any free journal PDFs from PubMed. It does download something, but when I look at the file, it just consists of the number 1. Any ideas on where I am going wrong? Thank you.
import metapub
from urllib.request import urlretrieve
import textract
from pathlib import Path

another_path='/content/Articles/'
pmid_list=['35566889','33538053', '30848212']
for i in range(len(pmid_list)):
    query=pmid_list[i]
    #for ind in pmid_df.index:
    # query= pmid_df['PMID'][ind]
    url = metapub.FindIt(query).url
    try:
        urlretrieve(url)
        file_name = query
        out_file = another_path + file_name
        with open(out_file, "w") as textfile:
            textfile.write(textract.process(out_file,extension='pdf',method='pdftotext',encoding="utf_8",
            ))
    except:
        continue
I see two mistakes.

First: `urlretrieve(url)` saves the data in a temporary file with a random filename, so you can't access it because you don't know its filename (the code discards the return value). You should use the second parameter to save it under your own filename.

Second: you use the same `out_file` to process the file (`process(out_file)`) and to write the result (`open(out_file, 'w')`). But you call `open()` first, which deletes all content in the file, so `process()` later works on an empty file. You should process the file first and only then open it for writing, or write the result under a different name (e.g. with the extension `.txt`).

Full working example with other small changes:
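
A minimal sketch of how the two fixes could look; it is not necessarily the exact code from the answer. `save_article` and `main` are hypothetical helper names, and the `extract_text` and `download` parameters are introduced here only so the helper can be tried without network access. It assumes `textract.process()` returns bytes and that `FindIt(...).url` is `None` when no free PDF is found.

```python
from pathlib import Path
from urllib.request import urlretrieve


def save_article(pmid, url, out_dir, extract_text, download=urlretrieve):
    """Download one PDF and write its extracted text next to it.

    `extract_text` is any callable that takes a PDF path and returns the
    text as bytes or str (hypothetical parameter, for illustration).
    """
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)

    pdf_path = out_dir / f"{pmid}.pdf"
    txt_path = out_dir / f"{pmid}.txt"

    # Fix 1: pass a second argument so the PDF is saved under a known name.
    download(url, str(pdf_path))

    # Fix 2: process the PDF first, then write the result to a different file.
    text = extract_text(str(pdf_path))
    if isinstance(text, bytes):
        text = text.decode("utf-8")
    txt_path.write_text(text, encoding="utf-8")
    return txt_path


def main():
    # Third-party packages are imported here so the helper above can be
    # used (and tested) without them being installed.
    import metapub
    import textract

    def pdf_to_text(path):
        # Same textract arguments as in the question.
        return textract.process(path, extension="pdf",
                                method="pdftotext", encoding="utf_8")

    for pmid in ["35566889", "33538053", "30848212"]:
        src = metapub.FindIt(pmid)
        if src.url is None:
            # Assumption: no free PDF available; `reason` says why.
            print(pmid, "skipped:", src.reason)
            continue
        try:
            print(pmid, "saved:", save_article(pmid, src.url,
                                               "/content/Articles/", pdf_to_text))
        except Exception as ex:
            # Print the error instead of silently swallowing it with a bare except.
            print(pmid, "failed:", ex)
```

Calling `main()` runs the loop over the three PMIDs from the question. Printing the exception instead of using a bare `except: continue` makes it visible when a download or the text extraction fails.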