I would like to read the viral host organism from a number of GenBank records. I have tried downloading the full GenBank files and reading them with Bio.SeqIO.read(), and I have also tried querying the database through Entrez.efetch. Here is an example using only one ID:
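# Requires Biopython: pip install biopython
from Bio import Entrez, SeqIO
Entrez.email = 'you@example.com'  # placeholder; NCBI asks callers to supply a contact address
accession = 'CY238774.1'
handle = Entrez.efetch(db='nucleotide', id=accession, rettype='gb', retmode='text')
record = SeqIO.read(handle, 'gb')
handle.close()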
When I look up this ID on NCBI in a web browser, I can see that the record says 'host=Homo sapiens'. This text is also present in the downloaded .gb file. However, I cannot find this information anywhere in the SeqRecord object created above. It appears that the information is lost when the SeqRecord is created. I have checked all the class attributes.
Is there a way to extract this information from the SeqRecord?
Let's review that record:
So we're looking for FEATURES --> source --> /host.
Now let's switch to the API. It turns out that the very first feature that came back has type 'source', and the host sits in that feature's qualifiers.
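Here is a minimal sketch of that lookup. The helper name `get_host` is made up for illustration and is not part of Biopython. It only assumes the record behaves like a SeqRecord: a `features` list whose items carry a `type` string and a dict-like `qualifiers` mapping with list values.

```python
def get_host(record):
    """Return the first /host qualifier of the record's 'source' feature, or None.

    `record` is expected to look like a Biopython SeqRecord: a `.features`
    list whose items have `.type` and a dict-like `.qualifiers`.
    """
    for feature in record.features:
        if feature.type == "source":
            # Qualifier values are lists, since a qualifier may appear more than once.
            hosts = feature.qualifiers.get("host", [])
            if hosts:
                return hosts[0]
    # No source feature, or the source feature carries no /host qualifier.
    return None
```

Called on the record fetched in the question, `get_host(record)` should give 'Homo sapiens', going by the host line quoted there. Scanning for the 'source' feature rather than indexing `features[0]` is a defensive choice, and returning None keeps the function usable on records that have no host annotation.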
Ta da! It was lurking within.