I'm trying to parse Gene Ontology (GO) terms from an OBO file and display them hierarchically using Python. I have made progress, but I'm having trouble handling multiple is_a relationships within the same term. My goal is a hierarchical structure that takes all is_a relationships into account.
I'm working with a subset of the Gene Ontology data from the go-basic.obo file. Here's an example of the data format:
format-version: 1.2
data-version: releases/2023-06-11
subsetdef: chebi_ph7_3 "Rhea list of ChEBI terms representing the major species at pH 7.3."
subsetdef: gocheck_do_not_annotate "Term not to be used for direct annotation"
subsetdef: gocheck_do_not_manually_annotate "Term not to be used for direct manual annotation"
subsetdef: goslim_agr "AGR slim"
subsetdef: goslim_aspergillus "Aspergillus GO slim"
subsetdef: goslim_candida "Candida GO slim"
subsetdef: goslim_chembl "ChEMBL protein targets summary"
subsetdef: goslim_drosophila "Drosophila GO slim"
subsetdef: goslim_flybase_ribbon "FlyBase Drosophila GO ribbon slim"
subsetdef: goslim_generic "Generic GO slim"
subsetdef: goslim_metagenomics "Metagenomics GO slim"
subsetdef: goslim_mouse "Mouse GO slim"
subsetdef: goslim_pir "PIR GO slim"
subsetdef: goslim_plant "Plant GO slim"
subsetdef: goslim_pombe "Fission yeast GO slim"
subsetdef: goslim_synapse "synapse GO slim"
subsetdef: goslim_yeast "Yeast GO slim"
subsetdef: prokaryote_subset "GO subset for prokaryotes"
synonymtypedef: syngo_official_label "label approved by the SynGO project"
synonymtypedef: systematic_synonym "Systematic synonym" EXACT
default-namespace: gene_ontology
ontology: go
[Term]
id: GO:0000001
name: mitochondrion inheritance
namespace: biological_process
def: "The distribution of mitochondria, including the mitochondrial genome, into daughter cells after mitosis or meiosis, mediated by interactions between mitochondria and the cytoskeleton." [GOC:mcc, PMID:10873824, PMID:11389764]
synonym: "mitochondrial inheritance" EXACT []
is_a: GO:0048308 ! organelle inheritance
is_a: GO:0048311 ! mitochondrion distribution
[Term]
id: GO:0048308
name: organelle inheritance
namespace: biological_process
def: "The partitioning of organelles between daughter cells at cell division." [GOC:jid]
subset: goslim_pir
subset: goslim_yeast
is_a: GO:0006996 ! organelle organization
[Term]
id: GO:0007029
name: endoplasmic reticulum organization
namespace: biological_process
def: "A process that is carried out at the cellular level which results in the assembly, arrangement of constituent parts, or disassembly of the endoplasmic reticulum." [GOC:dph, GOC:jl, GOC:mah]
subset: goslim_pir
synonym: "endoplasmic reticulum morphology" RELATED []
synonym: "endoplasmic reticulum organisation" EXACT []
synonym: "endoplasmic reticulum organization and biogenesis" RELATED [GOC:mah]
synonym: "ER organisation" EXACT []
synonym: "ER organization and biogenesis" RELATED [GOC:mah]
is_a: GO:0006996 ! organelle organization
relationship: part_of GO:0010256 ! endomembrane system organization
[Term]
id: GO:0048309
name: endoplasmic reticulum inheritance
namespace: biological_process
def: "The partitioning of endoplasmic reticulum between daughter cells at cell division." [GOC:jid]
synonym: "ER inheritance" EXACT []
is_a: GO:0007029 ! endoplasmic reticulum organization
is_a: GO:0048308 ! organelle inheritance
[Term]
id: GO:0048313
name: Golgi inheritance
namespace: biological_process
def: "The partitioning of Golgi apparatus between daughter cells at cell division." [GOC:jid, PMID:12851069]
synonym: "Golgi apparatus inheritance" EXACT []
synonym: "Golgi division" EXACT [GOC:ascb_2009, GOC:dph, GOC:tb]
synonym: "Golgi partitioning" EXACT []
is_a: GO:0007030 ! Golgi organization
is_a: GO:0048308 ! organelle inheritance
[Term]
id: GO:0007030
name: Golgi organization
namespace: biological_process
def: "A process that is carried out at the cellular level which results in the assembly, arrangement of constituent parts, or disassembly of the Golgi apparatus." [GOC:dph, GOC:jl, GOC:mah]
subset: goslim_pir
synonym: "Golgi apparatus organization" EXACT []
synonym: "Golgi organisation" EXACT []
synonym: "Golgi organization and biogenesis" RELATED [GOC:mah]
is_a: GO:0006996 ! organelle organization
relationship: part_of GO:0010256 ! endomembrane system organization
[Term]
id: GO:0090166
name: Golgi disassembly
namespace: biological_process
def: "A cellular process that results in the breakdown of a Golgi apparatus that contributes to Golgi inheritance." [GOC:ascb_2009, GOC:dph, GOC:tb]
synonym: "Golgi apparatus disassembly" EXACT []
is_a: GO:0007030 ! Golgi organization
is_a: GO:1903008 ! organelle disassembly
relationship: part_of GO:0048313 ! Golgi inheritance
[Term]
id: GO:1903008
name: organelle disassembly
namespace: biological_process
def: "The disaggregation of an organelle into its constituent components." [GO_REF:0000079, GOC:TermGenie]
synonym: "organelle degradation" EXACT []
is_a: GO:0006996 ! organelle organization
is_a: GO:0022411 ! cellular component disassembly
[Term]
id: GO:0006996
name: organelle organization
namespace: biological_process
alt_id: GO:1902589
def: "A process that is carried out at the cellular level which results in the assembly, arrangement of constituent parts, or disassembly of an organelle within a cell. An organelle is an organized structure of distinctive morphology and function. Includes the nucleus, mitochondria, plastids, vacuoles, vesicles, ribosomes and the cytoskeleton. Excludes the plasma membrane." [GOC:mah]
subset: goslim_candida
subset: goslim_pir
synonym: "organelle organisation" EXACT []
synonym: "organelle organization and biogenesis" RELATED [GOC:dph, GOC:jl, GOC:mah]
synonym: "single organism organelle organization" EXACT [GOC:TermGenie]
synonym: "single-organism organelle organization" RELATED []
is_a: GO:0016043 ! cellular component organization
I used this code:
def parse_obo(file_path):
    terms = {}
    current_term = None
    with open(file_path, 'r') as f:
        for line in f:
            line = line.strip()
            if not line:
                if current_term:
                    terms[current_term['id']] = current_term
                    current_term = None
            elif line.startswith('[Term]'):
                if current_term:
                    terms[current_term['id']] = current_term
                current_term = {'id': ''}
            elif current_term:
                parts = line.split(': ', 1)
                if len(parts) == 2:
                    current_term[parts[0]] = parts[1]
    return terms

def display_hierarchy(terms, term_id, indent=0):
    if term_id in terms:
        term = terms[term_id]
        print(' ' * indent + term_id)
        if 'is_a' in term:
            parent_ids = [parent.split()[1] for parent in term['is_a'] if len(parent.split()) > 1]
            for parent_id in parent_ids:
                display_hierarchy(terms, parent_id, indent + 4)
        if 'id' in term:
            child_ids = [child_id for child_id in terms if term_id in terms[child_id].get('is_a', [])]
            for child_id in child_ids:
                display_hierarchy(terms, child_id, indent + 4)

if __name__ == "__main__":
    file_path = 'go-basic_1.obo'
    terms = parse_obo(file_path)
    for term_id in terms:
        display_hierarchy(terms, term_id, indent=0)
This is the output I got:
GO:0000001
GO:0048308
    GO:0048309
    GO:0048313
GO:0007029
GO:0048309
GO:0048313
GO:0007030
GO:0090166
GO:1903008
    GO:0090166
GO:0006996
    GO:0048308
        GO:0048309
        GO:0048313
    GO:0007029
    GO:0007030
But I want a result like this:
GO:0016043
    GO:0006996
        GO:1903008
            GO:0090166
        GO:0048308
            GO:0000001
            GO:0048309
            GO:0048313
        GO:0007029
            GO:0048309
        GO:0007030
            GO:0048313
            GO:0090166
GO:0048311
    GO:0000001
GO:0022411
    GO:1903008
        GO:0090166
I want to plot Gene Ontology results for my genomic data, so this is where I started. Kindly help.
You would need to take care of these points:
- As is_a may occur multiple times per item, you would need to collect the values in a collection; otherwise you overwrite the previous value and only retain the last value encountered per term. I would generalise this and give every item in a term a list value, except maybe for id, which should occur only once per term.
- To display the hierarchy, you would benefit from having the relation from parent to children instead of from child to parents. So I would suggest including a separate function that adds this inverse relationship to the terms.
Here is how that would look:
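What follows is a minimal sketch of both points, not a definitive implementation. It assumes that every term has an id line inside its [Term] stanza and that the is_a graph is acyclic (otherwise the recursion would not terminate). The names add_children, hierarchy_lines and find_roots, and the embedded SAMPLE text, are made up here for illustration.

```python
# Hypothetical two-term sample in OBO layout, used only for this sketch.
SAMPLE = """\
[Term]
id: GO:0000001
name: mitochondrion inheritance
is_a: GO:0048308 ! organelle inheritance
is_a: GO:0048311 ! mitochondrion distribution

[Term]
id: GO:0048308
name: organelle inheritance
is_a: GO:0006996 ! organelle organization
"""


def parse_obo(lines):
    """Parse [Term] stanzas; every tag except 'id' maps to a list of values."""
    terms = {}
    current = None
    for raw in lines:
        line = raw.strip()
        if line.startswith('['):
            # Any stanza header closes the previous stanza;
            # only [Term] opens a new term.
            current = {} if line == '[Term]' else None
        elif current is not None and ': ' in line:
            key, value = line.split(': ', 1)
            if key == 'id':
                current['id'] = value
                terms[value] = current
            else:
                # Append instead of assigning, so repeated tags such as
                # is_a keep all of their values.
                current.setdefault(key, []).append(value)
    return terms


def add_children(terms):
    """Add the inverse of is_a: each parent gets a 'children' list."""
    for term_id in list(terms):  # copy the keys, as stubs may be added
        for entry in terms[term_id].get('is_a', []):
            parent_id = entry.split()[0]  # drop the "! name" comment
            # A parent without its own stanza in the file gets a stub entry.
            parent = terms.setdefault(parent_id, {'id': parent_id})
            parent.setdefault('children', []).append(term_id)


def find_roots(terms):
    """Terms without any is_a parent are the tops of the hierarchy."""
    return [term_id for term_id, term in terms.items() if 'is_a' not in term]


def hierarchy_lines(terms, term_id, indent=0):
    """Yield the term and, recursively, its children as indented lines."""
    yield ' ' * indent + term_id
    for child_id in terms[term_id].get('children', []):
        yield from hierarchy_lines(terms, child_id, indent + 4)


terms = parse_obo(SAMPLE.splitlines())
add_children(terms)
for root_id in find_roots(terms):
    for line in hierarchy_lines(terms, root_id):
        print(line)
```

To run this on a real file, pass the open file object instead of the sample, for example with open('go-basic_1.obo') as f: terms = parse_obo(f). A term with several is_a parents is printed once under each of them. Children appear in the order in which they occur in the file, so the order of siblings may differ from the listing in the question.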