i'm a beginner to learning python, and need some help in trying to get the following details from a multi-fasta file:

  1. list out all ORFs in all three reading frames for each sequence in a multi fasta format
  2. list out the no. of ORFs and longest ORF for each sequence along with its position
  3. find out the identifier/sequence description starting with a '>' for the longest ORfs.

I cannot seem to get even the lsiting of ORFs right :'). I'd really appreciate all the help i can get.

I tried listing out the ORFs from each sequence and counting them by first making a dictionary whose keys are the sequence descriptions/identifiers starting with a '>' and values are the sequences themselves.

my code:

def findcodon(file,frame):
    f=open(file)
    dictionary=makedictionary(file)
    mlist = list(dictionary.values())
    orfcount=0
    startpos=0
    endpos=0
    stopcodon=['tga','tag','taa','TGA','TAG','TGA']
    startcodon=['ATG','atg']
    for i in mlist:
        for j in range(frame,len(i),3):
            codon=i[j:j+3]
            if codon in startcodon:
                startpos=int(j)
            elif codon in stopcodon:
                endpos=int(j)
        orf=i[j][startpos:endpos]
        orfcount+=1
        print('orf',orfcount,':',orf)
        print('total no.of orfs:',orfcount)


findcodon('myseq.txt',1)

My output:

orf 1 : 
total no.of orfs: 1
orf 2 : 
total no.of orfs: 2
orf 3 : 
total no.of orfs: 3
orf 4 : 
total no.of orfs: 4
orf 5 : 
total no.of orfs: 5

1

There are 1 best solutions below

0
Umar On BEST ANSWER

In the original code, the start and end positions for each ORF were being overwritten in each loop iteration, leading to incorrect ORF extraction. Additionally, the logic for extracting ORFs using list slicing wasn't accurate, and the total number of ORFs was printed inside the loop, resulting in a count of 1 for each ORF due to repetition.

def find_codon(file, frame):
    # Open the file and read sequences into a dictionary
    dictionary = {}
    with open(file, 'r') as f:
        identifier = ''
        sequence = ''
        for line in f:
            line = line.strip()
            if line.startswith('>'):
                if identifier:
                    dictionary[identifier] = sequence
                identifier = line
                sequence = ''
            else:
                sequence += line

        # Add the last sequence
        if identifier:
            dictionary[identifier] = sequence

    # Define start and stop codons
    stop_codons = {'tga', 'tag', 'taa'}
    start_codon = 'atg'

    # Iterate through each sequence
    for identifier, sequence in dictionary.items():
        orf_count = 0
        longest_orf = ('', '', 0)  # (ORF, position, length)

        # Loop through each reading frame
        for i in range(frame):
            start_pos = None
            for j in range(i, len(sequence), 3):
                codon = sequence[j:j+3].lower()
                if codon == start_codon:
                    start_pos = j
                elif codon in stop_codons and start_pos is not None:
                    orf_count += 1
                    orf = sequence[start_pos:j+3]
                    orf_length = j - start_pos + 3
                    if orf_length > longest_orf[2]:
                        longest_orf = (orf, start_pos, orf_length)
                    print(f'ORF {orf_count}: {orf} at position {start_pos}')
                    start_pos = None

        print(f'Total number of ORFs: {orf_count}')
        if longest_orf[0]:
            print(f'Longest ORF: {longest_orf[0]} at position {longest_orf[1]}')

# Example usage
find_codon('example_fasta.fasta', 3)