Biopython: accessing a FASTA file stored on the computer


Implement a program in Python that includes a function which:

  • takes as its input argument the name of a file that stores protein sequences in FASTA format;
  • reads the sequences from the file using a suitable function/method in the Biopython package and stores them in a list;
  • for each protein sequence, uses a function/method in the re module to extract all non-overlapping matches for the patterns listed below. All non-overlapping matches should be written to a results file together with the ID of the protein in which the pattern was found. Patterns to search for:
    • W, followed by any amino acid, followed by P
    • Two S in a row, followed by a D or L
    • Q followed by one or two A

I cannot upload my own file, so please download any protein sequence FASTA file to test with.

My problem is that I could not get the program to open my FASTA file when searching for the patterns.

Umar On

Here is a Python program with the requested function. find_patterns takes the FASTA file name and the results file name as arguments, iterates over each sequence in the FASTA file, searches it for the specified patterns with regular expressions, and writes every match to the results file.

import re
from Bio import SeqIO

def find_patterns(fasta_file, results_file):
  """
  Finds specified patterns in protein sequences from a FASTA file and prints results to a file.

  Args:
    fasta_file (str): Name of the FASTA file containing protein sequences.
    results_file (str): Name of the file to write the results to.
  """

  patterns = [
    r"W.P",  # W, any amino acid, P
    r"SS[DL]",  # Two S, D or L
    r"QAA?"  # Q, one or two A
  ]

  # Read all records into a list first, as the task asks
  records = list(SeqIO.parse(fasta_file, "fasta"))

  with open(results_file, "w") as results:
    for record in records:
      sequence = str(record.seq)
      protein_id = record.id

      for pattern in patterns:
        matches = re.finditer(pattern, sequence)
        for match in matches:
          results.write(f"{protein_id}\t{match.group()}\t{match.start()}\t{match.end()}\n")

# Example usage (replace with your actual file paths)
fasta_file = "protein_sequences.fasta"  # Replace with your FASTA file name
results_file = "pattern_matches.txt"  # Replace with your desired results file name
find_patterns(fasta_file, results_file)
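
If the FASTA file cannot be opened, the usual cause is a file name that does not resolve from the directory the script runs in. The following is only a sketch of one way to check that before parsing; load_records is a hypothetical helper name, not part of Biopython, and it assumes Biopython is installed when a real file is read.

```python
from pathlib import Path


def load_records(fasta_path):
    """Return all records of a FASTA file as a list (hypothetical helper)."""
    # Expand "~" and turn a relative name into an absolute path
    path = Path(fasta_path).expanduser().resolve()
    if not path.is_file():
        # Fail early and report the absolute path that was actually tried
        raise FileNotFoundError(f"FASTA file not found: {path}")
    # Imported here so the path check above runs even without Biopython
    from Bio import SeqIO
    return list(SeqIO.parse(str(path), "fasta"))
```

Calling load_records("protein_sequences.fasta") then either returns a list of records or raises an error that shows the full path Python looked for, which makes a wrong working directory easy to spot.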

Suppose the FASTA file contains the following protein sequences:

>FHHBHBFD_00002 hypothetical protein
MAITGRAAFIAALGSVPIGIWDPSWTGILAVNAPLAAACACDFALAAPVRRLGLTRSGDT
SARLGETADVTLTVTNPSGRPLRARLRDAWPPSSWQPGTETAASRHSLTVPAGERRRVTT
RLRPTRRGDRQADRVTIRSYGPLGLFTRQGTHRVPWTVRVLPPFTSRKHLPSKLSRLREL
DGRTSVLTRGEGTEFDSLREYVPGDDTRSIDWRATARQSTVAVRTWRPERDRHILLVLDT
GRTSAGRVGDAPRLDASMDAALLLAALASRAGDRVDLLAYDRRVRALLQGRTAGDVLPSL
VNAMATLEPELVETDARGLTATALRSAPRRSLIVLFTTLDTAPIEEGLLPVLPQLTQRHT
VLVASVADPHVAKMAEARGHTDAVYEAAAAAQAQSERRRTADQLRRHGVTVVDATPDELP
PALADAYLELKATGRL
>FHHBHBFD_00003 hypothetical protein
MMDPTTDNAGQTAAPGNARAALEALRAEIAKAVVGQDAAVTGLVVALLCRGHVLLEGVPG
VAKTLLVRTLAEATELDTKRVQFTPDLMPSDVTGSLVYDARTAEFSFQPGPVFTNLLLAD
EINRTPPKTQSSLLEAMEERQVTVDGTPRPLPEPFLVAATQNPVEYEGTYPLPEAQLDRF
LLKLTVPLPTRQDEIDVLSRHAAGFDPRDLHAAGVRPVAGAADLEAARAEAARTTVSPEI
TAYVVDICRATRESPSLTLGVSPRGATALLSTSRAWAWLTGRDYVTPDDVKALALPTLRH
RVQLRPEAEMEGVTTDSVINAILAHVPVPR

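The results file will then contain the following (columns: protein ID, matched text, 0-based start position, end position exclusive):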

FHHBHBFD_00002  WDP 20  23
FHHBHBFD_00002  WPP 89  92
FHHBHBFD_00002  WQP 94  97
FHHBHBFD_00002  WRP 225 228
FHHBHBFD_00002  QA  130 132
FHHBHBFD_00002  QA  391 393
FHHBHBFD_00003  SSL 130 133
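
The patterns can also be checked on their own, without a FASTA file, using only the re module. This is a minimal sketch: scan and PATTERNS are hypothetical names, the fragment is residues 87 to 96 (0-based) of FHHBHBFD_00002 from the example, and the offset argument is only there to shift positions back to full-sequence coordinates.

```python
import re

# The same three patterns as in the answer, compiled once
PATTERNS = {
    "W.P": re.compile(r"W.P"),        # W, any amino acid, P
    "SS[DL]": re.compile(r"SS[DL]"),  # two S, then D or L
    "QAA?": re.compile(r"QAA?"),      # Q, then one or two A
}


def scan(sequence, offset=0):
    """Return (pattern, match, start, end) tuples; end is exclusive."""
    hits = []
    for name, regex in PATTERNS.items():
        # finditer yields non-overlapping matches from left to right
        for m in regex.finditer(sequence):
            hits.append((name, m.group(), m.start() + offset, m.end() + offset))
    return hits


fragment = "DAWPPSSWQP"  # residues 87-96 of FHHBHBFD_00002
hits = scan(fragment, offset=87)
for hit in hits:
    print(hit)
# ('W.P', 'WPP', 89, 92)
# ('W.P', 'WQP', 94, 97)
```

Because "A?" is greedy, "QAA?" takes two A when they are available, so scanning "QAAAQA" reports QAA at 0 to 3 and then QA at 4 to 6.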