I have a FASTA file containing different amino acid sequences, for
example example.fasta:
>abc
HSTSDSAQTMFPVALLLLAAGSCVKGEQLTQPTSVTVQPGQRLTITCQVSYSLGTYFTAW
IRQPAGKGLEWIGMRSTGASYYKDSLKNKFSIDLDTSSKTVTLNGQNVQPEDTAVYYCAR
APSRGFDYWGKGTMVTITSATPKGPTVFPL
>def
TARQIQHKPCFL*LCCCWQLDHV*RVNS*HSRPL*LCSQVNV*PSPVRSLILLVPTSQLG
SDSLQEKDWSGLE*DLLELHTTKIH*RTSSVST*TLPAKL*L*MDRMCSLKTLLCITVPE
RPVGVLTTGGKAPWSPSPRPPQRDQLCFL*
>ghi
GSQHVRFSTNHVSCSSAAVGSWIMCEG*TVDTADLCDCAARSTSDHHLSGLLFSW*LLHS
LDQTACRKRTGVDWEQIYWSCILQRFIKEQVQYRLRHFQQNCDSKWTECAA*RHCCVLLC
QTTGSGSWLLGERHHGHHHLGHPKGTNCVSS
I want to separate the "productive" sequences from the "non-productive" ones.
Additional info: I translated every DNA sequence into amino acid sequences in all 6 reading frames.
By "non-productive" I mean sequences that don't translate into proteins (they don't contain the amino acid M and/or they have too many stop codons, shown as *). I would like to save these non-productive sequences to one FASTA file.
As for the "productive" ones, I would also like to save every "productive" sequence, keeping only the complete frame, to another FASTA file.
An example using Biopython and a threshold on the number of stop codons:
Output:
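The filter described above can be sketched as follows. This is only a minimal sketch, not necessarily the exact code of this answer: the names is_productive, split_records and MAX_STOPS are made up for illustration, and the rule assumed is the one from the question (an M is present and there are fewer than 3 stop codons). It uses Biopython's SeqIO.parse to read the records.

```python
MAX_STOPS = 3  # assumed threshold, mirroring the s.count('*') < 3 test


def is_productive(seq, max_stops=MAX_STOPS):
    """True if the translated sequence has an M and fewer than max_stops '*'."""
    return "M" in seq and seq.count("*") < max_stops


def split_records(handle, max_stops=MAX_STOPS):
    """Split FASTA records into (productive, non_productive) lists."""
    # Biopython is imported here so that is_productive can be used without it.
    from Bio import SeqIO

    productive, non_productive = [], []
    for record in SeqIO.parse(handle, "fasta"):
        s = str(record.seq)
        if is_productive(s, max_stops):
            productive.append(record)
        else:
            non_productive.append(record)
    return productive, non_productive
```

Calling split_records("example.fasta") on the three records from the question should keep only abc under this rule: def has more than two stops, and ghi contains an M but has exactly three stops, so it fails the "fewer than 3" test.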
You can add more complex logic by replacing s.count('*')<3 with a custom function:
Writing as FASTA:
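As an illustration of both points, here is a hedged sketch of a custom rule plus writing the two groups with SeqIO.write. The rule itself is hypothetical: longest_orf, min_orf and the default of 20 residues are assumptions chosen for the example, not something stated in the question. Adapt the predicate to whatever "productive" should mean for your data.

```python
def longest_orf(seq):
    """Length of the longest stop-free stretch that starts at an M."""
    best = 0
    for segment in seq.split("*"):
        start = segment.find("M")
        if start != -1:
            best = max(best, len(segment) - start)
    return best


def is_productive(seq, max_stops=3, min_orf=20):
    """Hypothetical rule: few stops and a long enough stretch starting at M."""
    return seq.count("*") < max_stops and longest_orf(seq) >= min_orf


def write_split(in_path, productive_path, non_productive_path, keep=is_productive):
    """Write productive and non-productive records to two FASTA files."""
    from Bio import SeqIO  # Biopython

    productive, non_productive = [], []
    for record in SeqIO.parse(in_path, "fasta"):
        target = productive if keep(str(record.seq)) else non_productive
        target.append(record)
    # SeqIO.write accepts a list of records and a filename (or handle).
    SeqIO.write(productive, productive_path, "fasta")
    SeqIO.write(non_productive, non_productive_path, "fasta")
    return len(productive), len(non_productive)
```

Because the predicate is passed in as keep, you can swap in any other function that takes the sequence string and returns True or False.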
Output:
Note that if you only need the files, you can directly write the sequence in the loop:
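A minimal sketch of that streaming variant, under the same assumed rule as above (an M is present and there are fewer than 3 stops). The function name stream_split and the file paths passed to it are placeholders. Each record is written as soon as it is read, so nothing is kept in memory.

```python
def is_productive(seq, max_stops=3):
    """True if the sequence has an M and fewer than max_stops '*'."""
    return "M" in seq and seq.count("*") < max_stops


def stream_split(in_path, productive_path, non_productive_path, max_stops=3):
    """Write each record to one of two FASTA files while iterating."""
    from Bio import SeqIO  # Biopython

    counts = {"productive": 0, "non_productive": 0}
    with open(productive_path, "w") as good, open(non_productive_path, "w") as bad:
        for record in SeqIO.parse(in_path, "fasta"):
            if is_productive(str(record.seq), max_stops):
                SeqIO.write(record, good, "fasta")  # a single record is accepted
                counts["productive"] += 1
            else:
                SeqIO.write(record, bad, "fasta")
                counts["non_productive"] += 1
    return counts
```

For example, stream_split("example.fasta", "productive.fasta", "non_productive.fasta") returns the number of records written to each file.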