I am not a bioinformatician and my question may sound basic.
I have an issue with RDKit.
The issue: some of my antimicrobial peptide sequences contain the letter X, and it seems that RDKit cannot process them. For example, the following sequences:
seq = ['HFXGTLVNLAKKIL', 'HFLGXLVNLAKKIL', 'HFLGTLVNXAKKIL', 'fPVXLfPXXL', 'SRWPSPGRPRPFPGRPKPIFRPRPXNXYAPPXPXDRW'...]. Chem.MolFromSequence(seq[i]) returns None for these cases.
My question is: how do I deal with this kind of sequence?
Let me explain the reason for the output of None.
As you can see in this list of abbreviations for peptide sequences, the letter "X" stands for "unknown". Basically, the real amino acid at that position could not be determined. Therefore RDKit cannot create a mol object from your data, because parts of it are unknown.
Source of quote above
Since RDKit's handling of this case is logically reasonable, you have to answer the underlying question yourself: "How do I deal with unknown amino acids?" You need to preprocess those sequences, for example by replacing the "X" with something else or by removing the affected sequences from your dataframe entirely. Which option is right depends on your own use case.
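A minimal sketch of such a preprocessing step, in plain Python, might look like the following. The helper names `split_by_unknown` and `replace_unknown` are hypothetical, the X-free sequence in the sample list is made up for illustration, and glycine ("G") is only an arbitrary placeholder, not a recommendation.

```python
def split_by_unknown(sequences, unknown="X"):
    """Separate sequences into those without and those with the unknown code."""
    clean, ambiguous = [], []
    for s in sequences:
        if unknown in s:
            ambiguous.append(s)
        else:
            clean.append(s)
    return clean, ambiguous


def replace_unknown(sequence, substitute="G", unknown="X"):
    """Return a copy of the sequence with every unknown residue substituted.

    Assumption: the substitute is an arbitrary stand-in; it changes the molecule.
    """
    return sequence.replace(unknown, substitute)


# The first two sequences come from the question; the third is a
# hypothetical X-free sequence added only to show the split.
seqs = ["HFXGTLVNLAKKIL", "HFLGXLVNLAKKIL", "HFLGTLVNLAKKIL"]

clean, ambiguous = split_by_unknown(seqs)

# Option 1: drop the ambiguous sequences and keep only `clean`.
# Option 2: patch the ambiguous sequences with a placeholder residue.
patched = [replace_unknown(s) for s in ambiguous]
```

Either `clean` or `patched` can then be passed to Chem.MolFromSequence one sequence at a time. Keep in mind that a substituted sequence describes a different peptide than the one that was measured, so dropping the sequences is the safer default if you cannot justify a specific replacement.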