How to parse and get known individual elements, not characters, from a smiles string in Python


In Python, I am trying to break a SMILES string into a list of valid SMILES elements. Does RDKit already have a method for this kind of deconstruction of a SMILES string? I have already created a list of valid SMILES elements separately.

For example, I want to convert the string CC(Cl)c1ccn(C)c1 into the list ['C', 'C', '(', 'Cl', ')', 'c', '1', 'c', 'c', 'n', '(', 'C', ')', 'c', '1']. Unfortunately, this is not as straightforward as simply taking the characters of the string: a lowercase letter can either be the second character of an element symbol (like Cl for chlorine) or indicate that the atom is part of an aromatic ring (like n for nitrogen). Examples of other valid SMILES elements that are not a single character are Mg, Ca, Uub, %13, +2, @@, etc.
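To make the problem concrete, here is a minimal sketch of the naive character split on the same string. The variable names are only for illustration.

```python
smi = "CC(Cl)c1ccn(C)c1"

# Naive approach: treat every character as its own token.
naive = list(smi)
print(naive)

# 'Cl' has been split into 'C' and 'l', so the chlorine atom is lost.
print(naive[3:5])
```

The naive list has 16 entries instead of the 15 wanted, and the stray 'l' is not a valid SMILES element at all.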

Before I write a parsing algorithm to accomplish this, I wanted to check whether an existing solution is available. Writing my own would be less than ideal, because I might miss a SMILES rule here and there (I am an expert in neither SMILES nor parsing). For example, two-digit numbers are another complication that I know I would have to deal with in my own parsing algorithm.


Answer by wikke:

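Here's an RDKit solution:

from rdkit import Chem

def get_atom_chars(smi):
    """Return the single-atom SMILES of every atom in smi, in atom order."""
    # Skip sanitization so the atoms stay exactly as written in the input.
    mol = Chem.MolFromSmiles(smi, sanitize=False)
    if mol is None:
        raise ValueError(f"could not parse SMILES: {smi!r}")
    atom_chars = []
    for a in mol.GetAtoms():
        # Copy the atom into an empty editable molecule and write it out alone.
        single = Chem.RWMol()
        single.AddAtom(a)
        atom_chars.append(Chem.MolToSmiles(single))
    return atom_chars

print(get_atom_chars("CC(Cl)c1ccn(C)c1"))

This outputs: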

['C', 'C', 'Cl', 'c', 'c', 'c', 'n', 'C', 'c']

It works as follows: the SMILES is parsed without sanitization (otherwise it would turn C1=CC=CC=C1 into c1ccccc1, etc.). Then it loops through every atom, creates an RWMol instance, adds that one atom to it, and converts the result to a single-atom SMILES.

Note that the two "1" digits are not in the list, because in the SMILES string they mark a ring opening and closure, not atoms. The branch parentheses are missing for the same reason.

Answer by manas:

This can be accomplished by extending the following function (from Molecular Transformer):

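import re

def smi_tokenizer(smi):
    """
    Tokenize a SMILES molecule or reaction
    """
    # Raw string so the regex escapes are not treated as (invalid) Python string escapes.
    # A bracket atom such as [nH] or [Mg+2] is kept as a single token.
    pattern = r"(\[[^\]]+]|Br?|Cl?|N|O|S|P|F|I|b|c|n|o|s|p|\(|\)|\.|=|#|-|\+|\\|\/|:|~|@|\?|>|\*|\$|\%[0-9]{2}|[0-9])"
    regex = re.compile(pattern)
    tokens = regex.findall(smi)
    # Fails if the input contains anything the pattern does not cover.
    assert smi == ''.join(tokens)
    return tokens

print(smi_tokenizer("CC(Cl)c1ccn(C)c1"))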

This will output:

['C', 'C', '(', 'Cl', ')', 'c', '1', 'c', 'c', 'n', '(', 'C', ')', 'c', '1']
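One possible extension, sketched here under stated assumptions: the tokenizer above keeps a bracket atom such as [Mg+2] as a single token, while the question asks for pieces like Mg and +2. The sketch below runs a second regex over the inside of each bracket. The names INNER and tokenize_fine are hypothetical, and the inner pattern assumes the usual bracket-atom order (isotope, symbol, chirality, hydrogen count, charge, atom class) with only @ and @@ as chirality marks. It is not part of Molecular Transformer, and it falls back to the whole bracket token whenever the inner pattern does not fit.

```python
import re

# Same outer pattern as the tokenizer above, written as a raw string.
OUTER = re.compile(
    r"(\[[^\]]+]|Br?|Cl?|N|O|S|P|F|I|b|c|n|o|s|p|\(|\)|\.|=|#|-|\+|\\|\/"
    r"|:|~|@|\?|>|\*|\$|%[0-9]{2}|[0-9])"
)

# Assumed layout of a bracket atom's contents (hypothetical helper pattern).
INNER = re.compile(
    r"(\d+)?"                       # isotope, e.g. 13
    r"([A-Z][a-z]{0,2}|[a-z]{1,2}|\*)"  # element symbol, aromatic symbol, or wildcard
    r"(@@?)?"                       # chirality: @ or @@ only
    r"(H\d?)?"                      # hydrogen count, e.g. H or H3
    r"(\+\d+|-\d+|\++|-+)?"         # charge, e.g. +2, -, ++
    r"(:\d+)?"                      # atom class, e.g. :1
)

def tokenize_fine(smi):
    """Tokenize smi, additionally splitting bracket atoms into their parts."""
    tokens = []
    for tok in OUTER.findall(smi):
        inner = INNER.fullmatch(tok[1:-1]) if tok.startswith("[") else None
        if inner is None:
            # Not a bracket atom, or a bracket the inner pattern cannot split.
            tokens.append(tok)
        else:
            tokens.append("[")
            tokens.extend(part for part in inner.groups() if part is not None)
            tokens.append("]")
    # Fails if the input contains anything the outer pattern does not cover.
    assert "".join(tokens) == smi
    return tokens

print(tokenize_fine("[Mg+2]"))       # ['[', 'Mg', '+2', ']']
print(tokenize_fine("c1cc[nH]c1"))   # ['c', '1', 'c', 'c', '[', 'n', 'H', ']', 'c', '1']
```

Splitting only inside brackets avoids one pitfall of matching against a flat list of element symbols: outside brackets, a sequence like Sc means sulfur followed by an aromatic carbon, not scandium.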

Reference:
You might want to take a look at the Molecular Transformer paper by Schwaller et al. and its code.