In Python, I am trying to break a SMILES string into a list of valid SMILES elements. Does RDKit already have a method for this kind of deconstruction of a SMILES string? I have already created a list of valid SMILES elements separately.
For example, I want to convert the string CC(Cl)c1ccn(C)c1 into the list ['C', 'C', '(', 'Cl', ')', 'c', '1', 'c', 'c', 'n', '(', 'C', ')', 'c', '1']. Unfortunately, this is not as straightforward as simply taking the characters of the string one by one: a lowercase letter can either be the second character of a multi-character element symbol (like Cl for chlorine) or indicate that the atom is part of an aromatic ring (like n for nitrogen). Examples of other valid SMILES elements that are not a single character are Mg, Ca, Uub, %13, +2, @@, etc.
I would like to know this before I write my own parsing algorithm, which I think would be less than ideal because I might miss a SMILES rule here and there (I am an expert neither at SMILES nor at parsing). For example, two-digit numbers are another complication that I know I would have to deal with in my own parser.
Here's an RDKit solution:
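A minimal sketch of this approach, written from the step-by-step description in this answer. The function name `atom_tokens` is illustrative only, and the exact strings returned may depend on the RDKit version installed.

```python
from rdkit import Chem


def atom_tokens(smiles):
    """Return one single-atom SMILES string per atom, in input order."""
    # Parse without sanitization so that Kekule input such as C1=CC=CC=C1
    # is not converted to its aromatic form.
    mol = Chem.MolFromSmiles(smiles, sanitize=False)
    if mol is None:
        raise ValueError(f"could not parse SMILES: {smiles!r}")

    tokens = []
    for atom in mol.GetAtoms():
        # Build an empty editable molecule that holds just this one atom.
        single = Chem.RWMol()
        single.AddAtom(atom)
        # Write that one-atom molecule back out as SMILES.
        tokens.append(Chem.MolToSmiles(single))
    return tokens


print(atom_tokens("CC(Cl)c1ccn(C)c1"))
```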
This outputs:
['C', 'C', 'Cl', 'c', 'c', 'c', 'n', 'C', 'c']
It works as follows: the SMILES gets parsed without sanitization (otherwise it would turn C1=CC=CC=C1 into c1ccccc1 etc.). Then it loops through every atom, creates an RWMol instance, adds that one atom to it, and converts the result to a single-atom SMILES. This is repeated for every atom.
Note that the two occurrences of "1" are not in the list, because in the SMILES string they are ring-opening/closure labels, not atoms. The parentheses are missing for the same reason: they mark branches, not atoms.
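If the ring-closure digits and parentheses are needed in the list, as in the question, a regular expression is one alternative that does not use RDKit at all. The sketch below is not an RDKit feature: the pattern and the name `tokenize_smiles` are assumptions for illustration, the pattern only covers the organic-subset atoms written outside brackets, and it keeps a bracket atom such as [Mg+2] as a single token rather than splitting it into Mg and +2. It also does not check that the SMILES is chemically valid.

```python
import re

# Alternatives are tried left to right, so two-letter symbols (Br, Cl)
# and two-digit ring labels (%13) are matched before single characters.
SMILES_TOKEN = re.compile(
    r"(\[[^\]]+\]|Br?|Cl?|N|O|S|P|F|I|b|c|n|o|s|p|\(|\)|\.|=|#|-|\+"
    r"|\\|/|:|~|@|\?|>|\*|\$|%[0-9]{2}|[0-9])"
)


def tokenize_smiles(smiles):
    """Split a SMILES string into tokens using the pattern above."""
    tokens = SMILES_TOKEN.findall(smiles)
    # If joining the tokens does not reproduce the input, some character
    # was skipped because the pattern does not recognise it.
    if "".join(tokens) != smiles:
        raise ValueError(f"unrecognised characters in SMILES: {smiles!r}")
    return tokens


print(tokenize_smiles("CC(Cl)c1ccn(C)c1"))
# ['C', 'C', '(', 'Cl', ')', 'c', '1', 'c', 'c', 'n', '(', 'C', ')', 'c', '1']
```

Because the round-trip check raises on anything the pattern skips, an unexpected symbol shows up as an error instead of silently disappearing from the list.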