Best method in Python to find a longest string subset against a list of multiple options per character

177 Views Asked by At

I have a simple string, and a list of sets, where each set is a position with 2 possible characters, which looks something like:

"AGTCG"

[('A', 'T'), ('C', 'B'), ('G', 'T'), ('T', 'X'), ... ]

Where I want to find the longest match. In this example it would be "TCG". Each set will never have more than 2 characters. The best solution I have come up with is generating every single possible string using the combinations of characters (ACGT..., ACGX..., ACTT..., etc.), then using difflab SequenceMatcher.find_longest_match and finding the largest results. I suspect there's a better method, but struggling to find other options. Is there a better way?

1

There are 1 best solutions below

0
kedingt On

Kind of resolved. Not an exact solution, but rather a solution to my larger goal, which allows be to bypass this step. The goal was to produce an alignment, so if the set only had one character per position such as ACGTACGT, the string 'AGTCG' would align as best as possible to produce A-GT-CG-. After reflecting, rather than looking for the largest match in a recursive algorithm, the best method would be to utilize a modified version of the Smith-Waterman algorithm.

https://en.wikipedia.org/wiki/Smith%E2%80%93Waterman_algorithm