I am expected to solve the Overlap Alignment Problem where I have as: Input: A match reward, a mismatch penalty, an indel penalty, and two nucleotide strings v and w. Output: The score of an optimal overlap alignment of v and w, followed by an alignment of a suffix v' of v and a prefix w' of w achieving this maximum score.
An overlap alignment of strings v = v1 ... vn and w = w1 ... wm is a global alignment of a suffix of v with a prefix of w. An optimal overlap alignment of strings v and w maximizes the global alignment score between an i-suffix of v and a j-prefix of w (i.e., between vi ... vn and w1 ... wj) among all i and j.
def overlapAlignment(v, w, match_reward, mismatch_penalty, indel_penalty):
len1, len2 = len(v), len(w)
S = [[0 for j in range(len2 + 1)] for i in range(len1 + 1)]
backtrack = [[0 for j in range(len2 + 1)] for i in range(len1 + 1)]
maxScore = -float('inf')
maxIndex = None
for i in range(1, len1 + 1):
for j in range(1, len2 + 1):
# Match = match_reward, Mismatch/Indel = indel_penalty
scoreList = [S[i - 1][j - 1] + (match_reward if v[i - 1] == w[j - 1] else - mismatch_penalty),
S[i - 1][j] - indel_penalty, S[i][j - 1] - indel_penalty, 0]
S[i][j] = max(scoreList)
backtrack[i][j] = scoreList.index(S[i][j])
if j == len2:
if S[i][j] >= maxScore:
maxScore = S[i][j]
maxIndex = (i, j)
a, b = maxIndex
align1, align2 = v[:a], w[:b]
while a > 0 and b > 0:
if backtrack[a][b] == 0:
a -= 1
b -= 1
elif backtrack[a][b] == 1:
a -= 1
align2 = align2[:b] + '-' + align2[b:]
else:
b -= 1
align1 = align1[:a] + '-' + align1[a:]
align1 = align1[a:]
align2 = align2[b:]
return str(maxScore), align1, align2
v = "GAGA"
w = "GAT"
match_reward = 1
mismatch_penalty = 1
indel_penalty = 2
score, aligned_v, aligned_w = overlapAlignment(v, w, match_reward, mismatch_penalty, indel_penalty)
print("Alignment Score:", score)
print("Aligned v:", aligned_v)
print("Aligned w:", aligned_w)`
This is a code that I generated however it doesn't work in all instances. I tried modifying it for hours already but did not manage to make it work. Basically the problematic inputs are:
- Input: 1 1 5 ATCACT ATG
- Output: 0 CT AT
and
Input: 2 3 1 CTT AGCATAAAGCATT
Output: 0 --CT-T AGC-AT
There are of course more instances when this code doesn't produce desired outputs but these are the examples I have from the debug dataset.
Example of correct output would be the one for the input that is already in the code and for example: Input: 3 2 1 CAGAGATGGCCG ACG
Output: 5 -CG ACG