I'm currently working through CS50. For the current assignment we are given a CSV with a name column followed by one column per DNA subsequence (AATG, CCAG, etc.); each value is the number of times that subsequence repeats back to back. We are also given a DNA sequence of letters. I feel pretty confident in my code (though I may have needlessly complicated things). The problem is at the end: when it compares a person's maximum repetitions of a subsequence from the database with the maximum repetitions found in the sequence, the result always comes out as not matching.
I've debugged it and all my variables hold what I expect, and the comparison is looking at the correct two dictionary values. But the != always evaluates to True, even though it shouldn't when both values are equal. Where's my logic error?
Sorry if this is over-explained, but I figured that's better than under-explaining. Stepping through with a debugger shed no light: the two values should compare as equal, but they don't.
Also, assume my longest_match function is correct: it returns the longest run of a subsequence as an int. I store it as e.g. AGTC: 4 (meaning four AGTCs in a row), so runs is a list of dicts: [{AGTC: 4}, {TTAC: 8}, ...]. The code then compares runs[i - 1][key] with the person's value for the same key (also an int representing the longest run of that subsequence). The != comparison is near the end of main.
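One thing worth checking before anything else is the type on each side of the comparison, since a debugger can show '4' and 4 in ways that are easy to read as the same. Below is a minimal sketch, not the real program: `row` and `runs` are hypothetical stand-ins with made-up values. It relies on the fact that csv.DictReader returns every field as a string.

```python
# Hypothetical stand-ins for one database row and the computed runs.
# csv.DictReader yields every field as a str, so the count is '4', not 4.
row = {"name": "Bob", "AGATC": "4"}
runs = [{"AGATC": 4}]  # longest_match returns an int

db_value = row["AGATC"]
seq_value = runs[0]["AGATC"]

# repr() makes the difference between '4' and 4 visible
print(repr(db_value), type(db_value).__name__)
print(repr(seq_value), type(seq_value).__name__)

# A str never compares equal to an int in Python, so != is True here
mismatch = db_value != seq_value

# Converting the database value first compares number to number
mismatch_after_cast = int(db_value) != seq_value
print(mismatch, mismatch_after_cast)
```

If the two printed types differ in the real program, converting one side (for example with int()) before the != would be the thing to try.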
import csv
import sys


def main():

    # Check for command-line usage
    if len(sys.argv) != 3:
        print("Usage: python python.py ____.csv ___.csv")
        sys.exit(1)

    # Read database file into a variable
    databaseName = sys.argv[1]
    DNASequence = sys.argv[2]
    with open(databaseName, 'r') as file:
        reader = csv.DictReader(file)
        people = []
        subSeq = reader.fieldnames
        for row in reader:
            people.append(row)

    # Read DNA sequence file into a variable
    with open(DNASequence, 'r') as file:
        sequence = file.read()

    # Find longest match of each STR in DNA sequence
    runs = []
    for i in range(1, len(subSeq)):
        sub = {subSeq[i]: longest_match(sequence, subSeq[i])}
        runs.append(sub)
    # Really proud of this one ^^^

    # Check database for matching profiles
    for dict1 in people:
        match = True
        for i in range(1, len(subSeq)):
            current_key = subSeq[i]
            if dict1[current_key] != runs[i - 1][current_key]:
                match = False
        if match == True:
            print(f"Match found: {dict1[subSeq[0]]}")
        if match == False:
            print("No match found.")

    return


def longest_match(sequence, subsequence):
    """Returns length of longest run of subsequence in sequence."""

    # Initialize variables
    longest_run = 0
    subsequence_length = len(subsequence)
    sequence_length = len(sequence)

    # Check each character in sequence for most consecutive runs of subsequence
    for i in range(sequence_length):

        # Initialize count of consecutive runs
        count = 0

        # Check for a subsequence match in a "substring" (a subset of characters) within sequence
        # If a match, move substring to next potential match in sequence
        # Continue moving substring and checking for matches until out of consecutive matches
        while True:

            # Adjust substring start and end
            start = i + count * subsequence_length
            end = start + subsequence_length

            # If there is a match in the substring
            if sequence[start:end] == subsequence:
                count += 1

            # If there is no match in the substring
            else:
                break

        # Update most consecutive matches found
        longest_run = max(longest_run, count)

    # After checking for runs at each character in sequence, return longest run found
    return longest_run


main()
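On the worry about overcomplicating things: the list of one-key dicts could probably be collapsed into a single dict keyed by STR name, which removes the `i - 1` index bookkeeping. This is only a sketch under assumptions: `count_longest_run` is a hypothetical, shortened stand-in for longest_match so the block runs on its own, and `strs` and `sequence` are made-up inputs.

```python
def count_longest_run(sequence, subsequence):
    """Hypothetical stand-in for longest_match: longest consecutive run."""
    longest = 0
    step = len(subsequence)
    for i in range(len(sequence)):
        count = 0
        # Count back-to-back copies of subsequence starting at position i
        while sequence[i + count * step : i + (count + 1) * step] == subsequence:
            count += 1
        longest = max(longest, count)
    return longest


# Made-up inputs for illustration only
strs = ["AGATC", "TATC"]
sequence = "GGAGATCAGATCTTTATCTATCTATCAA"

# One dict keyed by STR name instead of a list of single-key dicts
runs = {s: count_longest_run(sequence, s) for s in strs}
print(runs)
```

With this shape, a lookup becomes `runs[current_key]` rather than `runs[i - 1][current_key]`.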
CSV sample file:
name,AGATC,AATG,TATC
Alice,2,8,3
Bob,4,1,5
Charlie,3,2,5
DNA sequence:
AAGGTAAGTTTAGAATATAAAAGGTGAGTTAAATAGAATAGGTTAAAATTAAAGGAGATCAGATCAGATCAGATCTATCTATCTATCTATCTATCAGAAAAGAGTAAATAGTTAAAGAGTAAGATATTGAATTAATGGAAAATATTGTTGGGGAAAGGAGGGATAGAAGG
With these two files, Bob should be the resulting match.
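For anyone trying to reproduce this without the files, here is a small self-contained sketch that reads the sample database from memory and runs the matching step both ways. The `runs` values are assumed to be what longest_match returns for the sequence above (they equal Bob's row, in line with the expected result), and `find_match` is a hypothetical helper, not part of the original program.

```python
import csv
import io

# The sample database from above, read from memory instead of a file
sample_csv = "name,AGATC,AATG,TATC\nAlice,2,8,3\nBob,4,1,5\nCharlie,3,2,5\n"
reader = csv.DictReader(io.StringIO(sample_csv))
strs = reader.fieldnames[1:]  # skip the "name" column
people = list(reader)

# Assumed output of longest_match for the sample sequence (ints)
runs = {"AGATC": 4, "AATG": 1, "TATC": 5}


def find_match(people, runs, strs, convert):
    """Return the first name whose converted counts all equal runs, else None."""
    for person in people:
        if all(convert(person[s]) == runs[s] for s in strs):
            return person["name"]
    return None


# Comparing the raw CSV strings against ints finds nobody
raw_result = find_match(people, runs, strs, convert=lambda value: value)

# Converting each CSV field to int first compares like with like
int_result = find_match(people, runs, strs, convert=int)

print(raw_result, int_result)
```

Returning a single name (or None) also means "No match found." would only need to be printed once, after every person has been checked, rather than once per person.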