My code kind of works, but it's selective about which inputs it handles. It gives the correct name for some sequences, but gets others wrong.
For example, it correctly identifies that a strand belongs to Bob, but it matches a strand that should be "No match" to "Charlie", who doesn't even exist in the list CS50 gave us.
It's really weird. I've checked my code against other people's and they seem mostly similar. I don't know why this is happening, so any help would be appreciated.
import csv
import sys


def main():

    # TODO: Check for command-line usage
    if len(sys.argv) != 3:
        sys.exit("Usage: python dna.py data.csv sequence.txt")

    # TODO: Read database file into a variable
    database = []
    with open(sys.argv[1], 'r') as file:
        reader = csv.DictReader(file)
        for row in reader:
            database.append(row)

    # TODO: Read DNA sequence file into a variable
    with open(sys.argv[2], 'r') as file:
        dna_sequence = file.read()

    # TODO: Find longest match of each STR in DNA sequence
    subsequences = list(database[0].keys())[1:]
    results = {}
    for subsequence in subsequences:
        match = 0
        results[subsequence] = longest_match(dna_sequence, subsequence)
        match += 1

    # TODO: Check database for matching profiles
    for person in database:
        for subsequence in subsequences:
            if int(person[subsequence]) == results[subsequence]:
                match += 1
        if match == len(subsequence):
            print(person["name"])
            return
    print("No match")
    return


def longest_match(sequence, subsequence):
    """Returns length of longest run of subsequence in sequence."""

    # Initialize variables
    longest_run = 0
    subsequence_length = len(subsequence)
    sequence_length = len(sequence)

    # Check each character in sequence for most consecutive runs of subsequence
    for i in range(sequence_length):

        # Initialize count of consecutive runs
        count = 0

        # Check for a subsequence match in a "substring" (a subset of characters) within sequence
        # If a match, move substring to next potential match in sequence
        # Continue moving substring and checking for matches until out of consecutive matches
        while True:

            # Adjust substring start and end
            start = i + count * subsequence_length
            end = start + subsequence_length

            # If there is a match in the substring
            if sequence[start:end] == subsequence:
                count += 1

            # If there is no match in the substring
            else:
                break

        # Update most consecutive matches found
        longest_run = max(longest_run, count)

    # After checking for runs at each character in sequence, return longest run found
    return longest_run


main()

Are you still working on this? If so, there are 2 databases and 20 sequences to test. (They are listed with correct answers at the end of the DNA PSET.) Which one gives you the error above? I suspect it is the 3rd test. It says:
Run your program as `python dna.py databases/small.csv sequences/3.txt`. Your program should output `No match`.
When I do this, your program outputs `Charlie` instead of `No match`.
The subsequences you need to check are: `['AGATC', 'AATG', 'TATC']`
Your subsequence count is: `{'AGATC': 3, 'AATG': 3, 'TATC': 5}`
That doesn't match anyone in the `small.csv` file. Charlie is close, but his DNA subsequence count is: `('AGATC', '3'), ('AATG', '2'), ('TATC', '5')`
The error occurs when you compare each person to the subsequence counts. There are 3 things to fix:
1. `match` is set in the previous loop (`for subsequence in subsequences:`). It needs to be in the `for person in database:` loop.
2. `match` needs to be modified. (This is inside the 2nd `for subsequence in subsequences:` loop.)
3. The check of `match` against `len(subsequence)`. Think about it....
I made those changes and it works for all 4 `small.csv` tests and the 3 `large.csv` tests I tried.
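
For reference, here is a minimal sketch of one way the comparison step could look with those three points addressed. It is not necessarily the exact change I made: `find_match` is a hypothetical helper name, and the Alice and Bob rows are assumed values for illustration (only Charlie's counts are quoted above).

```python
def find_match(database, results):
    """Return the name of the first person whose STR counts all equal results, else None."""
    subsequences = list(results.keys())
    for person in database:
        # Reset the counter for every person so matches do not carry over
        match = 0
        for subsequence in subsequences:
            # CSV values are strings, so convert before comparing
            if int(person[subsequence]) == results[subsequence]:
                match += 1
        # Compare against the number of STRs, not the length of one STR string
        if match == len(subsequences):
            return person["name"]
    return None


# Rows shaped like csv.DictReader output (values are strings).
# Alice and Bob are assumed values; Charlie's row is the one quoted above.
database = [
    {"name": "Alice", "AGATC": "2", "AATG": "8", "TATC": "3"},
    {"name": "Bob", "AGATC": "4", "AATG": "1", "TATC": "5"},
    {"name": "Charlie", "AGATC": "3", "AATG": "2", "TATC": "5"},
]

print(find_match(database, {"AGATC": 3, "AATG": 3, "TATC": 5}) or "No match")  # No match
print(find_match(database, {"AGATC": 4, "AATG": 1, "TATC": 5}) or "No match")  # Bob
```

Note the difference between `len(subsequence)` (the number of characters in one STR, e.g. 4 for `TATC`) and `len(subsequences)` (the number of STRs to check, 3 for `small.csv`). With the counter carried over between people, Charlie's two real matches were enough to reach 4.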