Hello, I am trying to cluster a pool of 70 DNA sequences (FASTA files, each around 100 bp long, over the alphabet G/C/A/T) so that I can compare the resulting genotype clusters with my phenotype clusters and validate the phenotype results.
How feasible would it be to one-hot encode the sequences and then run a k-means clustering algorithm on the encoded vectors?
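To make the idea concrete, here is a minimal sketch of the encoding step. It assumes the sequences are already aligned and all the same length, which may not hold for real data, and the three toy sequences and the helper names (`one_hot`, `squared_euclidean`, `hamming`) are made up for illustration. It also shows that, for sequences containing only A/C/G/T, the squared Euclidean distance between two one-hot vectors is exactly twice their Hamming distance, so k-means on this encoding would effectively group sequences by mismatch count.

```python
# Sketch only: assumes pre-aligned, equal-length sequences (an assumption,
# not something guaranteed by the FASTA files themselves).
ALPHABET = "ACGT"

def one_hot(seq):
    """Flatten a DNA string into a list of 4 * len(seq) zeros and ones."""
    vec = []
    for base in seq.upper():
        # Any symbol outside ACGT (for example N) becomes four zeros.
        vec.extend(1 if base == letter else 0 for letter in ALPHABET)
    return vec

def squared_euclidean(u, v):
    """Squared Euclidean distance, the quantity k-means minimizes."""
    return sum((a - b) ** 2 for a, b in zip(u, v))

def hamming(s, t):
    """Number of positions at which two equal-length strings differ."""
    return sum(a != b for a, b in zip(s, t))

# Hypothetical toy sequences standing in for the real ~100 bp reads.
seqs = ["ACGTAC", "ACGTTC", "TTGTAC"]
encoded = [one_hot(s) for s in seqs]

for i in range(len(seqs)):
    for j in range(i + 1, len(seqs)):
        d2 = squared_euclidean(encoded[i], encoded[j])
        # For pure A/C/G/T input, d2 equals 2 * Hamming distance.
        print(seqs[i], seqs[j], d2, 2 * hamming(seqs[i], seqs[j]))
```

The `encoded` list could then be handed to any k-means implementation (for example scikit-learn's `KMeans`), with the number of clusters still to be chosen.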
Some problems I've come across with current sequence clustering software (DBSCAN, ALFATCLUST) are that they seem to focus primarily on longer DNA sequences, so they either put all of my sequences into one group or throw the sequences out altogether. Generally speaking, these algorithms are also very sensitive to noise (inherent to DNA sequence data), which often leads to inaccurate clusters.
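The "one group or nothing" behavior can be illustrated with a small sketch. It assumes Hamming distance on aligned, equal-length sequences as the metric (the tools above may use something else), and the six toy sequences and the helper `neighbor_counts` are hypothetical. In DBSCAN, a point with too few neighbors within `eps` cannot start a cluster, so an `eps` that is too small marks everything as noise, while an `eps` that is too large connects everything into a single cluster.

```python
def hamming(s, t):
    """Number of positions at which two equal-length strings differ."""
    return sum(a != b for a, b in zip(s, t))

def neighbor_counts(seqs, eps):
    """For each sequence, count the other sequences within eps mismatches."""
    return [
        sum(1 for j, t in enumerate(seqs) if i != j and hamming(s, t) <= eps)
        for i, s in enumerate(seqs)
    ]

# Hypothetical toy data: two groups of three closely related sequences.
seqs = [
    "AAAAAAAA", "AAAAAAAT", "AAAAAATT",
    "GGGGGGGG", "GGGGGGGC", "GGGGGGCC",
]

for eps in (0, 2, 8):
    # eps=0: no neighbors at all; eps=2: only same-group neighbors;
    # eps=8: every sequence is a neighbor of every other sequence.
    print(eps, neighbor_counts(seqs, eps))
```

Running the same count over the real pairwise distances might show whether any `eps` value separates the sequences at all, or whether the distances are too uniform for a density-based method.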
Any ideas on how I can solve this problem?
I have tried clustering with DBSCAN, k-means, c-means, ALFATCLUST, etc., but I keep getting faulty clusters.