I would like to get a fastq file comprised of fastq reads that are in F1.fastq but are not in F2.fastq.
The following code was offered as a solution here.
needed_reads = []
reads_array = []
chosen_array = []
for x in SeqIO.parse("F1.fastq","fastq"):
reads_array.append(x)
for y in SeqIO.parse("F2.fastq","fastq"):
chosen_array.append(y)
needed_reads = list(set(reads_array) - set(chosen_array))
output_handle = open("DIFF.fastq","w")
SeqIO.write(needed_reads,output_handle,"fastq")
output_handle.close()
However, it no longer works and produces a "unhashable type: SeqRecord" error when the set() function is run on the output of SeqIO.parse(). SeqIO.parse() still produces a list type object according to the type() function. However, that list appears to be unhashable. How can this be fixed?
Briefly, instead of
BioPython, useseqtkand UNIX tools, as shown here:How to remove a list of reads from fastq file?
First, use
grepto extract read ids from fastq file. Usecommto extract only the read names to keep:To extract reads from fastq files by IDs, use
seqtk subseq:It works with
fastafiles as well.Alternatively, use
BBMap filterbyname.sh, see docs:See also:
To install these tools, use
conda, specificallyminiconda, for example:Here, GNU
grepuses the following options:-P: Use Perl regexes.-o: Print the matches only (1 match per line), not the entire lines.\K: Cause the regex engine to "keep" everything it had matched prior to the\Kand not include it in the match. Specifically, ignore the preceding part of the regex when printing the match.See also:
grepmanualseqtkcondaconda createFiles used in the above example: