I am working with a genome sequence of a bird and I have run the genome through RepeatMasker. I want to find the longest sequences in each class of repeats. How do I list my repeats according to the length of the sequences?
>rnd-4_family-127#LTR/ERV1
GTTGCCTTTTTCCCAACCTGGAAATGAAAC[...]
>rnd-4_family-1329#Unknown
TCTATCACTTCGGCCCGCGCCAGGAGTGG [...]
The > indicates a new sequence, and I want something like
>rnd-4_family-127#LTR/ERV1
112
I want the length of each sequence like this, saved to a file, so that I can then sort the file by sequence length (e.g. in order of increasing length).
Perhaps something like this, assuming that there are only two alternating types of lines in the file: the name (starting with >) and the sequence. Every sequence must be a single line only and come directly after the name.

If the sequence can be split across multiple lines, it becomes a bit trickier, but still manageable:
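A minimal Python sketch of that idea, not necessarily the approach originally posted here. It adds up the lengths of all lines belonging to a record, so it handles both the single-line and the multi-line layout. The function names `fasta_lengths` and `write_sorted_lengths` are made up for this illustration, and the output layout (name on one line, length on the next) simply follows the example in the question.

```python
def fasta_lengths(lines):
    """Yield (header, length) per record; a sequence may span several lines."""
    header, length = None, 0
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            if header is not None:
                yield header, length  # emit the record that just ended
            header, length = line, 0
        elif header is not None:
            length += len(line)  # add this chunk of sequence
    if header is not None:
        yield header, length  # emit the final record


def write_sorted_lengths(fasta_path, out_path):
    """Write each header followed by its length, in order of increasing length."""
    with open(fasta_path) as handle:
        records = sorted(fasta_lengths(handle), key=lambda rec: rec[1])
    with open(out_path, "w") as out:
        for header, length in records:
            out.write(f"{header}\n{length}\n")
```

You would call it as `write_sorted_lengths("repeats.fa", "lengths.txt")`, where both file names are placeholders. Pass `reverse=True` to `sorted` if you want the longest sequences first.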
Alternatively, if all names are unique:
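One way to sketch this in Python is to keep a dictionary keyed on the sequence name, which only works if no name occurs twice. The helper names below are invented for this example. `longest_per_class` additionally assumes that the repeat class is whatever follows the `#` in the name (as in `rnd-4_family-127#LTR/ERV1`); that is a guess based on the headers shown in the question.

```python
def lengths_by_name(lines):
    """Map each sequence name (without '>') to its total length.

    Assumes every name is unique; duplicate names would be summed together.
    """
    lengths = {}
    name = None
    for line in lines:
        line = line.strip()
        if line.startswith(">"):
            name = line[1:]
            lengths.setdefault(name, 0)  # register the record even if empty
        elif name is not None:
            lengths[name] += len(line)
    return lengths


def longest_per_class(lengths):
    """Return {class: (name, length)} holding the longest sequence per class."""
    best = {}
    for name, length in lengths.items():
        # Assumption: the class is the text after '#' in the name.
        repeat_class = name.split("#", 1)[1] if "#" in name else "Unclassified"
        if repeat_class not in best or length > best[repeat_class][1]:
            best[repeat_class] = (name, length)
    return best
```

With the dictionary in hand, `sorted(lengths.items(), key=lambda kv: kv[1])` gives the names in order of increasing length, and `longest_per_class(lengths)` answers the "longest in each class" part directly.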