Sort FASTA file according to length of sequences

195 Views Asked by At

I am working with a genome sequence of a bird and I have run the genome through RepeatMasker. I want to find the longest sequences in each class of repeats. How do I list my repeats according to the length of the sequences?

>rnd-4_family-127#LTR/ERV1
GTTGCCTTTTTCCCAACCTGGAAATGAAAC[...]
>rnd-4_family-1329#Unknown
TCTATCACTTCGGCCCGCGCCAGGAGTGG [...] 

the > indicates a new seq and I want something like

>rnd-4_family-127#LTR/ERV1
112

I want the length of each sequence like this and then save it in some file. so that I can then sort this file according to the length of each sequence (e.g. order of increasing length)

3

There are 3 best solutions below

0
knittl On

Perhaps something like this, assuming that there are only two alternating types of lines in the file: the name (starting with >) and the sequence. Every sequence must be a single line only and come directly after the name:

awk '
/^>/{name=$0;next}
{print length($0), name, $0}
' file.fasta | sort -n

If the sequence can be split across multiple lines, it becomes a bit trickier, but still managable:

awk '
/^>/&&name{print len,name;len=0}
/^>/{name=$0;next}
{len+=length($0)}
END{print len,name}
' file.fasta | sort -n

Alternatively, if all names are unique:

awk '
/^>/{name=$0;next}
{a[name]+=length($0)}
END{for(name in a){print a[name], name}}
' file.fasta | sort -n
0
Reilas On

"... How do I list my repeats according to the length of the sequences? ..."

Use a class to contain the data.

class Sequence {
    String name, s;

    Sequence(String name, String s) {
        this.name = name;
        this.s = s;
    }

    @Override
    public String toString() {
        return "%s, %s".formatted(s.length(), name);
    }
}

Generate a List.

List<Sequence> l = new ArrayList<>();
try (Scanner in = new Scanner(new File("sequences.txt"))) {
    String s;
    while (in.hasNextLine())
        if ((s = in.nextLine()).startsWith(">"))
            l.add(new Sequence(s, in.nextLine()));
}

Then, sort the List.

l.sort((a, b) -> Integer.compare(b.s.length(), a.s.length()));

Output

30, >rnd-4_family-127#LTR/ERV1
29, >rnd-4_family-1329#Unknown
0
Timur Shtatland On

Do not reinvent the wheel. For common bioinformatics tasks, use open-source tools that are specifically designed for these tasks, are well-tested, widely used, and handle edge cases. For example, use seqkit fx2tab/tab2fx to convert fasta to tab-delimited format, write the lengths of sequences, then use sort to sort by sequence length, and finally convert back to fasta:

 seqkit fx2tab --length in.fa | sort -t $'\t' -k4,4n | seqkit tab2fx > out.fa

Note that seqkit suite of tools can be easily installed, for example, using conda, specifically miniconda, like so:

conda create --channel bioconda --name seqkit seqkit

Example:

input:

>seq1 with blanks
ACGT ACGT ACGT
>seq2 with newlines
ACGT

ACGT

>seq3 without blanks or newlines
ACGT

output:

>seq3 without blanks or newlines
ACGT
>seq2 with newlines
ACGTACGT
>seq1 with blanks
ACGT ACGT ACGT