Reverse string in specific fields with condition

176 Views Asked by At

I have this file:

m64071_220512_054244/12584899/ccs rev pet047-10055 ACGTGCGACCTTGTGA TTGAGGGTTCAAACGTGCGACCTTGTGA
m64071_220512_054244/128321000/ccs rev pet047-10055 ACGTGCGACCTTGTGA TTGAGGGTTCAAACGTGCGACCTTGTGA
m64071_220512_054244/132186699/ccs fwd pet047-10055 TCACAAGGTCGCACGT TCACAAGGTCGCACGTTTGAACCCTCAA
m64071_220512_054244/134874748/ccs fwd pet047-10055 TCACAAGGTCGCACGT TCACAAGGTCGCACGTTTGAACCCTCAA

I need to tr and reverse the fields $4 and $5 only if $2==rev

Expect:

m64071_220512_054244/12584899/ccs rev pet047-10055 TCACAAGGTCGCACGT TCACAAGGTCGCACGTTTGAACCCTCAA
m64071_220512_054244/128321000/ccs rev pet047-10055 TCACAAGGTCGCACGT TCACAAGGTCGCACGTTTGAACCCTCAA
m64071_220512_054244/132186699/ccs fwd pet047-10055 TCACAAGGTCGCACGT TCACAAGGTCGCACGTTTGAACCCTCAA
m64071_220512_054244/134874748/ccs fwd pet047-10055 TCACAAGGTCGCACGT TCACAAGGTCGCACGTTTGAACCCTCAA

I tried:

perl -lpe 'if(/rev/) {$rev=/rev/;next}; if ($rev) {$F[4,5]=~tr/ATGC/TACG/; $F[4,5]=reverse $F[4,5]; print "@F"}' file 

I also tried to go with Awk (Execute bash command inside awk and print command output)

awk '{
            if($2==rev)
        {
            cmd1="echo \047" $4 "\047 | rev | tr \047ATGC\047 \047TACG\047" 
            cmd2="echo \047" $5 "\047 | rev | tr \047ATGC\047 \047TACG\047"
            newVar1=((cmd1 | getline line) > 0 ? line : "failed") 
            newVar2=((cmd2 | getline line) > 0 ? line : "failed")
            close(cmd)
            print $1, $2, $3, newVar1, newVar2
        }
        else {print}
}' file
5

There are 5 best solutions below

0
zdim On BEST ANSWER

To follow the question's attempt:

perl -w -lanE'
    if ($F[1] eq "rev") { 
        for (@F[3,4]) { tr/ATGC/TACG/; $_ = reverse $_ } 
    } 
    say "@F"
' file

Can put this on one line, I spread it out for readability (or copy-paste it as is into most shells). Or, put it in a program file of course, specially if there is more to do.

The code in the block of that loop over @F[3,4] elements runs on each element of the array and modifies it in place (since $_ is just an alias to array elements), first with tr then by (reversing and) assigning. All that can also be written as

$_ = reverse tr/ATGC/TACG/r  for @F[3,4];

The r modifier makes tr return the changed string, which is then reverse-ed and then assigned back to the currently processed array element, via the $_ alias for it

Comments on the code in the question

  • To have it break up the input string into the @F array ("autosplit") need -a flag

  • Since you are explicitly printing what is needed use -n flag, not -p

  • Fields 4 and 5 in the line are array elements 3 and 4

  • I assume that by $F[4,5] you mean the two array elements (which should be 3,4). That, then, should be @F[3,4] -- and with -w flag, for warnings, we'd hear about it

  • More importantly, we cannot bind a regular expression or a tr pattern to a list, but only to a single scalar. In order to apply that tr to multiple items need to iterate over them, like above


With @ary =~ /.../ the LHS is first evaluated so we get something like

$ary[0], ..., $ary[-1] =~ /.../;

All terms up to the last one are evaluated in void context and discarded (see the comma operator for the logic of it), one by one and with a warning (if warnings are on!), and then $ary[-1] =~ /.../ is evaluated and its return returned from the whole expression.

The same goes for LIST =~ /.../; (with parens around LIST or not)

0
Alex Reynolds On

Here's a way to do this in Python:

#!/usr/bin/env python

import io
import sys

RECORDS_STR = '''m64071_220512_054244/12584899/ccs rev pet047-10055 ACGTGCGACCTTGTGA TTGAGGGTTCAAACGTGCGACCTTGTGA
m64071_220512_054244/128321000/ccs rev pet047-10055 ACGTGCGACCTTGTGA TTGAGGGTTCAAACGTGCGACCTTGTGA
m64071_220512_054244/132186699/ccs fwd pet047-10055 TCACAAGGTCGCACGT TCACAAGGTCGCACGTTTGAACCCTCAA
m64071_220512_054244/134874748/ccs fwd pet047-10055 TCACAAGGTCGCACGT TCACAAGGTCGCACGTTTGAACCCTCAA'''

'''
fast pure-Python reverse complement, courtesy of Devon Ryan
ref. https://bioinformatics.stackexchange.com/a/3585/776
'''
DNA_TABLE = str.maketrans("ACTGactg", "TGACtgac")
def reverse_complement(seq):
    return seq.translate(DNA_TABLE)[::-1]

def main():
    records = io.StringIO(RECORDS_STR) # replace with sys.stdin etc.
    for line in records:
        elems = line.rstrip().split()
        if elems[1] == 'rev':
            elems[3] = reverse_complement(elems[3])
            elems[4] = reverse_complement(elems[4])
        sys.stdout.write('{}\n'.format('\t'.join(elems)))

if __name__ == "__main__":
    main()

Output:

m64071_220512_054244/12584899/ccs   rev pet047-10055    TCACAAGGTCGCACGT    TCACAAGGTCGCACGTTTGAACCCTCAA
m64071_220512_054244/128321000/ccs  rev pet047-10055    TCACAAGGTCGCACGT    TCACAAGGTCGCACGTTTGAACCCTCAA
m64071_220512_054244/132186699/ccs  fwd pet047-10055    TCACAAGGTCGCACGT    TCACAAGGTCGCACGTTTGAACCCTCAA
m64071_220512_054244/134874748/ccs  fwd pet047-10055    TCACAAGGTCGCACGT    TCACAAGGTCGCACGTTTGAACCCTCAA
0
Kaz On

In TXR Lisp, using the awk macro:

(awk ((equal [f 1] "rev")
      (each ((i 3..5))
        (upd [f i] (mapcar (relate "ATGC" "TACG")) reverse)))
     (t))
$ txr rev.tl data
m64071_220512_054244/12584899/ccs rev pet047-10055 TCACAAGGTCGCACGT TCACAAGGTCGCACGTTTGAACCCTCAA
m64071_220512_054244/128321000/ccs rev pet047-10055 TCACAAGGTCGCACGT TCACAAGGTCGCACGTTTGAACCCTCAA
m64071_220512_054244/132186699/ccs fwd pet047-10055 TCACAAGGTCGCACGT TCACAAGGTCGCACGTTTGAACCCTCAA
m64071_220512_054244/134874748/ccs fwd pet047-10055 TCACAAGGTCGCACGT TCACAAGGTCGCACGTTTGAACCCTCAA

$ diff <(txr rev.tl data) expected
$
3
Supertech On

If you want to try Bipython (I highly recommend when you work with nucleic acid sequences), here is one way to do it:

from Bio.Seq import Seq

with open("input_file.txt") as f:
    for line in f:
        line = line.rstrip() # remove new line
        fields = line.split()
        if fields[1] != 'rev':
            print(line)
        else:
            dna_col3 = Seq(fields[3])
            dna_col4 = Seq(fields[4])
            print(' '.join(fields[0:3]), dna_col3.reverse_complement(), dna_col4.reverse_complement())
0
Jesse C. On

Using any awk:

awk '
{if ($2=="rev") {
    {for (i=0;i<=length($4)-1;i++) {{letter = substr($4,length($4)-i,1)}
        {sub(/A/,"T",letter) || sub(/T/,"A",letter)}
        {sub(/G/,"C",letter) || sub(/C/,"G",letter)}
        {f4 = f4 letter}}}
    {for (i=0;i<=length($5)-1;i++) {{letter = substr($5,length($5)-i,1)}
        {sub(/A/,"T",letter) || sub(/T/,"A",letter)}
        {sub(/G/,"C",letter) || sub(/C/,"G",letter)}
        {f5 = f5 letter}}}
    {print $1,$2,$3,f4,f5}
    {f4 = "" }
    {f5 = "" }
} else print $0}
' input.txt