Split a line into multiple lines and extract a pattern from each token in a file


I have many .txt files that look like this:

file1.txt

header
1_fff_aaa 1_rrr_aaa 1_ggg_aaa ...

file2.txt

header
1_ttt_aaa 1_iii_aaa 1_lll_aaa ...

I would like to remove the header, split the second line into separate lines at each whitespace, and keep only the text between the two _ characters of each token:

Output:

file1_v1.txt

fff
rrr
ggg

file2_v1.txt

ttt
iii
lll

I would like to use Unix commands such as sed.

Arnaud Valmary On BEST ANSWER

Something like this:

Program: split.awk

NR == 1 {
    # ignore first header line
    next
}
{
    i=1
    while (i <= NF) {
        gsub(/^[^_]*_/, "", $i)
        gsub(/_[^_]*$/, "", $i)
        print $i
        i++
    }
}

Run it like this:

awk -f split.awk file1.txt > file1_v1.txt

To execute it on many files:

for f in file*.txt; do echo "$f"; awk -f split.awk "$f" > "${f%.txt}_v1.txt" ; done

UPDATE

You could also use sed & tr:

sed -n '2,$p' file1.txt | tr " " "\n" | sed 's/^[^_]*_\(.*\)_[^_]*$/\1/'
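
For many files, the same pipeline can be wrapped in the loop used for the awk version. This is only a minimal sketch using the sample data from the question; the scratch directory held in tmp is an assumption made to keep the example self-contained:

```shell
# Scratch copies of the sample files from the question (tmp is illustrative only).
tmp=$(mktemp -d)
printf 'header\n1_fff_aaa 1_rrr_aaa 1_ggg_aaa\n' > "$tmp/file1.txt"
printf 'header\n1_ttt_aaa 1_iii_aaa 1_lll_aaa\n' > "$tmp/file2.txt"

for f in "$tmp"/file*.txt; do
    # Skip the header, put one token per line, keep the part between the underscores.
    sed -n '2,$p' "$f" | tr ' ' '\n' | sed 's/^[^_]*_\(.*\)_[^_]*$/\1/' > "${f%.txt}_v1.txt"
done

cat "$tmp/file1_v1.txt" "$tmp/file2_v1.txt"
```

Note that the glob file*.txt would also match the generated file1_v1.txt on a second run, so either clean up the outputs first or use a narrower pattern.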
potong On

This might work for you (GNU sed):

sed -i '1d;s/\s\+/\n/g;s/^[^_]*_//mg;s/_.*//mg' file1 file2 file3 ...

The -i command-line option edits the files in place.

Delete the first line of each file (remove the header).

Replace each run of whitespace with a newline. This puts each token on a separate line within the pattern space.

Remove the first part of each line, up to and including the first _, for all lines in the pattern space.

Remove everything from the first remaining _ to the end of each line, leaving the result.

N.B. The -i option may be replaced by the -s option if you only need the output on stdout from one or more files. Also note the m flag on the last two substitution commands: it makes ^ and $ match at the start and end of every line in the pattern space, so that, together with g, the substitution is applied to each of the newly created lines.
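
To try the steps above without touching any files, the -s variant can be run on scratch copies. This is a sketch that assumes GNU sed (\s, \n in the replacement, and the m flag are GNU extensions) and uses the sample data from the question; tmp and out are names chosen here for illustration:

```shell
# Scratch copies of the sample files from the question (tmp is illustrative only).
tmp=$(mktemp -d)
printf 'header\n1_fff_aaa 1_rrr_aaa 1_ggg_aaa\n' > "$tmp/file1.txt"
printf 'header\n1_ttt_aaa 1_iii_aaa 1_lll_aaa\n' > "$tmp/file2.txt"

# -s treats each file separately, so 1d drops the header of both files;
# nothing is modified on disk because -i is not given.
out=$(sed -s '1d;s/\s\+/\n/g;s/^[^_]*_//mg;s/_.*//mg' "$tmp/file1.txt" "$tmp/file2.txt")
printf '%s\n' "$out"
```

Once the output looks right, swapping -s for -i applies the same edit to the files themselves.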

To change the output file names, employ GNU parallel:

parallel --plus "sed '1d;y/ /\n/;s/^[^_]*_//mg;s/_.*//mg' {} > {.}_v1.{+.}" ::: file1.txt file2.txt ...
Ed Morton On

I wouldn't normally answer a question where the OP hasn't shown any attempt to solve their problem themselves, but since there are multiple answers already...

Using any awk:

$ cat tst.awk
BEGIN { FS="_" }
FNR == 1 {
    close(out)
    out = FILENAME
    sub(/\.txt$/,"_v1&",out)
    next
}
{
    for ( i=2; i<=NF; i+=2 ) {
        print $i > out
    }
}

$ awk -f tst.awk file{1,2}.txt

$ head file{1,2}_v1.txt
==> file1_v1.txt <==
fff
rrr
ggg

==> file2_v1.txt <==
ttt
iii
lll
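
The script works because, with FS="_", the separators between tokens end up inside the odd-numbered fields, so the wanted values land in fields 2, 4, 6 and so on. A quick way to see the split, as a sketch using the sample line from the question:

```shell
# Print every field with its index when "_" is the field separator.
printf '1_fff_aaa 1_rrr_aaa 1_ggg_aaa\n' |
    awk -F_ '{ for (i = 1; i <= NF; i++) printf "%d=[%s]\n", i, $i }'
# 1=[1]
# 2=[fff]
# 3=[aaa 1]
# 4=[rrr]
# 5=[aaa 1]
# 6=[ggg]
# 7=[aaa]
```

This relies on every token containing exactly two _ characters; a token with more or fewer would shift the even/odd alignment.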