awk command to split an 8GB file into multiple files basis number of rows with new filename and header in each file

I have an 8GB file with 26 columns and a header row. I need to split it into multiple files of 400000 lines each, and each file must include the header.

I have tried multiple commands, and although I am getting roughly the desired output, there is one small but weird problem.

After the first line, the header is inserted again at every 50000th line. For example, after using the command below I got the file FileName_28062021_1.txt. If I open this file I can see the header on lines 1, 50001, 100001, and 150001. Not sure how to resolve it. Original command tried:

awk '
    NR==1{header=$0; count=1; print header > "FileName_28062021_" count ".txt"; next }
    !( (NR-1) % 399999){count++; print header > "FileName_28062021_" count ".txt";}
    {print $0 > "FileName_28062021_" count ".txt"}
' FileName_28062021-SourceFile.txt
    
SERVERIF:/data1/tempCheckAWK $ wc -l FileName_28062021-NonSplit.txt
46646575 FileName_28062021-NonSplit.txt

Second awk command tried:

SERVERIF:/data1/tempCheckAWK $ vi tempAWK.sh
awk '
    NR==1 { header = $0 }
    (NR % 400000) == 1 {
        close(out)
        out = "FileName_28062021_" (++count) ".txt"
        print header > out
    }
    NR>1 { print > out }
' FileName_28062021-NonSplit.txt

SERVERIF:/data1/tempCheckAWK $ sh tempAWK.sh
SERVERIF:/data1/tempCheckAWK $ ls -ltr
Jun 10 13:43 FileName_28062021-NonSplit.txt
Jun 28 23:56 tempAWK.sh
Jun 28 23:59 FileName_28062021_1.txt
Jun 28 23:59 FileName_28062021_2.txt

....

SERVERIF:/data1/tempCheckAWK $ wc -l FileName_28062021_1.txt
400000 FileName_28062021_1.txt

SERVERIF:/data1/tempCheckAWK $ grep "Transactions Id" FileName_28062021_1.txt
Transactions Id|Transaction Date|Investment Id|External Code
Transactions Id|Transaction Date|Investment Id|External Code
Transactions Id|Transaction Date|Investment Id|External Code
Transactions Id|Transaction Date|Investment Id|External Code
Transactions Id|Transaction Date|Investment Id|External Code
Transactions Id|Transaction Date|Investment Id|External Code
Transactions Id|Transaction Date|Investment Id|External Code
Transactions Id|Transaction Date|Investment Id|External Code

I have tried other solutions provided on Stack Overflow. Still no luck; the header repeats at every 50000th line.


There are 2 answers below.

user16334809 (accepted answer):
When I executed the command below to check the number of occurrences of the header in the input file, it returned many matches, as shown below. So the issue was not in the AWK command but in the input file itself.

SERVERIF:/data1/tempCheckAWK $ grep -n "Transactions Id" FileName_28062021-NonSplit.txt
    1:Transactions Id|Transaction Date|Investment Id|External Code
    50001:Transactions Id|Transaction Date|Investment Id|External Code
    100001:Transactions Id|Transaction Date|Investment Id|External Code
    150001:Transactions Id|Transaction Date|Investment Id|External Code
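
One way to clean the input before splitting (a minimal sketch, not part of the accepted answer; the -Clean.txt output name is just an example) is to keep the first line and drop every later line that is identical to it:

    awk '
        NR==1 { hdr = $0; print; next }   # remember and keep the real header
        $0 != hdr                         # print only lines that are not header copies
    ' FileName_28062021-NonSplit.txt > FileName_28062021-Clean.txt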
Ed Morton:

Aside from the issue you noticed, your existing script will fail with a syntax error in some awks due to the unparenthesized expression on the right side of output redirection, and it'll fail with a "too many open files" error in some other awks due to not closing the output files as you go.
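
For illustration (this example is added here, not from the original answer), parenthesizing the file-name expression on the right of > removes the ambiguity:

    print header > "FileName_28062021_" count ".txt"     # ambiguous; a syntax error in some awks
    print header > ("FileName_28062021_" count ".txt")   # parenthesized; portable across awks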

Do something like this, untested:

awk '
    NR==1 { header = $0 }
    (NR % 400000) == 1 {
        close(out)
        out = "FileName_28062021_" (++count) ".txt"
        print header > out
    }
    NR>1 { print > out }
' FileName_28062021-SourceFile.txt
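
As a quick sanity check (a sketch added here, assuming the header still begins with "Transactions Id" as shown in the question), count the header occurrences per output file; each file should report exactly 1:

    for f in FileName_28062021_*.txt; do
        printf '%s: %s\n' "$f" "$(grep -c "Transactions Id" "$f")"
    done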

If you didn't want to hard-code parts of the output file name but instead generate it from the input file name, then change:

out = "FileName_28062021_" (++count) ".txt"

to

out = FILENAME
sub(/-[^-.]+/,"_"(++count),out)

or similar.
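
For the input name used above, the regex -[^-.]+ matches "-SourceFile" and is replaced by "_1", "_2", and so on. A quick standalone check (added for illustration):

    $ awk 'BEGIN { out = "FileName_28062021-SourceFile.txt"; sub(/-[^-.]+/, "_" 1, out); print out }'
    FileName_28062021_1.txt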

After more discussion with the OP, the problem of repeated header lines in the output turned out to be due to repeated header lines in the input.