In GCP, how to count the lines in a file that do not match a specific delimiter count, ignoring the header and trailer (Python/Bash operator)

Example:

Data

HDR|Filename
10|1000|CHN|TVL|TWD
10|1000|CHN|TVL|TWD
10|1000|CHN|TVL|TWD
10|1000|CHN|TVL|TWD
10|1000|CHN|TVL
TRL|Filename

Expected result

The HDR and TRL lines should be ignored.

Count: 1 (because the line 10|1000|CHN|TVL has only 3 delimiters)

I need to know an efficient way to achieve this with Airflow operators.

kiran mathew On

@Mani Shankar.S, based on the link you mentioned in the comment: using the gsutil cat command in a BashOperator, we can pipe the file through awk to check the delimiter count of every line while ignoring the header and trailer.

from airflow.operators.bash import BashOperator

bash_operator = BashOperator(
   task_id='mani_bash',
   # Skip the HDR/TRL lines, print the delimiter count (NF-1) of each data line,
   # and collapse the counts to their distinct values. The file is read only once.
   # 9 is the expected delimiter count; the sample data in the question would use 4.
   bash_command="""counts=$(gsutil cat gs://<bucketname>/<location>/filename.txt | awk -F'|' '!/^(HDR|TRL)[|]/ { print NF-1 }' | sort -u)
if [ "$counts" = "9" ]; then
echo 'right';
else
echo 'not right';
fi""",
)
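
If the actual number of offending lines is needed (the "Count: 1" in the question), the same check can be sketched in plain Python, for example inside a callable used by a PythonOperator. This is only a minimal sketch under assumptions: the function name count_mismatched_lines, its parameters, and the "HDR|" / "TRL|" prefixes are hypothetical and taken from the sample data, and reading the file from GCS is left out.

```python
from typing import Iterable, Tuple


def count_mismatched_lines(
    lines: Iterable[str],
    delimiter: str = "|",
    expected: int = 4,  # assumed delimiter count for the sample data
    skip_prefixes: Tuple[str, ...] = ("HDR|", "TRL|"),  # assumed header/trailer markers
) -> int:
    """Count data lines whose delimiter count differs from `expected`."""
    mismatched = 0
    for raw in lines:
        line = raw.rstrip("\r\n")
        # Ignore blank lines as well as the header and trailer.
        if not line or line.startswith(skip_prefixes):
            continue
        if line.count(delimiter) != expected:
            mismatched += 1
    return mismatched


sample = [
    "HDR|Filename",
    "10|1000|CHN|TVL|TWD",
    "10|1000|CHN|TVL|TWD",
    "10|1000|CHN|TVL|TWD",
    "10|1000|CHN|TVL|TWD",
    "10|1000|CHN|TVL",
    "TRL|Filename",
]
print(count_mismatched_lines(sample))  # 1
```

The function streams over the lines one at a time, so it does not need to hold the whole file in memory.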

Posting the answer as community wiki for the benefit of the community that might encounter this use case in the future.

Feel free to edit this answer for additional information.