How do I exclude a filename that contains '_All_Data' when using the Data Transfer feature on BigQuery?

307 Views Asked by At

I'm using Google Cloud Storage as a source and BigQuery as the destination for a Data Transfer feature that's available in BigQuery.So, on the data source details when creating a new data transfer I need to input the 'Cloud Storage URI', but I don't know how to exclude files with a certain text such as '_All_Data' on the file name.

I've tried to use the pattern bucket/filepath1/2023/*/*/filename_2023*!_All_Data.csv which was provided by ChatGPT, but it didn't work. I was expecting that I would be able to extract all the files in 2023 with the filename starting with 'filename_2023', and exclude all the files that contains '_All_Data'.

Basically, my expectations is that I would be able to extract files such as:

  • bucket/filepath1/2023/05/03/filename_20230503.csv
  • bucket/filepath1/2023/05/02/filename_20230502.csv
  • bucket/filepath1/2023/03/03/filename_20230303.csv

And exclude files such as:

  • bucket/filepath1/2023/05/03/filename_20230503_All_Data.csv
  • bucket/filepath1/2023/05/02/filename_20230502_All_Data.csv
  • bucket/filepath1/2023/03/03/filename_20230303_All_Data.csv
1

There are 1 best solutions below

5
Bihag Kashikar On

give this a try. I checked this worked. basically you will have to copy the required files to another bucket and then use data transfer service from this new bucket.

the grep -v 'All' will ensure the file names having All in it will be excluded and other files will be copied to target-gcs-bucket

gsutil ls gs://your-source-gcs-bucket/*.* | grep -v 'All' | gsutil cp -I gs://your-target-gcs-bucket/