Extract gz files within gz files in Python

I have a .gz file that itself contains multiple .gz files, each holding data I want to extract in XML format. As a sort of "path view", it looks like this:

maingzfile/subgzfile1/xmldata1
maingzfile/subgzfile2/xmldata2
maingzfile/subgzfile3/xmldata3
...

Is there a way I can extract all the XML data directly into a new folder?

Thanks in advance.

There is 1 answer below.

Earlee On

You can implement this recursively. The idea is that as long as the extracted file's name still ends in .gz, you extract it again. The function below writes each extracted file next to its source; you can modify it to extract to another location if needed.

import gzip
import shutil
import os

def extract_gz_recursively(gz_file: str):
    # remove .gz ending
    base_name = gz_file[:-3]

    # extract gz file
    with gzip.open(gz_file, 'rb') as file_in:
        with open(base_name, 'wb') as file_out:
            shutil.copyfileobj(file_in, file_out)
    print(base_name + ' file created.')

    # if it's still a gz file, recursively extract
    if base_name.endswith(".gz"):
        extract_gz_recursively(base_name)


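# get all gz archives from a directory
# (PATH_GOES_HERE is a placeholder: replace it with your directory path)
entries = os.scandir(PATH_GOES_HERE)
gz_files = [entry for entry in entries if entry.is_file() and entry.name.endswith(".gz")]

for gz_file in gz_files:
    # pass entry.path rather than entry.name, so this also works
    # when the directory is not the current working directory
    extract_gz_recursively(gz_file.path)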
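
If you want everything to end up in a new folder, one option is to decompress in memory and write only the final payload there. Below is a rough sketch, not a drop-in solution. The names extract_all, peel_gzip and strip_gz_suffixes are made up for illustration, and it assumes the files fit in memory and that the final data (XML here) does not itself start with the gzip magic bytes. Also note that a plain .gz stream holds a single file, so an outer file that really contains several inner files is most likely a .tar.gz. The sketch tries to handle that case with tarfile, but I have not seen your actual file.

```python
import gzip
import io
import tarfile
from pathlib import Path

GZIP_MAGIC = b"\x1f\x8b"  # first two bytes of every gzip stream


def peel_gzip(data: bytes) -> bytes:
    """Decompress repeatedly while the data still starts with the gzip magic bytes."""
    while data[:2] == GZIP_MAGIC:
        data = gzip.decompress(data)
    return data


def strip_gz_suffixes(name: str) -> str:
    """Drop every trailing '.gz', so 'a.xml.gz.gz' becomes 'a.xml'."""
    while name.endswith(".gz"):
        name = name[:-3]
    return name


def extract_all(archive: str, out_dir: str) -> list:
    """Write the fully decompressed contents of `archive` into `out_dir`.

    Returns the list of files written (as Path objects).
    """
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)

    # remove every gzip layer from the outer file
    payload = peel_gzip(Path(archive).read_bytes())

    try:
        # "r:" means: read as an uncompressed tar archive only
        tar = tarfile.open(fileobj=io.BytesIO(payload), mode="r:")
    except tarfile.ReadError:
        # not a tar archive, so the payload itself is the final file
        target = out / strip_gz_suffixes(Path(archive).name)
        target.write_bytes(payload)
        return [target]

    written = []
    with tar:
        for member in tar:
            if not member.isfile():
                continue  # skip directories, links, etc.
            data = peel_gzip(tar.extractfile(member).read())
            # keep only the base name so nothing is written outside out_dir
            target = out / strip_gz_suffixes(Path(member.name).name)
            target.write_bytes(data)
            written.append(target)
    return written
```

You would call it as extract_all("maingzfile.gz", "extracted"), where both names are just placeholders based on the layout in the question. Members are written under their base name only, so two inner files with the same name would overwrite each other.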