I'm trying to read and write tar.gz files from memory using Python. I have read the relevant Python docs and come up with the following minimal working example to demonstrate my issue.
import io
import tarfile

text = "This is a test."
file_name = "test.txt"
text_buffer = io.BytesIO()
text_buffer.write(text.encode(encoding="utf-8"))
tar_buffer = io.BytesIO()
# Start a tar file with the memory buffer as the "file".
with tarfile.open(fileobj=tar_buffer, mode="w:gz") as archive:
    # We must create a TarInfo object for each file we put into the tar file.
    info = tarfile.TarInfo(file_name)
    text_buffer.seek(0, io.SEEK_END)
    info.size = text_buffer.tell()
    # We have to rewind the text buffer as tarfile.addfile doesn't do this for us.
    text_buffer.seek(0, io.SEEK_SET)
    # Add the text to the tarfile.
    archive.addfile(info, text_buffer)
with open("test.tar.gz", "wb") as f:
    f.write(tar_buffer.getvalue())
# The following command works fine.
# tar -zxvf test.tar.gz
archive_contents = dict()
# Open the tar file for reading, again with the memory buffer as the "file".
with tarfile.open(fileobj=tar_buffer, mode="r:*") as archive:
    for entry in archive:
        entry_fd = archive.extractfile(entry.name)
        archive_contents[entry.name] = entry_fd.read().decode("utf-8")
The odd thing is that extracting the archive with the tar command works completely fine. I see a file test.txt containing the string This is a test..
However, the loop for entry in archive finishes immediately, as if there were no files in the archive. archive.getmembers() returns an empty list.
Another odd thing: when I set mode="r:gz" when opening the byte stream, I get the following exception:
Exception has occurred: ReadError
empty file
tarfile.EmptyHeaderError: empty header
During handling of the above exception, another exception occurred:
File ".../test.py", line 283, in <module>
with tarfile.open(fileobj=tar_buffer, mode="r:gz") as archive:
tarfile.ReadError: empty file
I have also tried creating a test.tar.gz file using the tar command (assuming that there may be some issue in the way I was writing the tar file), but I get the same exception.
I must be missing something basic, but I can't seem to find any examples of this online.
You need to reset the position of tar_buffer to the beginning before you can extract the files. After writing the archive, the buffer's position is at the end of the data, so when tarfile starts reading from there it finds nothing to extract. Calling tar_buffer.seek(0) before the second tarfile.open fixes this.
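Here is a minimal sketch of the round trip with the rewind added. It follows the question's example, except that it sizes the entry with len() on the encoded bytes instead of seeking through a separate text buffer; that shortcut and the payload variable are my own simplification, not something the original code requires.

```python
import io
import tarfile

text = "This is a test."
file_name = "test.txt"
payload = text.encode("utf-8")  # hypothetical helper variable, not in the question

# Write the gzip-compressed archive into an in-memory buffer.
tar_buffer = io.BytesIO()
with tarfile.open(fileobj=tar_buffer, mode="w:gz") as archive:
    info = tarfile.TarInfo(file_name)
    info.size = len(payload)
    archive.addfile(info, io.BytesIO(payload))

# Rewind: after writing, the buffer's position is at the end of the data.
tar_buffer.seek(0)

# Read the archive back from the same buffer.
archive_contents = {}
with tarfile.open(fileobj=tar_buffer, mode="r:gz") as archive:
    for entry in archive:
        entry_fd = archive.extractfile(entry)
        if entry_fd is not None:  # extractfile returns None for non-file members
            archive_contents[entry.name] = entry_fd.read().decode("utf-8")

print(archive_contents)
```

If you would rather not move the position of the original buffer, wrapping the bytes in a fresh buffer with io.BytesIO(tar_buffer.getvalue()) should work as well, since a new BytesIO starts at position 0.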