I am generating synthetic data in Python and storing it in ORC files using various compression and encoding techniques. I write the files with PyORC, which lets me choose a compression codec but has no option for encoding. For example, if I want run-length encoding (RLE), there is no setting for it. I also tried applying RLE in my own code before writing, but that does not work either, because the PyORC writer requires binary-type data. Can anyone suggest another Python module, or some other way to control the encoding? This is my current writer method:
import os

import pyorc
from pyorc import CompressionKind, CompressionStrategy, StructRepr
from tqdm import tqdm


# Method of my writer class; self.schema and self.compression_type are set elsewhere.
def write_orc_data(self, num_rows, output_directory, file_name):
    os.makedirs(output_directory, exist_ok=True)
    data = generate_orc_data(num_rows)
    file_path = os.path.join(output_directory, f"{file_name}.orc")

    # Map the configured name to a PyORC compression kind (default: no compression).
    compression = CompressionKind.NONE
    if self.compression_type == 'zlib':
        compression = CompressionKind.ZLIB
    elif self.compression_type == 'snappy':
        compression = CompressionKind.SNAPPY
    elif self.compression_type == 'lzo':
        compression = CompressionKind.LZO

    try:
        # Open the file in its own context so it is closed even if writing fails.
        with open(file_path, 'wb') as orc_file, pyorc.Writer(
            orc_file,
            schema=self.schema,
            compression=compression,
            compression_strategy=CompressionStrategy.SPEED,
            struct_repr=StructRepr.DICT,
        ) as writer:
            with tqdm(total=num_rows, desc=f"Writing ORC Records ({self.compression_type})", unit="records") as pbar:
                for record in data:
                    writer.write(record)
                    pbar.update(1)
        print(f"ORC data written to {file_path} with {self.compression_type} compression")
    except Exception as e:
        print(f"Error writing ORC data: {e}")
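To make the manual RLE attempt mentioned above concrete, here is a minimal sketch of run-length encoding a single column in plain Python. The names `rle_encode` and `rle_decode` and the sample column are hypothetical, used only for illustration; they are not part of PyORC and not necessarily the code that was actually tried.

```python
from itertools import groupby


def rle_encode(values):
    """Collapse consecutive repeats into (value, run_length) pairs."""
    return [(value, sum(1 for _ in run)) for value, run in groupby(values)]


def rle_decode(pairs):
    """Expand (value, run_length) pairs back into the original sequence."""
    return [value for value, count in pairs for _ in range(count)]


# Hypothetical sample column, for illustration only.
column = [7, 7, 7, 3, 3, 9]
encoded = rle_encode(column)
print(encoded)  # [(7, 3), (3, 2), (9, 1)]
assert rle_decode(encoded) == column
```

The encoded output is a list of (value, count) pairs rather than plain values, so it would presumably no longer match a column declared as a simple integer type in `self.schema`. That mismatch is one likely reason a hand-encoded column cannot be passed straight to `writer.write`.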