Generating synthetic data for an ORC file in Python


I am trying to generate synthetic data in Python and store it in ORC files using various compression and encoding techniques. I am using PyORC to write the data, but I cannot find any writer option for choosing the encoding. For example, if I want run-length encoding (RLE), there is no parameter for it. Writing my own RLE code does not work either, because the PyORC writer requires binary-type data to be written. Can anyone suggest another Python module, or some other way to control the encoding?

import os

import pyorc
from pyorc import CompressionKind, CompressionStrategy, StructRepr
from tqdm import tqdm

# Method of my writer class; self.schema and self.compression_type are set elsewhere.
def write_orc_data(self, num_rows, output_directory, file_name):
    os.makedirs(output_directory, exist_ok=True)
    data = generate_orc_data(num_rows)

    file_path = os.path.join(output_directory, f"{file_name}.orc")

    # Fall back to no compression for unrecognized names
    # (CompressionKind.NONE rather than Python's None).
    compression = CompressionKind.NONE
    if self.compression_type == 'zlib':
        compression = CompressionKind.ZLIB
    elif self.compression_type == 'snappy':
        compression = CompressionKind.SNAPPY
    elif self.compression_type == 'lzo':
        compression = CompressionKind.LZO
    try:
        # Open the file in its own with-block so it is closed even if writing fails.
        with open(file_path, 'wb') as orc_file:
            with pyorc.Writer(orc_file,
                              schema=self.schema,
                              compression=compression,
                              compression_strategy=CompressionStrategy.SPEED,
                              struct_repr=StructRepr.DICT) as writer:
                with tqdm(total=num_rows, desc=f"Writing ORC Records ({self.compression_type})", unit="records") as pbar:
                    for record in data:
                        writer.write(record)
                        pbar.update(1)

        print(f"ORC data written to {file_path} with {self.compression_type} compression")
    except Exception as e:
        print(f"Error writing ORC data: {e}")
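
The question does not show generate_orc_data, so here is a minimal sketch of what such a generator might look like. Everything in it is an assumption made for illustration: the field names (id, value, category), the run_length and seed parameters, and the implied schema struct<id:int,value:int,category:string>. As far as I understand the ORC format, column encodings such as run-length encoding for integers and dictionary encoding for strings are chosen inside the writer rather than by the caller, so one practical way to exercise them may be to shape the data itself (long runs, few distinct strings) instead of pre-encoding it.

```python
import random
from typing import Dict, Iterator, Union

Record = Dict[str, Union[int, str]]


def generate_orc_data(num_rows: int, run_length: int = 100, seed: int = 0) -> Iterator[Record]:
    """Yield num_rows dict records with long runs and few distinct strings.

    Hypothetical stand-in for the helper used in the question; the field
    names and parameters are illustrative assumptions, not part of PyORC.
    """
    if run_length < 1:
        raise ValueError("run_length must be at least 1")

    rng = random.Random(seed)  # seeded so repeated calls give identical data
    categories = ["alpha", "beta", "gamma"]
    value = 0

    for i in range(num_rows):
        # Draw a new integer only at the start of each run, so "value"
        # stays constant for run_length consecutive rows.
        if i % run_length == 0:
            value = rng.randint(0, 1000)
        yield {
            "id": i,                                   # strictly increasing
            "value": value,                            # repeated in runs
            "category": categories[(i // run_length) % len(categories)],
        }
```

Because the records are dicts, they match struct_repr=StructRepr.DICT in the writer call above. Varying run_length (for example 1 versus 1000) and comparing the resulting file sizes would give a rough idea of how much the writer's built-in encodings contribute, independently of the compression codec.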