I am testing how different storage formats affect Hive query efficiency (on Windows 10, a single desktop machine). The original data is 400 txt files of roughly equal size, 169 MB in total. I first converted them to ORC format (130 MB), then converted from ORC to Parquet (423 MB) and to SequenceFile (1.87 GB). As I understand it, both Parquet and SequenceFile have some compression features, so why do they end up taking more disk space than the original text format?
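The conversions were done roughly like this (a minimal sketch; test_txt, test_orc, test_parquet, and test_seq are placeholder table names, with test_txt being the table backed by the original text files):

    -- sketch of the conversion path: txt -> orc -> parquet / sequencefile
    CREATE TABLE test_orc STORED AS ORC AS SELECT * FROM test_txt;
    CREATE TABLE test_parquet STORED AS PARQUET AS SELECT * FROM test_orc;
    CREATE TABLE test_seq STORED AS SEQUENCEFILE AS SELECT * FROM test_orc;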
Here is some information that I think is relevant:

    txt:
        inputFormat:  org.apache.hadoop.mapred.TextInputFormat
        outputFormat: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
        compressed:   false
    orc:
        inputFormat:  org.apache.hadoop.hive.ql.io.orc.OrcInputFormat
        outputFormat: org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat
        compressed:   false
    parquet:
        inputFormat:  org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat
        outputFormat: org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat
        compressed:   false
    sequencefile:
        inputFormat:  org.apache.hadoop.mapred.SequenceFileInputFormat
        outputFormat: org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat
        compressed:   false
The above information was obtained with "describe extended table_name". So what happened?
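For what it's worth, I did not explicitly set any compression-related properties. My understanding (an assumption on my part, not something I have verified end to end) is that compression would normally have to be enabled explicitly, along these lines, with SNAPPY as just an example codec:

    -- Parquet: codec is chosen via a table property (assumed example)
    CREATE TABLE test_parquet_snappy STORED AS PARQUET
        TBLPROPERTIES ("parquet.compression"="SNAPPY")
    AS SELECT * FROM test_orc;

    -- SequenceFile: compression is controlled by job-level settings (assumed example)
    SET hive.exec.compress.output=true;
    SET mapred.output.compression.codec=org.apache.hadoop.io.compress.SnappyCodec;
    SET io.seqfile.compression.type=BLOCK;
    CREATE TABLE test_seq_snappy STORED AS SEQUENCEFILE AS SELECT * FROM test_orc;

Is the absence of settings like these the reason for the sizes above, or is something else going on?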