I ran a test in which I stored data date by date, partitioned by month, using the loop below. When I read the data back into Python, though, the rows come back in a different order.
for the_date in sorted(all_dates):
    # Write the rows for one trading date at a time.
    temp = data[data.tradedate == the_date].copy()
    temp.to_parquet(
        ParquetFile,
        engine="pyarrow",  # pyarrow is recommended
        compression="gzip",
        partition_cols=["month"],
    )
Reading it back gives:
                 tradedate
tradedate_index
2024-01-23      2024-01-23
2024-01-23      2024-01-23
2024-01-23      2024-01-23
2024-01-23      2024-01-23
2024-01-23      2024-01-23
...
2024-03-06      2024-03-06
2024-03-06      2024-03-06
When I sort it in Python:
data_new.sort_values('tradedate')
I get:
                 tradedate
tradedate_index
2024-01-01      2024-01-01
2024-01-01      2024-01-01
2024-01-01      2024-01-01
2024-01-01      2024-01-01
2024-01-01      2024-01-01
...
2024-03-15      2024-03-15
2024-03-15      2024-03-15
So the data does not come back in the order in which I stored it. I would like to know why this happens and whether it hurts performance.
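To make the mismatch concrete, here is a minimal sketch of how I check whether the rows are in date order and how I restore that order after reading. The small frame is a hypothetical stand-in for data_new (the "value" column and the sample rows are made up for illustration), so treat it as a sketch rather than my real data:

```python
import pandas as pd

# Hypothetical stand-in for data_new: rows are out of date order,
# as they are after reading the partitioned dataset back.
data_new = pd.DataFrame(
    {
        "tradedate": pd.to_datetime(
            ["2024-01-23", "2024-01-01", "2024-03-06", "2024-01-23"]
        ),
        "value": [1, 2, 3, 4],  # made-up column, for illustration only
    }
)

# True only if tradedate never decreases from one row to the next.
in_order_before = data_new["tradedate"].is_monotonic_increasing

# A stable sort restores date order and keeps the relative order
# of rows that share the same tradedate.
restored = data_new.sort_values("tradedate", kind="stable").reset_index(drop=True)
in_order_after = restored["tradedate"].is_monotonic_increasing

print(in_order_before, in_order_after)
```

The stable sort is a deliberate choice here: rows with the same tradedate keep whatever order they had when read, so only the order across dates changes.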