How to prevent PySpark from reading a Parquet file's header record as just another row instead of treating it as the header?


I have a Parquet file with 11 columns. I tried the approaches below in PySpark to read the file, but it still assigns column names like Prop_0, Prop_1, Prop_2 instead of using the first record as the header row.

1.

spark.read.parquet("/FileStore/tables/Order.parquet").show()
dfpq_new = spark.read.format("parquet").load("/FileStore/tables/Order-1.parquet")
dfpq_new = spark.read.format("parquet").option("header", True).option("inferSchema", True).load("/FileStore/tables/Order-1.parquet")

(Screenshot: output shows column headers prop_0, prop_1, … instead of the header names expected from the Parquet file.)

However, when I create a DataFrame, save it as a Parquet file, and then read it back:

data1 = [("Bob", "IT", 4500),
         ("Maria", "IT", 4600),
         ("James", "IT", 3850),
         ("Maria", "HR", 4500),
         ("James", "IT", 4500),
         ("Sam", "HR", 3300),
         ("Jen", "HR", 3900),
         ("Jeff", "Marketing", 4500),
         ("Anand", "Marketing", 2000),
         ("Shaid", "IT", 3850)]
col = ["Name", "MBA_Stream", "SEM_MARKS"]
marks_pq_df = spark.createDataFrame(data1, col)
marks_pq_df.write.parquet("/FileStore/table/markspq.parquet", mode='overwrite')

spark.read.format("parquet").load("/FileStore/table/markspq.parquet").show()

(Screenshot: output shows the column names Name, MBA_Stream, SEM_MARKS read back from the Parquet file.)

I am using Databricks Community Edition.
