Read in sheet names only from Excel using pyspark.pandas


I have about 30 Excel files that I want to read into Spark dataframes, probably using pyspark.pandas.

I am trying to read them like this:

import pyspark.pandas as ps
my_files = ps.DataFrame(dbutils.fs.ls("abfss://my_container@my_env.dfs.core.windows.net/my_files/"))

dataframes = []

def read_and_format_file(input_path):
  xl = ps.read_excel(input_path, sheet_name = "Aggregate Data USD", skiprows = 5)
  dataframes.append(xl)

my_files['path'].apply(read_and_format_file)

My problem is that I do not know the name of the sheet within each workbook; it can vary. The only pattern I can rely on is that the name will contain 'Data USD'. The first sheet might be called 'Fred Bloggs Data USD', the second 'John Smith Data USD', and so on. So I think I need a way to check the sheet names before calling read_excel, and then only read the sheet I want.

Open to ideas please. Thanks.

1 Answer

Alejandro23

I think you can try the following to get all the sheet names for each file using pandas. You will need to modify your function to take the sheet name as a parameter:

import pandas as pd

for file in dbutils.fs.ls("abfss://my_container@my_env.dfs.core.windows.net/my_files/"):
    # plain pandas exposes ExcelFile; reading an abfss:// path this way assumes fsspec/adlfs are available
    xl = pd.ExcelFile(file.path)
    sheet_names = xl.sheet_names
    read_and_format_file(file.path, sheet_names)
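
Since the question says the target sheet name always contains 'Data USD', you could also filter the sheet names before reading. A minimal sketch of the full flow, reusing the container path and skiprows=5 from the question and assuming each workbook has exactly one matching sheet (dbutils is available in a Databricks notebook):

import pandas as pd
import pyspark.pandas as ps

dataframes = []

def read_and_format_file(input_path, sheet_name):
    # read only the matching sheet with pyspark.pandas, as in the original attempt
    xl = ps.read_excel(input_path, sheet_name=sheet_name, skiprows=5)
    dataframes.append(xl)

for file in dbutils.fs.ls("abfss://my_container@my_env.dfs.core.windows.net/my_files/"):
    # list the workbook's sheets with plain pandas, then keep the one containing 'Data USD'
    sheet_names = pd.ExcelFile(file.path).sheet_names
    matching = [s for s in sheet_names if "Data USD" in s]
    if matching:
        read_and_format_file(file.path, matching[0])

This way read_excel is called once per workbook with only the matching sheet name, so the other sheets are never loaded.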