Read in sheet names only from Excel using pyspark.pandas


I have about 30 Excel files that I want to read into Spark dataframes, probably using pyspark.pandas.

I am trying to read them like this:

import pyspark.pandas as ps
my_files = ps.DataFrame(dbutils.fs.ls("abfss://my_container@my_env.dfs.core.windows.net/my_files/"))

dataframes = []

def read_and_format_file(input_path):
  xl = ps.read_excel(input_path, sheet_name = "Aggregate Data USD", skiprows = 5)
  dataframes.append(xl)

my_files['path'].apply(read_and_format_file)

My problem is that I do not know the name of the sheet within each workbook; it can vary. The only pattern I can rely on is that the name will contain 'Data USD'. The first sheet might be called 'Fred Bloggs Data USD', the second 'John Smith Data USD', and so on. So I think I need a way to check the sheet names before calling read_excel, and then only read the sheet I want.

Open to ideas please. Thanks.

1 Answer

Alejandro23

I think you can try the following to get all the sheet names for each file using pandas. You will need to modify your function to take the sheet name as a parameter:

import pandas as pd

for file in dbutils.fs.ls("abfss://my_container@my_env.dfs.core.windows.net/my_files/"):
    # plain pandas exposes ExcelFile; reading an abfss:// path this way assumes fsspec/adlfs are available
    xl = pd.ExcelFile(file.path)
    sheet_names = xl.sheet_names
    read_and_format_file(file.path, sheet_names)
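
Since the question says the target sheet name always contains 'Data USD', you could also filter the sheet names before reading. A minimal sketch of the full flow, reusing the container path and skiprows=5 from the question and assuming each workbook has exactly one matching sheet (dbutils is available in a Databricks notebook):

import pandas as pd
import pyspark.pandas as ps

dataframes = []

def read_and_format_file(input_path, sheet_name):
    # read only the matching sheet with pyspark.pandas, as in the original attempt
    xl = ps.read_excel(input_path, sheet_name=sheet_name, skiprows=5)
    dataframes.append(xl)

for file in dbutils.fs.ls("abfss://my_container@my_env.dfs.core.windows.net/my_files/"):
    # list the workbook's sheets with plain pandas, then keep the one containing 'Data USD'
    sheet_names = pd.ExcelFile(file.path).sheet_names
    matching = [s for s in sheet_names if "Data USD" in s]
    if matching:
        read_and_format_file(file.path, matching[0])

This way read_excel is called once per workbook with only the matching sheet name, so the other sheets are never loaded.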