So, I'm a beginner learning Spark programming (PySpark) on Databricks.
What am I trying to do?
List all the files in a directory and save them into a DataFrame, so that I can apply filters, sorting, etc. to this list of files. Why? Because I am trying to find the biggest file in my directory.
Why doesn't the code below work? What am I missing?
from pyspark.sql.types import StringType
sklist = dbutils.fs.ls(sourceFile)
df = spark.createDataFrame(sklist,StringType())
OK, actually, I figured it out :). Just wanted to leave the question here in case someone benefits from it.
So basically, the problem was with the schema. Not all the elements in the list were of StringType: dbutils.fs.ls returns FileInfo objects, and the size field is a number (bytes), not a string. So I explicitly created a schema and used it in the createDataFrame function.
Working code -
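(The original working code didn't make it into the post; below is a minimal sketch of the fix described above. It assumes a Databricks notebook, where `spark` and `dbutils` are predefined, and that `sourceFile` holds the directory path from the question.)

```python
from pyspark.sql.types import StructType, StructField, StringType, LongType

# dbutils.fs.ls returns FileInfo objects with path, name, and size fields;
# size is an integer (bytes), which is why a pure StringType schema fails.
schema = StructType([
    StructField("path", StringType(), True),
    StructField("name", StringType(), True),
    StructField("size", LongType(), True),
])

# Convert each FileInfo to a plain tuple matching the schema above.
files = [(f.path, f.name, f.size) for f in dbutils.fs.ls(sourceFile)]
df = spark.createDataFrame(files, schema)

# Biggest file first.
df.orderBy(df.size.desc()).show(1, truncate=False)
```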