I am loading a predefined schema from a JSON file for a specific dataset that I ingest into an Azure Data Lake. The JSON file containing the schema is also stored on the Data Lake.
import json
from pyspark.sql.types import StructType

# Path to the schema JSON stored on the lake
varSchema = 'abfss://landing@[hidden].dfs.core.windows.net/' + parSourceSystemName + '/' + parDatasetName + '.json'

# Read the whole file as a single (path, content) pair and parse it
rdd = spark.sparkContext.wholeTextFiles(varSchema)
text = rdd.collect()[0][1]
schema_dict = json.loads(text)  # renamed from `dict` to avoid shadowing the built-in
dataSchema = StructType.fromJson(schema_dict)
I want to get the number of fields in this schema variable so I can compare it to the number of columns of a DataFrame loaded from a file in my landing container, to determine whether there is a schema change in the new landing data.
If the schema states that there should be 20 fields but the landing data file contains 21, I would know that the source system added a new field.
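As a side note, the expected field count can be read straight from the parsed JSON even before building the `StructType`, since Spark's schema JSON keeps the columns in a top-level `"fields"` array. A minimal sketch with an inline stand-in schema (your notebook would use the text read via `wholeTextFiles` instead):

```python
import json

# Inline stand-in for the schema file on the lake; illustrative only.
schema_text = '''{
  "type": "struct",
  "fields": [
    {"name": "id",   "type": "integer", "nullable": true, "metadata": {}},
    {"name": "name", "type": "string",  "nullable": true, "metadata": {}}
  ]
}'''

schema_dict = json.loads(schema_text)

# Each entry in "fields" is one column definition
expected_field_count = len(schema_dict["fields"])
print(expected_field_count)  # 2
```

This gives you the expected column count without needing a Spark action at all.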
1. Create an empty DataFrame with the schema.
2. Load the actual data into another DataFrame.
3. Get the number of fields in the schema and the number of columns in the landing data DataFrame, and compare them (I assumed here that you want print statements).
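The comparison in step 3 does not need Spark itself: given the schema's field names (`[f.name for f in dataSchema.fields]`) and the landing DataFrame's `df.columns` list, a small helper can report not just that the counts differ but which columns were added or removed. The helper name and sample values below are my own, not from the original post:

```python
def schema_drift(expected_fields, landing_columns):
    """Compare expected field names (from the stored schema) against the
    columns seen in the landing data and report any differences."""
    expected = set(expected_fields)
    actual = set(landing_columns)
    return {
        "added": sorted(actual - expected),    # new columns in the landing file
        "removed": sorted(expected - actual),  # columns missing from the landing file
        "changed": expected != actual,
    }

# Hypothetical example: the source system added a column called 'new_col'
report = schema_drift(["id", "name"], ["id", "name", "new_col"])
print(report)  # {'added': ['new_col'], 'removed': [], 'changed': True}
```

In the notebook you would call it as `schema_drift([f.name for f in dataSchema.fields], landingDF.columns)` and branch on `report["changed"]`.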