I am running a PySpark notebook in an Azure Databricks environment on an autoscaling cluster (2 to 32 workers). I have two dataframes, df1 and df2, and I am concatenating them with Pandas using the code below.
df1 -> 9 columns, around 11 million records
df2 -> exactly the same schema as df1, around 13 million records

I am concatenating them with pd.concat([df1, df2], ignore_index=True) and it works fine. Now, in order to distribute the work, I converted df1 and df2 to Koalas dataframes. But when I concatenate them with ks.concat([df1, df2], ignore_index=True), it always gives the error below.
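For reference, this is roughly what I am doing (df1 and df2 here are small stand-ins for the real 11M/13M-row frames, and I am using the legacy databricks.koalas package imported as ks):

```python
import pandas as pd
import databricks.koalas as ks

# Stand-ins for the real frames; both have the same schema.
df1 = pd.DataFrame({"id": [1, 2, 3], "value": ["a", "b", "c"]})
df2 = pd.DataFrame({"id": [4, 5], "value": ["d", "e"]})

# Pandas: concat takes a list of frames; ignore_index resets the row index.
pdf = pd.concat([df1, df2], ignore_index=True)

# Koalas: convert each frame, then concatenate the same way as a distributed job.
kdf1 = ks.from_pandas(df1)
kdf2 = ks.from_pandas(df2)
kdf = ks.concat([kdf1, kdf2], ignore_index=True)
print(kdf.shape)  # (5, 2) for the stand-in frames
```

The pandas call above completes without issue, but the Koalas call on the full-size frames fails with the error below.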
Job aborted due to stage failure: Task 5 in stage 226.0 failed 4 times, most recent failure: Lost task 5.3 in stage 226.0 (TID 17592) (10.54.144.21 executor 125): org.apache.spark.SparkException: Checkpoint block rdd_593_5 not found! Either the executor that originally checkpointed this partition is no longer alive, or the original RDD is unpersisted. If this problem persists, you may consider using rdd.checkpoint() instead, which is slower than local checkpointing but more fault-tolerant
Any help would be much appreciated.
Thanks, Nikesh