How to add '.option("overwriteSchema", "true")' to saveAsTable() in SparkR

618 Views Asked by At

How can I save an R dataframe with SparkR::saveAsTable() again under the same name as an already existing table after changing columns?

I am working with R on databricks and saved an R dataframe table_x as table in the database, using this:

data_x <- SparkR::createDataFrame(table_x)
SparkR::saveAsTable(data_x, tableName="table_x", mode = "overwrite")

Later I added columns to the table and changed some column names as well. When I try to save it again, it does not work and the error message says there is a schema mismatch, even if I removed table_x from the database with "drop table".

If I try

data_x <- SparkR::createDataFrame(table_x)
SparkR::saveAsTable(data_x, tableName="table_x", mode = "error")

the error message says: "The associated location (...) is not empty and also not a Delta table."

So even if I dropped table_x, its location is not empty and its table schema is still there?

Below the error message it says:

To overwrite your schema or change partitioning, please set:
'.option("overwriteSchema", "true")'.

How can I do this in SparkR::saveAsTable? The RDocumentation site says that additional options can be passed to the method, but how exactly would I do this?

1

There are 1 best solutions below

0
Vivek Atal On

Very simple, you just add this argument overwriteSchema = "true" to the function SparkR::saveAsTable. Check out the documentation for ... argument of SparkR::saveAsTable function, which says "additional option(s) passed to the method".

If you are not aware of how ... argument works, check Hadley Wickham's book on Advanced R, https://adv-r.hadley.nz/functions.html#fun-dot-dot-dot

The observation here is that, by default, all the Spark dataframes are saved in Delta format instead of Parquet, and Delta format enforces schema consistency as one of many benefits.

Worked out example:

data_x <- SparkR::createDataFrame(iris)
SparkR::saveAsTable(data_x, tableName="table_x", mode = "overwrite")

# modify dataframe
data_x_mod <- data_x |> 
  SparkR::withColumn("col_x", SparkR::lit(1)) |>
  SparkR::withColumn("Sepal_Length", data_x$Sepal_Length * 2)

# try to overwrite at same location - results in ERROR
SparkR::saveAsTable(data_x_mod, tableName="table_x", mode = "overwrite")
# Error in handleErrors(returnStatus, conn) : 
#  org.apache.spark.sql.AnalysisException: A schema mismatch detected when writing to the Delta table ...

# Add mergeSchema or overwriteSchema option as needed - executes WITHOUT ERROR
SparkR::saveAsTable(data_x_mod, tableName="table_x", mode = "overwrite", overwriteSchema = "true")