Creating Databricks Database Snapshot

I have a database in my Databricks environment that is mounted to an AWS S3 location. Is there a way to take a snapshot of the database so that I can store it in a different place and restore it in case of a failure?

John Rotenstein On

Databricks is not like a traditional database where all data is stored "inside" the database. For example, Amazon RDS provides a "snapshot" feature that can dump the entire contents of a database, and the snapshot can then be restored to a new database server if required.

The equivalent in Databricks would be Delta Lake time travel, which allows you to query a table as it was at a previous point in time. Data is not "restored" -- rather, it is simply presented as it was at a given timestamp or version. It is a snapshot without the need to actually create a snapshot.
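As a rough illustration of what a time travel query looks like, here is a minimal sketch that only builds the SQL text. The helper name `time_travel_query` and the table name `my_db.events` are made up for this example; the `TIMESTAMP AS OF` and `VERSION AS OF` clauses are the Delta Lake syntax, and this assumes the tables are Delta tables.

```python
from typing import Optional


def time_travel_query(
    table: str,
    timestamp: Optional[str] = None,
    version: Optional[int] = None,
) -> str:
    """Build a SELECT that reads a Delta table as of a timestamp or a version.

    Hypothetical helper: it returns SQL text only and does not touch a cluster.
    """
    # Exactly one of the two selectors must be given.
    if (timestamp is None) == (version is None):
        raise ValueError("pass exactly one of timestamp or version")
    if timestamp is not None:
        return f"SELECT * FROM {table} TIMESTAMP AS OF '{timestamp}'"
    return f"SELECT * FROM {table} VERSION AS OF {version}"


# In a Databricks notebook (where `spark` is predefined) this could be run as:
#   df = spark.sql(time_travel_query("my_db.events", timestamp="2024-01-01 00:00:00"))
# `DESCRIBE HISTORY my_db.events` lists the versions and timestamps available.
```

Note that this works per table, so "the database at a point in time" means querying each table with the same timestamp.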

From Configure data retention for time travel:

To time travel to a previous version, you must retain both the log and the data files for that version.

The data files backing a Delta table are never deleted automatically; data files are deleted only when you run VACUUM. VACUUM does not delete Delta log files; log files are automatically cleaned up after checkpoints are written.
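If you rely on time travel as your "snapshot", you would probably want to lengthen the retention window. The sketch below builds an `ALTER TABLE` statement that sets the two Delta table properties `delta.logRetentionDuration` and `delta.deletedFileRetentionDuration`. The helper name `retention_statement`, the table name, and the 60-day value are assumptions for illustration, not recommendations.

```python
def retention_statement(table: str, days: int) -> str:
    """Build an ALTER TABLE statement that keeps Delta log and data files
    available for time travel for `days` days.

    Hypothetical helper: it returns SQL text only.
    """
    if days <= 0:
        raise ValueError("days must be positive")
    interval = f"interval {days} days"
    return (
        f"ALTER TABLE {table} SET TBLPROPERTIES ("
        f"'delta.logRetentionDuration' = '{interval}', "
        f"'delta.deletedFileRetentionDuration' = '{interval}')"
    )


# In a Databricks notebook this could be run as:
#   spark.sql(retention_statement("my_db.events", 60))
```

Longer retention means more old files are kept in S3, so storage cost goes up accordingly.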

If, instead, you do want to keep a "snapshot" of the database, a good method would be to create a deep clone of a table, which includes all data. See:

I think you would need to write your own script to loop through each table and perform this operation. It is not as simple as clicking the "Create Snapshot" button in Amazon RDS.
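Such a script could look roughly like the sketch below. It is not a tested backup tool: the function name `deep_clone_statements` and the database names `my_db` and `my_db_snapshot` are made up, it assumes every table is a Delta table, and it assumes the target database already exists (for example, created with a `LOCATION` in a different S3 bucket).

```python
from typing import Iterable, List


def deep_clone_statements(
    source_db: str, target_db: str, tables: Iterable[str]
) -> List[str]:
    """Build one DEEP CLONE statement per table.

    Hypothetical helper: it returns SQL text only, so the caller decides
    how and where to execute the statements.
    """
    return [
        f"CREATE OR REPLACE TABLE {target_db}.{name} DEEP CLONE {source_db}.{name}"
        for name in tables
    ]


# In a Databricks notebook (where `spark` is predefined) the loop could be:
#   rows = spark.sql("SHOW TABLES IN my_db").collect()
#   names = [r.tableName for r in rows if not r.isTemporary]
#   for stmt in deep_clone_statements("my_db", "my_db_snapshot", names):
#       spark.sql(stmt)
```

Because a deep clone copies the data files, re-running the loop on a schedule gives you a separate copy that does not depend on the source bucket.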