How to keep database and object store consistent to avoid orphan objects?


I am writing an online text editor. I want to allow users to add inline images and video to the document. I am struggling to implement this in a reliable way.

Current infrastructure:

  • Database (Postgres) of documents (text, title, author, list of media objects referencing S3)
  • Object store (S3) where the images/video/files are stored

The current flow:

  1. User creates a new document
  2. User makes changes, but doesn't save it. These changes are stored in localStorage so they are not lost on refresh.
  3. The user attaches an image
  4. The image displays a loading indicator as it is uploaded to S3 (or equivalent)
  5. The user saves the document, and the data is saved to the database. The media objects themselves are not stored in the database, only their S3 URLs.

Problem

  • If the user deletes the document before saving, or if saving fails, there will be orphan files in S3 that are not referenced by any documents.
  • A "delete document" action must now delete something from both Postgres and S3. Since you cannot run a transaction across two completely different services, the Postgres delete can succeed while the S3 delete fails, creating more orphan objects.

Attempts at solutions

  • I tried storing the media in localStorage and committing it all when the document is saved. This would solve the issue, but localStorage is limited to 5-10 MB, which is too small.
  • I considered a reaper daemon that queries the S3 references in the database and cross-references them with the objects stored in S3 to find orphan objects, which it would automatically delete.

The reaper daemon would work, but it feels like a hack. I really don't want to manage an entirely new service just to store some files. Is there a better way to do this? What is the industry standard?

If it matters, I'm using React+Typescript and the text editor is built upon DraftJS.

Answer by Kinrany:

Here's the solution to the core problem of keeping the database and the object store consistent.

First, a couple of general rules:

  1. The database is the source of truth. If the object store disagrees with the database, the object store is wrong.
  2. Distributed consistency is easy as long as facts are only ever created, never deleted. See the Keeping CALM paper.

The database stores the following information about the objects:

  1. A unique ID. (Not a hash: two uploads of the same file must get two different IDs. Deduplicating objects via content-addressing is out of scope.)
  2. Upload timestamp. This is set after uploading the object to the object store under the object's ID.
  3. Deletion timestamp. This is set before deleting the object from the object store.

The timestamps are optional but immutable once set.

The object can be used for as long as it has an upload timestamp and doesn't have a deletion timestamp.
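
As a minimal sketch of the record described above, the fields and the usability rule could look like this in TypeScript. The names `MediaRecord`, `uploadedAt`, `deletedAt`, `isUsable` and `withUploadTimestamp` are made up for illustration and are not part of any library.

```typescript
// Hypothetical shape of one row in a media-objects table.
interface MediaRecord {
  id: string;              // unique ID, not a content hash
  uploadedAt: Date | null; // set after the upload to the object store succeeds
  deletedAt: Date | null;  // set before the object is removed from the object store
}

// An object may be used only while it is uploaded and not yet marked deleted.
function isUsable(record: MediaRecord): boolean {
  return record.uploadedAt !== null && record.deletedAt === null;
}

// Timestamps are immutable once set: setting one again keeps the first value.
function withUploadTimestamp(record: MediaRecord, now: Date): MediaRecord {
  return record.uploadedAt === null ? { ...record, uploadedAt: now } : record;
}
```

In a real schema the same rule could be enforced in the query that loads a document's media, so that rows without an upload timestamp or with a deletion timestamp are never served.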

It effectively goes through the following states:

  1. Database: doesn't exist. Object Store: doesn't exist.
  2. DB: assigned ID. OS: doesn't exist.
  3. DB: assigned ID. OS: exists.
  4. DB: uploaded. OS: exists.
  5. DB: deleted. OS: exists.
  6. DB: deleted. OS: doesn't exist.
  7. DB: doesn't exist. OS: doesn't exist.
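
To make the state list concrete, here is a rough sketch that maps a database record plus "does the object exist in the store" onto those states. The state names and the `lifecycle` function are my own labels, not anything standard. Combinations the protocol is meant to never produce (for example an object in the store with no record, which is exactly an orphan) are reported as `"unexpected"`.

```typescript
interface MediaRecord {
  id: string;
  uploadedAt: Date | null;
  deletedAt: Date | null;
}

type Lifecycle =
  | "absent"             // states 1 and 7
  | "id-assigned"        // state 2
  | "upload-unconfirmed" // state 3
  | "usable"             // state 4
  | "marked-deleted"     // state 5
  | "object-removed"     // state 6
  | "unexpected";        // not reachable if the operations run in the listed order

function lifecycle(record: MediaRecord | undefined, inStore: boolean): Lifecycle {
  // No record: either nothing exists anywhere, or the store holds an orphan.
  if (record === undefined) return inStore ? "unexpected" : "absent";
  // A deletion timestamp wins over everything else.
  if (record.deletedAt !== null) return inStore ? "marked-deleted" : "object-removed";
  // Uploaded and not deleted: the object must be present in the store.
  if (record.uploadedAt !== null) return inStore ? "usable" : "unexpected";
  // Only an ID so far; the upload may or may not have landed yet.
  return inStore ? "upload-unconfirmed" : "id-assigned";
}
```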

The application needs to perform two operations here: creating an object and deleting an object. Both are idempotent.

Creation:

  1. Assign ID.
  2. Upload object to object store.
  3. Add an upload timestamp.
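
The creation steps could be sketched roughly like this. To keep the example runnable on its own, the database and the object store are plain in-memory `Map`s standing in for Postgres and S3. A real version would be asynchronous and use actual clients. `createObject`, `Db`, `ObjectStore` and `createdAt` are hypothetical names, and the guard against reviving a deleted record is my own assumption.

```typescript
interface MediaRecord {
  id: string;
  createdAt: Date;         // extra timestamp, set when the ID is assigned
  uploadedAt: Date | null;
  deletedAt: Date | null;
}

// In-memory stand-ins (assumptions): Db for the Postgres table, ObjectStore for the bucket.
type Db = Map<string, MediaRecord>;
type ObjectStore = Map<string, string>;

function createObject(db: Db, store: ObjectStore, id: string, body: string, now: Date): MediaRecord {
  // Step 1: assign the ID. A retry with the same ID keeps the existing row.
  let record = db.get(id);
  if (record === undefined) {
    record = { id, createdAt: now, uploadedAt: null, deletedAt: null };
    db.set(id, record);
  }
  // Assumption: a record that is already marked deleted is never revived.
  if (record.deletedAt !== null) return record;

  // Step 2: upload under the object's ID. Re-uploading the same body is harmless.
  store.set(id, body);

  // Step 3: add the upload timestamp, but never overwrite an existing one.
  if (record.uploadedAt === null) {
    record = { ...record, uploadedAt: now };
    db.set(id, record);
  }
  return record;
}
```

If the process dies between steps 2 and 3, the row is left without an upload timestamp, so the object is never considered usable and can be cleaned up later.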

Deletion:

  1. Add a deletion timestamp.
  2. Delete object from object store.
  3. Delete the record from the database.
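
Here is a small sketch of the deletion steps under the same kind of assumptions: `Db` is an in-memory `Map` standing in for the Postgres table, and `ObjectStore` is a hypothetical one-method interface standing in for the S3 client. It assumes that deleting a key that is already gone does not fail, which, as far as I know, matches how S3 treats a delete of a missing key. A real version would be asynchronous.

```typescript
interface MediaRecord {
  id: string;
  uploadedAt: Date | null;
  deletedAt: Date | null;
}

type Db = Map<string, MediaRecord>;

// Hypothetical minimal object-store client; delete() may throw on failure.
interface ObjectStore {
  delete(id: string): void;
}

function deleteObject(db: Db, store: ObjectStore, id: string, now: Date): void {
  const record = db.get(id);
  if (record === undefined) return; // already fully deleted, nothing to do

  // Step 1: add the deletion timestamp before touching the object store.
  if (record.deletedAt === null) {
    db.set(id, { ...record, deletedAt: now });
  }

  // Step 2: delete the object. If this throws, the row stays marked as
  // deleted and the whole operation can simply be retried later.
  store.delete(id);

  // Step 3: only after the object is gone, remove the row.
  db.delete(id);
}
```

The ordering is the whole point: a failure at any step leaves the database saying "deleted", so the object is never served and the remaining cleanup can be repeated safely.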

Two destructive updates happen here:

  1. the object is deleted from the object store. This is pure cleanup, as the database already considers the object to be deleted.
  2. the record is deleted from the database. At this point the system is no longer distributed as the object store no longer knows about the object.

Finally, do lightweight sweeping periodically to clean up failed operations:

  1. Look for records whose ID was assigned a long time ago but that still have no upload timestamp. (Needs an extra timestamp set when the ID is assigned.)
  2. Mark them as both uploaded and deleted by setting both timestamps. (This is a valid operation regardless of whether they've been saved to the object store.)
  3. Look for objects that have been marked as deleted a long time ago.
  4. Perform the full deletion operation on them again.
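
The sweep could look roughly like the sketch below. It only scans the database and never lists the bucket. As before, the `Map` and `Set` are in-memory stand-ins for Postgres and S3, and `sweep`, `createdAt` and `maxAgeMs` are hypothetical names. What counts as "a long time ago" is an assumption you would tune.

```typescript
interface MediaRecord {
  id: string;
  createdAt: Date;         // set when the ID is assigned
  uploadedAt: Date | null;
  deletedAt: Date | null;
}

function sweep(db: Map<string, MediaRecord>, store: Set<string>, now: Date, maxAgeMs: number): void {
  const cutoff = now.getTime() - maxAgeMs;

  // Steps 1-2: stale records that never got an upload timestamp are marked
  // as both uploaded and deleted, whether or not the upload actually landed.
  for (const record of Array.from(db.values())) {
    if (record.uploadedAt === null && record.createdAt.getTime() < cutoff) {
      db.set(record.id, { ...record, uploadedAt: now, deletedAt: now });
    }
  }

  // Steps 3-4: records marked deleted long ago get the full deletion again:
  // remove the object first (a no-op if it is already gone), then the row.
  for (const record of Array.from(db.values())) {
    if (record.deletedAt !== null && record.deletedAt.getTime() < cutoff) {
      store.delete(record.id);
      db.delete(record.id);
    }
  }
}
```

Note that a record marked in steps 1-2 is only removed by a later sweep, once its deletion timestamp is itself older than the cutoff. This could run from a cron job or a scheduled task inside the existing backend, so it does not have to be a separate service.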