Delete old rows in change log for Firestore to BigQuery Extenstion

31 Views Asked by At

I am using the Firebase Extension Stream Firestore to BigQuery to export my data to BQ. It is working great, and I have been using it for a couple of years now. However, the table size is starting to get larger causing an increase in cost for my queries. I have the following tables created by this and npx @firebaseextensions/fs-bq-schema-views:

  • mytable_raw_changelog (1,327,594 rows)
  • mytable_raw_latest (55,329 rows)
  • mytable_schema_changelog (1,332,454 rows)
  • mytable_schema_latest (55,745 rows, 3.2GB query)

A single SELECT from the mytable_schema_latest table now processes 3.2GB

The options I can think of are

  1. Remove old rows from the _changelog that are no longer needed
  2. Add time partitioning using the Firebase BQ extension config
  3. Create a job to copy the mytable_schema_latest to another table for querying (without the historic changes)

I am looking to go with Option 1 - remove old rows for now as this seems the simplest. Is there a recommended way or any guides on how to do this?

I only want to remove old historic changes that occurred before a certain date where the document has since been updated.

It may also be a good idea to copy these to a historic table before deleting them in case I need them in the future.

Update 204-03-28: I noticed there is a new parameter with the latest version of the extension. Not much info but maybe this will do the trick?

Exclude old data payloads If enabled, table rows will never contain old data (document snapshot before the update), which should be more performant, and avoid potential resource limitations.

1

There are 1 best solutions below

0
Greg Fenton On

To create a "backup" of a given table, you can simply run a query and then use SAVE RESULTS >> BigQuery Table to save to a new table. Now you have a "backup" and if something goes wrong you can restore from this new table.

I think the following SQL statement might delete "old" rows for you:

DELETE FROM your_dataset.your_table
  WHERE STRUCT(document_id, timestamp) NOT IN (
      SELECT AS STRUCT document_id, MAX(timestamp) AS timestamp
      FROM your_dataset.your_table
      GROUP BY document_id
  )