I have unbounded data streaming into BigQuery from three Pub/Sub topics, through three separate Dataflow jobs, into one fact table and two dimension tables. I want to link the fact rows to the dimension rows through a surrogate key so I can keep track of Slowly Changing Dimensions (SCD). The dimension data arrives from its Pub/Sub topics first, so it is inserted into the dimension tables in BigQuery before the fact data. When the fact data is inserted (after the inserts into the dimension tables), the Dataflow job should efficiently look up the surrogate key in the dimension tables for each fact row, ideally keeping the dimension data in a cache. How should I proceed?
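Below is a minimal sketch of what I have in mind for the fact-side job, written with the Beam Python SDK. The project, topic and table names, the `customer_id` natural key and the `surrogate_key` / `is_current` columns are all placeholders for my real schema, and the per-worker dict is just the naive version of the caching I am asking about:

```python
import json

import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions
from google.cloud import bigquery


class EnrichWithSurrogateKey(beam.DoFn):
    """Looks up the current surrogate key for a fact row's natural key and
    keeps an in-memory cache per worker to avoid repeated BigQuery calls."""

    def __init__(self, project, dim_table):
        self.project = project
        self.dim_table = dim_table

    def setup(self):
        # One BigQuery client and one cache per worker.
        self.client = bigquery.Client(project=self.project)
        self.cache = {}  # natural_key -> surrogate_key

    def process(self, fact_row):
        natural_key = fact_row["customer_id"]  # placeholder natural key
        surrogate_key = self.cache.get(natural_key)
        if surrogate_key is None:
            rows = list(
                self.client.query(
                    f"SELECT surrogate_key FROM `{self.dim_table}` "
                    "WHERE natural_key = @nk AND is_current = TRUE",
                    job_config=bigquery.QueryJobConfig(
                        query_parameters=[
                            bigquery.ScalarQueryParameter("nk", "STRING", natural_key)
                        ]
                    ),
                ).result()
            )
            if rows:
                surrogate_key = rows[0]["surrogate_key"]
                self.cache[natural_key] = surrogate_key
        yield {**fact_row, "dim_surrogate_key": surrogate_key}


def run():
    options = PipelineOptions(streaming=True)
    with beam.Pipeline(options=options) as p:
        (
            p
            | "ReadFacts" >> beam.io.ReadFromPubSub(
                topic="projects/my-project/topics/fact-topic"
            )
            | "Parse" >> beam.Map(json.loads)
            | "AttachSurrogateKey" >> beam.ParDo(
                EnrichWithSurrogateKey("my-project", "my-project.dwh.dim_customer")
            )
            | "WriteFacts" >> beam.io.WriteToBigQuery(
                "my-project:dwh.fact_events",
                write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND,
            )
        )
```

The problem with this naive version is the cache itself: once a surrogate key is cached on a worker, that worker never sees updates made to the dimension table by the other Dataflow job.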
I have found these examples on doing lookups in streaming jobs: "Pattern: Streaming mode large lookup tables" and "Use an appropriate side-input pattern for data enrichment".
In the Google "Streaming mode large lookup tables" example they say: "We recommend that you cache the result of the external lookups to reduce the number of lookups." How should I cache data from a BigQuery table that is updated frequently by another Dataflow job?
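One idea I had is to bound how stale the cache can get by expiring entries after a short TTL (for example with `cachetools.TTLCache`), so that dimension rows rewritten by the other Dataflow job are re-read after at most that interval. This is just a sketch of the same hypothetical lookup as above with an expiring cache, not something I know to be the recommended approach:

```python
import apache_beam as beam
from cachetools import TTLCache
from google.cloud import bigquery


class TtlCachedSurrogateKeyLookup(beam.DoFn):
    """Surrogate-key lookup whose cache entries expire, bounding staleness."""

    def __init__(self, project, dim_table, ttl_seconds=300, max_entries=100_000):
        self.project = project
        self.dim_table = dim_table
        self.ttl_seconds = ttl_seconds
        self.max_entries = max_entries

    def setup(self):
        self.client = bigquery.Client(project=self.project)
        # Entries are evicted ttl_seconds after insertion, so a frequently
        # updated dimension row is looked up again at most once per interval.
        self.cache = TTLCache(maxsize=self.max_entries, ttl=self.ttl_seconds)

    def process(self, fact_row):
        natural_key = fact_row["customer_id"]  # placeholder natural key
        if natural_key not in self.cache:
            rows = list(
                self.client.query(
                    f"SELECT surrogate_key FROM `{self.dim_table}` "
                    "WHERE natural_key = @nk AND is_current = TRUE",
                    job_config=bigquery.QueryJobConfig(
                        query_parameters=[
                            bigquery.ScalarQueryParameter("nk", "STRING", natural_key)
                        ]
                    ),
                ).result()
            )
            if rows:
                self.cache[natural_key] = rows[0]["surrogate_key"]
        yield {**fact_row, "dim_surrogate_key": self.cache.get(natural_key)}
```

With a TTL of a few minutes each worker would re-query a given key at most once per interval, but a fact row could still be enriched with a surrogate key that was superseded within that window, which is exactly the part I am unsure how to handle.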