I have unbounded data streaming into BigQuery from three Pub/Sub topics, through three separate Dataflow jobs, into one fact table and two dimension tables. I want to link the fact rows to the dimension rows through a surrogate key so I can keep track of Slowly Changing Dimensions (SCD). The dimension data arrives from its Pub/Sub topics first, so it is inserted into the dimension tables in BigQuery before the fact data. When the fact data is inserted (after the inserts into the dimension tables), the Dataflow job should efficiently look up the surrogate key in the dimension tables for each fact row, ideally keeping the dimension data in a cache. How should I proceed?
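Below is a minimal sketch of what I have in mind for the fact-side job, written with the Beam Python SDK. The project, topic and table names, the `customer_id` natural key and the `surrogate_key` / `is_current` columns are all placeholders for my real schema, and the per-worker dict is just the naive version of the caching I am asking about:

```python
import json

import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions
from google.cloud import bigquery


class EnrichWithSurrogateKey(beam.DoFn):
    """Looks up the current surrogate key for a fact row's natural key and
    keeps an in-memory cache per worker to avoid repeated BigQuery calls."""

    def __init__(self, project, dim_table):
        self.project = project
        self.dim_table = dim_table

    def setup(self):
        # One BigQuery client and one cache per worker.
        self.client = bigquery.Client(project=self.project)
        self.cache = {}  # natural_key -> surrogate_key

    def process(self, fact_row):
        natural_key = fact_row["customer_id"]  # placeholder natural key
        surrogate_key = self.cache.get(natural_key)
        if surrogate_key is None:
            rows = list(
                self.client.query(
                    f"SELECT surrogate_key FROM `{self.dim_table}` "
                    "WHERE natural_key = @nk AND is_current = TRUE",
                    job_config=bigquery.QueryJobConfig(
                        query_parameters=[
                            bigquery.ScalarQueryParameter("nk", "STRING", natural_key)
                        ]
                    ),
                ).result()
            )
            if rows:
                surrogate_key = rows[0]["surrogate_key"]
                self.cache[natural_key] = surrogate_key
        yield {**fact_row, "dim_surrogate_key": surrogate_key}


def run():
    options = PipelineOptions(streaming=True)
    with beam.Pipeline(options=options) as p:
        (
            p
            | "ReadFacts" >> beam.io.ReadFromPubSub(
                topic="projects/my-project/topics/fact-topic"
            )
            | "Parse" >> beam.Map(json.loads)
            | "AttachSurrogateKey" >> beam.ParDo(
                EnrichWithSurrogateKey("my-project", "my-project.dwh.dim_customer")
            )
            | "WriteFacts" >> beam.io.WriteToBigQuery(
                "my-project:dwh.fact_events",
                write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND,
            )
        )
```

The problem with this naive version is the cache itself: once a surrogate key is cached on a worker, that worker never sees updates made to the dimension table by the other Dataflow job.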
I have found these examples on doing lookups in streaming jobs: "Pattern: Streaming mode large lookup tables" and "Use an appropriate side-input pattern for data enrichment".
In the Google "Streaming mode large lookup tables" example they say: "We recommend that you cache the result of the external lookups to reduce the number of lookups." How should I cache data from a BigQuery table that is updated frequently by another Dataflow job?
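One idea I had is to bound how stale the cache can get by expiring entries after a short TTL (for example with `cachetools.TTLCache`), so that dimension rows rewritten by the other Dataflow job are re-read after at most that interval. This is just a sketch of the same hypothetical lookup as above with an expiring cache, not something I know to be the recommended approach:

```python
import apache_beam as beam
from cachetools import TTLCache
from google.cloud import bigquery


class TtlCachedSurrogateKeyLookup(beam.DoFn):
    """Surrogate-key lookup whose cache entries expire, bounding staleness."""

    def __init__(self, project, dim_table, ttl_seconds=300, max_entries=100_000):
        self.project = project
        self.dim_table = dim_table
        self.ttl_seconds = ttl_seconds
        self.max_entries = max_entries

    def setup(self):
        self.client = bigquery.Client(project=self.project)
        # Entries are evicted ttl_seconds after insertion, so a frequently
        # updated dimension row is looked up again at most once per interval.
        self.cache = TTLCache(maxsize=self.max_entries, ttl=self.ttl_seconds)

    def process(self, fact_row):
        natural_key = fact_row["customer_id"]  # placeholder natural key
        if natural_key not in self.cache:
            rows = list(
                self.client.query(
                    f"SELECT surrogate_key FROM `{self.dim_table}` "
                    "WHERE natural_key = @nk AND is_current = TRUE",
                    job_config=bigquery.QueryJobConfig(
                        query_parameters=[
                            bigquery.ScalarQueryParameter("nk", "STRING", natural_key)
                        ]
                    ),
                ).result()
            )
            if rows:
                self.cache[natural_key] = rows[0]["surrogate_key"]
        yield {**fact_row, "dim_surrogate_key": self.cache.get(natural_key)}
```

With a TTL of a few minutes each worker would re-query a given key at most once per interval, but a fact row could still be enriched with a surrogate key that was superseded within that window, which is exactly the part I am unsure how to handle.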