We are in the process of building a data lake to consolidate and manage data from various departments in a single GCP project.
Our objective is to ingest and process JSON data through Pub/Sub and DataFlow, then store it in a dedicated table for each data type within a single BigQuery dataset. For instance, the configuration looks like this:
project-id.common_dataset.table_for_datatype1
project-id.common_dataset.table_for_datatype2
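For reference, the ingestion job is roughly shaped like the following Apache Beam (Python) sketch; the project ID, subscription name, and options are placeholders, and error handling (plus stripping the routing field before the write) is omitted for brevity:

```python
import json

import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions


def run():
    # Streaming options; the project ID and subscription are placeholders.
    options = PipelineOptions(streaming=True, project="project-id")

    with beam.Pipeline(options=options) as p:
        (
            p
            | "ReadFromPubSub" >> beam.io.ReadFromPubSub(
                subscription="projects/project-id/subscriptions/ingest-sub")
            | "ParseJson" >> beam.Map(lambda msg: json.loads(msg.decode("utf-8")))
            # Route each record to the table that matches its log_type,
            # e.g. project-id.common_dataset.type1 / type2.
            | "WriteToBigQuery" >> beam.io.WriteToBigQuery(
                table=lambda row: f"project-id:common_dataset.{row['log_type']}",
                write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND,
                create_disposition=beam.io.BigQueryDisposition.CREATE_NEVER,
            )
        )


if __name__ == "__main__":
    run()
```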
We're looking for ways to allocate the costs incurred by Pub/Sub, DataFlow, and BigQuery usage to individual departments.
If the JSON data ingested by Pub/Sub includes a department label (for instance, {"department": "marketing", ...}), is there a good way to estimate the cost of processing and storing each department's data across Pub/Sub, DataFlow, and BigQuery?
Example:
Messages received by Pub/Sub include the department, as follows:
{"department":"marketing", "log_type":"type1", "value":1}
{"department":"product1", "log_type":"type1", "value":1}
{"department":"marketing", "log_type":"type2", "amount":10, "date":"2024-03-06"}
{"department":"product1", "log_type":"type2", "amount":20, "date":"2024-03-05"}
{"department":"product2", "log_type":"type2", "amount":20, "date":"2024-03-04"}
For log_type=type1, the BigQuery table is project-id.common_dataset.type1:
| department | value |
|---|---|
| marketing | 1 |
| product1 | 1 |
For log_type=type2, the BigQuery table is project-id.common_dataset.type2:
| department | amount | date |
|---|---|---|
| marketing | 10 | 2024-03-06 |
| product1 | 20 | 2024-03-05 |
| product2 | 20 | 2024-03-04 |
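Because every row already carries the department column, the BigQuery share could perhaps be approximated directly from the tables. A sketch using the BigQuery Python client; the table name comes from the example above, and the serialized-row byte count is only a rough proxy for billed storage/processing bytes:

```python
from google.cloud import bigquery

client = bigquery.Client(project="project-id")

# Approximate each department's share of a table's volume by serializing every
# row to JSON and summing the byte lengths. The absolute numbers are rough,
# but the ratio between departments is what matters for allocation.
query = """
SELECT
  department,
  SUM(BYTE_LENGTH(TO_JSON_STRING(t))) AS approx_bytes
FROM `project-id.common_dataset.type2` AS t
GROUP BY department
"""

for row in client.query(query).result():
    print(row.department, row.approx_bytes)
```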
What we ultimately want is an estimated monthly cost per department for Pub/Sub, DataFlow, and BigQuery, like this:
| department | Pub/Sub | DataFlow | BigQuery |
|---|---|---|---|
| marketing | 100 USD | 100 USD | 100 USD |
| product1 | 1000 USD | 1000 USD | 1000 USD |
| product2 | 500 USD | 500 USD | 500 USD |
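The last step would presumably be to split each service's actual monthly total (e.g. from the Cloud Billing export to BigQuery) in proportion to the measured per-department shares. A toy sketch, with placeholder shares and totals chosen so the output reproduces the example table above:

```python
# Fraction of total data volume attributed to each department, as measured by
# the counters/queries above. Placeholder values, not real measurements.
volume_share = {"marketing": 0.0625, "product1": 0.625, "product2": 0.3125}

# Actual monthly cost per service for the whole project, e.g. read from the
# Cloud Billing export. Placeholder values.
monthly_total_usd = {"Pub/Sub": 1600.0, "DataFlow": 1600.0, "BigQuery": 1600.0}

for department, share in volume_share.items():
    costs = {service: round(total * share, 2) for service, total in monthly_total_usd.items()}
    print(department, costs)  # e.g. marketing {'Pub/Sub': 100.0, 'DataFlow': 100.0, 'BigQuery': 100.0}
```

In practice each service would probably need its own volume measure (message bytes for Pub/Sub, processed bytes for DataFlow, stored and queried bytes for BigQuery), but the shape of the calculation would be the same.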