Background:
I'm working on a Google Cloud Platform (GCP) project that uses Datastream to stream changes from a MySQL database into a Cloud Storage bucket. I've also set up an EventArc trigger that fires when an object is created in this bucket (event type: google.cloud.storage.object.v1.finalized).
The EventArc trigger is responsible for invoking a Google Workflow, which checks for schema changes and then pushes the data into a BigQuery table if no schema changes are detected.
Issue:
The problem I'm encountering is inconsistency in the subject field within the JSON payload that I receive from the EventArc trigger. Sometimes the subject contains the full path to the Cloud Storage object, while other times it only contains the prefix (i.e., the path to the directory containing the object).
Specific Concern:
The core issue here is that I wish to transfer only the specific object that triggered the EventArc to ensure no data is transferred more than once. When I receive just the path (the prefix) instead of the full object name in the subject field, it becomes difficult to isolate the object that caused the trigger. I could loop through all objects in the given path or use a wildcard to transfer all matching objects, but this approach poses a risk of duplicating transfers.
For example, if I get a request at 14:01:02 seconds, and another one comes in at 14:01:50, using a wildcard or loop would not guarantee that I'm isolating and transferring only the newly created object that invoked the EventArc. This leads to potential data duplication and inefficiencies in the workflow.
For example:
Full Path:
"subject": "objects/<storage_bucket>/2023/09/06/12/40/ae10dc08dbed566c3e999005a1b565416b38fe51_mysql-cdc-binlog_1014338102_6_14695541.avro"
Prefix Only:
"subject": "objects/<storage_bucket>/2023/09/06/13/21/"
Additional Information:
Here's a full sample request payload:
{
"bucket": "***",
"data": {
"bucket": "***",
"contentType": "application/octet-stream",
"crc32c": "ofTnjQ==",
...
"name": "***",
...
"size": "23301",
"storageClass": "STANDARD",
...
},
"datacontenttype": "application/json",
"id": "8811078306811423",
"source": "//storage.googleapis.com/projects/_/buckets/***",
"specversion": "1.0",
"subject": "objects/***/2023/09/06/12/40/***",
"time": "2023-09-06T12:43:07.949071Z",
"type": "google.cloud.storage.object.v1.finalized"
}
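Note that in the payload above, data.name carries the exact name of the finalized object, so reading it instead of string-parsing subject may sidestep the ambiguity when the event is for a single object (a sketch; the field access follows the sample payload shown):

```python
def object_ref_from_event(payload: dict) -> tuple[str, str]:
    """Extract (bucket, object name) from a storage finalized
    CloudEvents payload like the sample above. data.name is the
    exact name of the object the event was fired for."""
    data = payload["data"]
    return data["bucket"], data["name"]
```

Usage: object_ref_from_event(event_payload) returns the bucket and object name without touching the subject field.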
Additional Note:
I've checked the Cloud Storage bucket, and the number of objects in a given path doesn't seem to influence whether the subject field contains a specific object or just a directory.
Questions:
What could be causing this inconsistency in the subject field?
Is there a way to ensure that the subject always contains either the full path or just the prefix?
Are there any additional configurations that I might be missing to achieve this?
The way Google Cloud Storage generates these events accounts for the discrepancy you've noticed in the subject field of the JSON payload from the EventArc trigger. When the subject field of an event contains the whole path, the event was triggered by the creation of one particular object in the bucket. When you see only the prefix in the subject field, it likely means that several objects were created under that directory (prefix) at about the same time; for efficiency, Cloud Storage combines these into a single event.
This behavior is intended for high-throughput scenarios in which numerous objects are created in the same directory in quick succession, in order to optimize event delivery and reduce the number of individual events generated.
To answer your questions:
What could be causing this inconsistency in the subject field?
The inconsistency is caused by Google Cloud Storage's event-handling behavior. When several objects are created in the same directory in quick succession, you receive a prefix rather than individual object paths. Cloud Storage does this to maximize performance.
Is there a way to ensure that the subject always contains either the full path or just the prefix?
Unfortunately, there is no built-in configuration option to make Cloud Storage always emit a separate event for each object, or always emit just the prefix. The behavior is determined by how many objects are created within a directory in a short span of time.
Are there any additional configurations that I might be missing to achieve this?
There is no configuration that directly controls this behavior. To handle the situation effectively in your workflow, you can take the following actions:
Handle both cases (full path and prefix) in your workflow. When you receive a prefix, list the objects under that prefix and process each one individually.
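That branching can be sketched as follows, with the bucket listing injected as a callable so the logic stays testable. In practice the callable would wrap something like google.cloud.storage's Bucket.list_blobs; the helper names here are illustrative, not from any GCP library:

```python
from typing import Callable, Iterable

def objects_to_process(object_path: str,
                       list_objects: Callable[[str], Iterable[str]]) -> list[str]:
    """Return the object names the workflow should transfer.

    If object_path names a single object, process just that object.
    If it ends with '/', treat it as a prefix: list everything under
    it, skipping directory placeholders (names ending in '/')."""
    if object_path.endswith("/"):
        return [name for name in list_objects(object_path)
                if not name.endswith("/")]
    return [object_path]
```

This keeps the single-object fast path cheap and only pays the listing cost when an event arrives with a prefix-only subject.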
Implement a de-duplication mechanism in your workflow. Keep track of processed objects by storing unique identifiers (e.g. object names or generation numbers) and check against this list before transferring, so nothing is processed twice.
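One way to sketch that de-duplication. The registry is in-memory here purely for illustration; a real workflow would persist the keys in a durable store such as Firestore or a BigQuery table (those backends are assumptions, not requirements):

```python
class ProcessedRegistry:
    """Tracks which objects have already been transferred.

    Keys on (bucket, object name, generation); the generation
    number distinguishes rewrites of the same object name."""

    def __init__(self) -> None:
        self._seen: set[tuple] = set()

    def claim(self, bucket: str, name: str, generation: str = "") -> bool:
        """Return True and record the object if it has not been
        processed yet; return False if it was already handled."""
        key = (bucket, name, generation)
        if key in self._seen:
            return False
        self._seen.add(key)
        return True
```

The workflow would call claim() before each transfer and skip the object when it returns False.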
Consider changing your bucket structure or object-naming conventions to reduce the chance of many objects being created in the same directory at once. This might lessen the likelihood of prefix-only events.