Google Cloud Data Loss Prevention Inspection Jobs Sampling


For the past couple of weeks, the sampling percentage of a DLP inspection job has not been retrieving the expected percentage of data to scan in BigQuery.

The inspection job is configured with both a timespan and percentage sampling (see image; a simplified sketch of the config follows the examples below).

  • Example from a month ago: 143,000 rows in BQ -> 48,000 rows analyzed (roughly 34%)
  • Example from yesterday: 141,000 rows in BQ -> 420 rows analyzed (roughly 0.3%)
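For reference, here is a simplified sketch of how the job is set up using the Python client library. The project, dataset, table, infoType, and percentage are placeholders, not the exact values from my job:

    import datetime
    import google.cloud.dlp_v2

    dlp = google.cloud.dlp_v2.DlpServiceClient()

    inspect_job = {
        "inspect_config": {"info_types": [{"name": "EMAIL_ADDRESS"}]},
        "storage_config": {
            "big_query_options": {
                "table_reference": {
                    "project_id": "my-project",
                    "dataset_id": "my_dataset",
                    "table_id": "my_table",
                },
                # Percentage sampling: scan roughly this share of rows.
                "rows_limit_percent": 30,
                "sample_method": google.cloud.dlp_v2.BigQueryOptions.SampleMethod.RANDOM_START,
            },
            # Timespan sampling: only consider rows within this window.
            "timespan_config": {
                "start_time": datetime.datetime(2023, 9, 1, tzinfo=datetime.timezone.utc),
                "end_time": datetime.datetime(2023, 10, 1, tzinfo=datetime.timezone.utc),
            },
        },
    }

    response = dlp.create_dlp_job(
        request={
            "parent": "projects/my-project/locations/global",
            "inspect_job": inspect_job,
        }
    )
    print(response.name)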

Is there some logic in the sampling method that I'm missing? If so, why did the behavior change without any change to the job configuration?

I've looked at the documentation, which states that "you can sample a subset of the total selected rows, corresponding to the percentage of files you specify to include in the scan". This is frankly unclear, since it talks about files rather than rows.


1 Answer

Answered by Poala Astrid

When you use Google Cloud Data Loss Prevention (DLP) inspection jobs with sampling, the percentage you specify is meant to apply to the total number of rows scanned. In other words, the sampling percentage should determine the fraction of rows from your dataset that is selected for analysis.

For example, this documentation demonstrates using the Cloud DLP API to scan a 90% subset of a Cloud Storage bucket for person names. The scan starts from a random location in the dataset and only includes text files under 200 bytes.
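Translated into an API call, that documented Cloud Storage example looks roughly like the sketch below. The bucket name and parent project are placeholders; the limits mirror the description above:

    import google.cloud.dlp_v2

    dlp = google.cloud.dlp_v2.DlpServiceClient()

    inspect_job = {
        "inspect_config": {"info_types": [{"name": "PERSON_NAME"}]},
        "storage_config": {
            "cloud_storage_options": {
                "file_set": {"url": "gs://my-bucket/*"},
                # Only text files, capped at 200 bytes each.
                "file_types": [google.cloud.dlp_v2.FileType.TEXT_FILE],
                "bytes_limit_per_file": 200,
                # Scan about 90% of matching files, starting at a random point.
                "files_limit_percent": 90,
                "sample_method": google.cloud.dlp_v2.CloudStorageOptions.SampleMethod.RANDOM_START,
            }
        },
    }

    dlp.create_dlp_job(
        request={
            "parent": "projects/my-project/locations/global",
            "inspect_job": inspect_job,
        }
    )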

In the configuration you followed, sampling starts from a random location (RANDOM_START), which can affect which rows, and how many, end up being analyzed. The sampling settings as a whole determine the number of rows selected for analysis.

You can monitor the DLP job over multiple runs to see whether the behavior is consistent or whether it fluctuates.
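One way to do that is to list recent inspection jobs and compare how much data each run actually processed. A minimal sketch, assuming a placeholder project ID:

    import google.cloud.dlp_v2

    dlp = google.cloud.dlp_v2.DlpServiceClient()

    # List recent inspection jobs and print how much data each run processed.
    jobs = dlp.list_dlp_jobs(
        request={
            "parent": "projects/my-project/locations/global",
            "type_": google.cloud.dlp_v2.DlpJobType.INSPECT_JOB,
        }
    )
    for job in jobs:
        result = job.inspect_details.result
        print(job.name, job.state.name, result.processed_bytes, "bytes processed")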