How to create a custom masking rule to de-identify or obscure data with DLP and BigQuery?

As per (Link), the DLP API in GCP can mask sensitive data by partially or fully replacing characters with a symbol (De-identifying sensitive data). I didn't find any clue on how to customize the transformation rule in the request. For example, say we need to transform a 16-digit account number so that, once the value has been detected, the first 6 digits "and" the last 4 digits are left intact while the rest of the digits are replaced by "*" (123456******3456), or any similar combination. However, the configuration seems to allow masking only the first "or" the last digits of the field.

{
  "deidentifyConfig": {
    "recordTransformations": {
      "fieldTransformations": [
        {
          "fields": [
            {
              "name": "NUMBER_ACCOUNT"
            }
          ],
          "primitiveTransformation": {
            "characterMaskConfig": {
              "maskingCharacter": "#",
              "numberToMask": -6
            }
          }
        }
      ]
    }
  }
}

Result of the code above:

"stringValue": "#########123456"

The numberToMask field sets the number of characters to mask and, in combination with reverseOrder, lets us obscure just the first or the last digits. But what about both?
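
To make the limitation concrete, here is a rough local approximation of how numberToMask and reverseOrder interact, as I understand the documented behavior. The function name `character_mask` is hypothetical, charactersToIgnore is left out, and none of this calls the DLP API. The point is that the masked run is always anchored at one end of the value, so a single characterMaskConfig cannot keep both ends.

```python
def character_mask(value: str, masking_character: str = "#",
                   number_to_mask: int = 0, reverse_order: bool = False) -> str:
    """Local approximation of characterMaskConfig (assumption, not the DLP API)."""
    n = len(value)
    if number_to_mask == 0:
        count = n                            # unset: mask every character
    elif number_to_mask > 0:
        count = min(number_to_mask, n)       # mask this many characters
    else:
        count = max(n + number_to_mask, 0)   # inverse: mask all but |number_to_mask|
    if reverse_order:
        # The masked run is anchored at the end of the value.
        return value[:n - count] + masking_character * count
    # The masked run is anchored at the start of the value.
    return masking_character * count + value[count:]


if __name__ == "__main__":
    account = "1234567890123456"
    print(character_mask(account, "#", -6))        # keeps only the last 6 digits
    print(character_mask(account, "#", -6, True))  # keeps only the first 6 digits
```

Whichever combination is chosen, only one contiguous side of the value survives.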

Is it possible to use a regex or a transformation rule to create a custom deidentifyConfig? If not, what is the right approach to inspect (detect) a specific kind of sensitive data and apply a custom masking rule using DLP?

For example, how can I get these masked values:

12345678****3456
12345678******56
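
For reference, the target behavior can be stated precisely in a few lines of plain Python. This is only a sketch of the desired output, not something DLP provides; `mask_middle` and its parameters are hypothetical names.

```python
def mask_middle(value: str, keep_first: int, keep_last: int,
                mask_char: str = "*") -> str:
    """Keep the first and last characters intact and mask everything between."""
    hidden = len(value) - keep_first - keep_last
    if hidden <= 0:
        # Nothing left to mask once both ends are kept.
        return value
    return value[:keep_first] + mask_char * hidden + value[len(value) - keep_last:]


if __name__ == "__main__":
    account = "1234567890123456"
    print(mask_middle(account, 8, 4))  # 12345678****3456
    print(mask_middle(account, 8, 2))  # 12345678******56
    print(mask_middle(account, 6, 4))  # 123456******3456
```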

Note: Dynamic Data Masking in BigQuery is not an option here, since it doesn't yet offer a way to create a custom masking rule.

There are 3 best solutions below

1
Jordanna Chord On

That ability is not currently supported, but I'll record it as a feature request for the team.

1
Mike DaCosta On

One workaround is to define a custom infoType with a regex that matches your account number, and use groupIndexes so that only the capture group (the digits to mask) is reported as the finding, like this:

  "inspectConfig": {
    "customInfoTypes": [
      {
        "infoType": {
          "name": "NUMBER_ACCOUNT_TYPE"
        },
        "likelihood": "LIKELY",
        "regex": {
          "pattern": "\\d{8}(\\d{4})\\d{4}",
          "groupIndexes": [
            1
          ]
        }
      }
    ]
  },

Then use infoTypeTransformations to mask your custom infoType finding:

  "deidentifyConfig": {
    "recordTransformations": {
      "fieldTransformations": [
        {
          "fields": [
            {
              "name": "NUMBER_ACCOUNT"
            }
          ],
          "infoTypeTransformations": {
            "transformations": [
              {
                "infoTypes": [
                  {
                    "name": "NUMBER_ACCOUNT_TYPE"
                  }
                ],
                "primitiveTransformation": {
                  "characterMaskConfig": {
                    "maskingCharacter": "#"
                  }
                }
              }
            ]
          }
        }
      ]
    }
  }
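
To preview what this combination should produce, the effect can be imitated locally with Python's `re` module. This is only an illustration under the assumption that masking applies to the span of the selected capture group; `mask_group` is a hypothetical helper and does not call DLP.

```python
import re


def mask_group(value: str, pattern: str, group: int = 1,
               mask_char: str = "#") -> str:
    """Mask only the span matched by one capture group of the pattern."""
    match = re.search(pattern, value)
    if match is None:
        # No finding: the value is returned unchanged.
        return value
    start, end = match.span(group)
    return value[:start] + mask_char * (end - start) + value[end:]


if __name__ == "__main__":
    account = "1234567890123456"
    # Pattern from the inspectConfig above: keep 8, mask 4, keep 4.
    print(mask_group(account, r"\d{8}(\d{4})\d{4}"))
    # Variants for the other layouts asked about in the question.
    print(mask_group(account, r"\d{8}(\d{6})\d{2}"))
    print(mask_group(account, r"\d{6}(\d{6})\d{4}"))
```

Changing where the capture group sits in the pattern is what moves the masked window, so each layout needs its own pattern.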

Example request using the REST API: https://cloud.google.com/dlp/docs/reference/rest/v2/projects.locations.content/deidentify?apix=true&apix_params=%7B%22parent%22%3A%22projects%2Fproject-id%2Flocations%2Fglobal%22%2C%22resource%22%3A%7B%22item%22%3A%7B%22table%22%3A%7B%22headers%22%3A%5B%7B%22name%22%3A%22NUMBER_ACCOUNT%22%7D%2C%7B%22name%22%3A%22NUMBER_OTHER%22%7D%5D%2C%22rows%22%3A%5B%7B%22values%22%3A%5B%7B%22stringValue%22%3A%221234567890123456%22%7D%2C%7B%22stringValue%22%3A%221234567890123456%22%7D%5D%7D%5D%7D%7D%2C%22inspectConfig%22%3A%7B%22customInfoTypes%22%3A%5B%7B%22infoType%22%3A%7B%22name%22%3A%22NUMBER_ACCOUNT_TYPE%22%7D%2C%22likelihood%22%3A%22LIKELY%22%2C%22regex%22%3A%7B%22pattern%22%3A%22%5C%5Cd%7B8%7D(%5C%5Cd%7B4%7D)%5C%5Cd%7B4%7D%22%2C%22groupIndexes%22%3A%5B1%5D%7D%7D%5D%7D%2C%22deidentifyConfig%22%3A%7B%22recordTransformations%22%3A%7B%22fieldTransformations%22%3A%5B%7B%22fields%22%3A%5B%7B%22name%22%3A%22NUMBER_ACCOUNT%22%7D%5D%2C%22infoTypeTransformations%22%3A%7B%22transformations%22%3A%5B%7B%22infoTypes%22%3A%5B%7B%22name%22%3A%22NUMBER_ACCOUNT_TYPE%22%7D%5D%2C%22primitiveTransformation%22%3A%7B%22characterMaskConfig%22%3A%7B%22maskingCharacter%22%3A%22%23%22%7D%7D%7D%5D%7D%7D%5D%7D%7D%7D%7D
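
The body encoded in that link can also be assembled programmatically. The sketch below builds the same JSON with plain Python dictionaries; `build_deidentify_request` is a hypothetical helper, and actually sending the body to the deidentify endpoint (authentication, HTTP client) is deliberately left out.

```python
import json


def build_deidentify_request(pattern: str = r"\d{8}(\d{4})\d{4}",
                             mask_char: str = "#") -> dict:
    """Assemble the request body from the two fragments in this answer."""
    info_type = {"name": "NUMBER_ACCOUNT_TYPE"}
    return {
        # Sample table taken from the example link: two columns, one row.
        "item": {"table": {
            "headers": [{"name": "NUMBER_ACCOUNT"}, {"name": "NUMBER_OTHER"}],
            "rows": [{"values": [{"stringValue": "1234567890123456"},
                                 {"stringValue": "1234567890123456"}]}],
        }},
        "inspectConfig": {"customInfoTypes": [{
            "infoType": info_type,
            "likelihood": "LIKELY",
            # Only capture group 1 is reported as the finding.
            "regex": {"pattern": pattern, "groupIndexes": [1]},
        }]},
        "deidentifyConfig": {"recordTransformations": {"fieldTransformations": [{
            # Only the NUMBER_ACCOUNT column is transformed.
            "fields": [{"name": "NUMBER_ACCOUNT"}],
            "infoTypeTransformations": {"transformations": [{
                "infoTypes": [info_type],
                "primitiveTransformation": {
                    "characterMaskConfig": {"maskingCharacter": mask_char}},
            }]},
        }]}},
    }


if __name__ == "__main__":
    print(json.dumps(build_deidentify_request(), indent=2))
```

Passing a different `pattern` is enough to change which digits are masked; the rest of the body stays the same.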

0
Pratibha Chowdary On

Your use case can be met with the following steps. Hope it helps.

We can create UDFs for use as custom masking routines, based on this documentation: https://cloud.google.com/bigquery/docs/user-defined-functions#custom-mask

For the use case described above, we can do the following:

Step 1 - Create the following UDF:

CREATE OR REPLACE FUNCTION custom_mask(NUMBER_ACCOUNT STRING) RETURNS STRING
OPTIONS (data_governance_type="DATA_MASKING") AS (
  -- Keep the first 6 and last 4 characters, so LENGTH - 10 characters are masked.
  SAFE.REGEXP_REPLACE(
    NUMBER_ACCOUNT,
    r'^(.{6}).*(.{4})$',
    r'\1' || REPEAT('*', LENGTH(NUMBER_ACCOUNT) - 10) || r'\2'
  ));

Step 2 - Create a policy tag and attach this custom masking rule to the policy tag.

Step 3 - Tag the required column with this policy tag to dynamically mask it.