Identify the data

Data Identification: Dictionaries and Pattern Analysis

The data identification process classifies data using dictionaries and data pattern analysis. Data is tagged automatically when it meets the criteria defined in the dictionary and pattern configuration files.

While Data Catalog comes equipped with a comprehensive set of dictionaries and patterns, it also offers flexibility by allowing users to create custom dictionaries and pattern analysis configurations. This customization ensures that the data identification process can be tailored to meet the specific requirements of any organization.

Data Dictionaries

A data dictionary in a data catalog is a collection of predefined terms and definitions that help classify and tag data within an organization's datasets. It serves as a reference guide for data terms, helping users understand the meaning, usage, and context of each data element.

By leveraging a data dictionary, organizations can ensure consistency, accuracy, and easier data identification and management across different data sets. Custom dictionaries can also be created to meet specific organizational needs.

Let's run through an example: Marital_Status.

Navigate to the 'Management' tile & click on: Dictionaries.
Search for: Marital_Status

When the data is profiled, the values in a column (in our example, 'marital_status') are compared against the predefined dictionary using a rule, and the match is given a degree of confidence.

Once matched, the tags defined in the rule are applied: PII and Sensitive (with the value "Marital Status").

Click on the > to View Dictionary

Understanding a Data Dictionary Rule

A data dictionary rule is a set of criteria defined to automate the process of matching and tagging data within a dataset based on specific attributes. Below is a breakdown of the key components of a data dictionary rule, illustrated by the Marital_Status example:

• Type: Identifies the rule as a dictionary type, used for matching data against a list of predefined terms.

• minSamples: Specifies the minimum number of samples (e.g., 200) that must be present for the rule to be considered applicable. This ensures that there is a sufficient data volume for accurate matching.

• confidenceScore: A formula that calculates how confidently the data matches the dictionary. The score is a weighted sum of:

• similarity to the dictionary values (weighted by 0.9),

• metadataScore, which might consider other attributes of the data (weighted by 0.1).

• condition: Defines the conditions under which tags are applied. In this case:

• The confidenceScore must be greater than or equal to 0.7 (70%).

• The columnCardinality (the count of unique values in the column) must be 3 or more.

• actions: The actions to take when the conditions are met. For the Marital_Status rule, the following tags are applied:

• PII (Personally Identifiable Information),

• Sensitive, with a value of "Marital Status".

The rule searches column data for terms listed in the Marital_Status dictionary. If the column matches with a sufficient confidenceScore and meets the specified conditions, it is automatically tagged accordingly, making it easier for organizations to identify and manage sensitive data.
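To make the arithmetic concrete, here is a minimal Python sketch of the scoring and condition described above. It is an illustration only, not Data Catalog's implementation: the function names and the input values are hypothetical, and it assumes that minSamples simply gates whether the rule is evaluated at all.

```python
def confidence_score(similarity: float, metadata_score: float) -> float:
    """Weighted sum from the rule: 0.9 * similarity + 0.1 * metadataScore."""
    return 0.9 * similarity + 0.1 * metadata_score


def should_tag(similarity: float, metadata_score: float,
               column_cardinality: int, sample_count: int,
               min_samples: int = 200) -> bool:
    """Return True when the Marital_Status conditions would be met.

    Assumption: a column with fewer than min_samples samples is skipped.
    """
    if sample_count < min_samples:
        return False
    score = confidence_score(similarity, metadata_score)
    # Both conditions must hold: score >= 0.7 and at least 3 unique values.
    return score >= 0.7 and column_cardinality >= 3


# Hypothetical profiling results for a 'marital_status' column.
score = confidence_score(similarity=0.8, metadata_score=0.5)
print(round(score, 2))  # 0.9 * 0.8 + 0.1 * 0.5
print(should_tag(0.8, 0.5, column_cardinality=5, sample_count=1000))
```

With these hypothetical inputs the score is about 0.77, which clears the 0.7 threshold, and the column has 5 unique values, so the tags would be applied. Dropping the cardinality to 2 or the sample count below 200 would prevent tagging.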

Click on the Rules tab.

[
    {
        "__typename": "dictionariesRules",
        "type": "Dictionary",
        "minSamples": 200,
        "confidenceScore": {
            "+": [
                {
                    "*": [
                        {
                            "var": "similarity"
                        },
                        0.9
                    ]
                },
                {
                    "*": [
                        {
                            "var": "metadataScore"
                        },
                        0.1
                    ]
                }
            ]
        },
        "condition": {
            "and": [
                {
                    ">=": [
                        {
                            "var": "confidenceScore"
                        },
                        "0.7"
                    ]
                },
                {
                    ">=": [
                        {
                            "var": "columnCardinality"
                        },
                        "3"
                    ]
                }
            ]
        },
        "actions": [
            {
                "applyTags": [
                    {
                        "k": "PII"
                    },
                    {
                        "k": "Sensitive",
                        "v": "Marital Status",
                        "t": "sdd;"
                    }
                ]
            }
        ]
    }
]

In short: if the 'confidenceScore' (calculated from the 'similarity' to the dictionary values and the 'metadataScore') is >= 0.7 (70%) and the 'columnCardinality' is >= 3, then applyTags:

• PII

• Sensitive, with the value "Marital Status"
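The rule's confidenceScore and condition are written as nested operator expressions ({"+": [...]}, {">=": [...]}, {"var": ...}). The following Python sketch shows one way such an expression tree could be evaluated. It is a hypothetical, minimal evaluator covering only the operators that appear in this rule, not the engine Data Catalog uses. It also assumes that the string thresholds ("0.7", "3") are compared as numbers.

```python
from functools import reduce
import operator


def evaluate(expr, variables):
    """Evaluate a nested {operator: arguments} expression (hypothetical helper)."""
    if not isinstance(expr, dict):
        return expr  # literal such as 0.9 or "0.7"
    (op, args), = expr.items()  # each node has exactly one operator key
    if op == "var":
        return variables[args]
    values = [evaluate(arg, variables) for arg in args]
    if op == "+":
        return sum(float(v) for v in values)
    if op == "*":
        return reduce(operator.mul, (float(v) for v in values), 1.0)
    if op == ">=":
        left, right = values
        return float(left) >= float(right)  # assumption: strings coerce to numbers
    if op == "and":
        return all(values)
    raise ValueError(f"unsupported operator: {op}")


# The confidenceScore, condition and actions from the Marital_Status rule.
RULE = {
    "confidenceScore": {"+": [{"*": [{"var": "similarity"}, 0.9]},
                              {"*": [{"var": "metadataScore"}, 0.1]}]},
    "condition": {"and": [{">=": [{"var": "confidenceScore"}, "0.7"]},
                          {">=": [{"var": "columnCardinality"}, "3"]}]},
    "actions": [{"applyTags": [{"k": "PII"},
                               {"k": "Sensitive", "v": "Marital Status", "t": "sdd;"}]}],
}


def apply_rule(rule, similarity, metadata_score, column_cardinality):
    """Return the tags to apply, or an empty list when the condition fails."""
    score = evaluate(rule["confidenceScore"],
                     {"similarity": similarity, "metadataScore": metadata_score})
    matched = evaluate(rule["condition"],
                       {"confidenceScore": score,
                        "columnCardinality": column_cardinality})
    return rule["actions"][0]["applyTags"] if matched else []


# Hypothetical inputs: similarity 0.8, metadataScore 0.5, 5 unique values.
print(apply_rule(RULE, 0.8, 0.5, 5))
```

Keeping the evaluator generic means the same function handles both the score formula and the condition, which mirrors how the rule stores them as data rather than code.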