Data Identification

Data Dictionaries in Data Catalogs

A data dictionary in a data catalog is a specialized collection of predefined terms, values, and definitions used to automatically classify and tag data elements within an organization's datasets. Unlike traditional data dictionaries that simply define terminology, catalog data dictionaries function as intelligent matching tools that scan actual data content to identify and categorize information.

These dictionaries serve as automated reference systems that help data catalogs recognize specific types of data by comparing column values against predefined lists of known terms. For example, a country codes dictionary might contain "US," "CA," "GB" to automatically identify geography-related columns, while a product name dictionary could contain specific product identifiers to classify commercial data.

How Data Dictionaries Enable Data Discovery

Data dictionaries in catalogs primarily support column data matching - the process of automatically identifying what type of information a data column contains based on its actual values rather than just column names or metadata. This is particularly valuable for data elements that can't be identified through pattern matching alone, such as:

Country or state codes
Product names or SKUs
Department codes
Custom business terminology
Industry-specific classifications
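
The column data matching described above can be sketched in a few lines. This is a minimal illustration, not how the catalog implements it: the function name dictionary_similarity and the idea of scoring a column by the fraction of its values found in the dictionary are assumptions made for the example.

```python
def dictionary_similarity(column_values, dictionary_terms):
    """Return the fraction of non-empty column values found in the dictionary.

    Hypothetical helper: matching is case-insensitive and ignores
    surrounding whitespace, which is an assumption of this sketch.
    """
    terms = {t.strip().upper() for t in dictionary_terms}
    values = [v.strip().upper() for v in column_values if v and v.strip()]
    if not values:
        return 0.0
    hits = sum(1 for v in values if v in terms)
    return hits / len(values)


# A tiny country codes dictionary and a sample column with one unknown value.
country_codes = ["US", "CA", "GB"]
column = ["US", "gb", "CA", "US", "XX"]

score = dictionary_similarity(column, country_codes)
print(score)  # 4 of the 5 values are in the dictionary
```

A score close to 1.0 suggests the column holds values from that dictionary, regardless of what the column is named.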

Types of Data Dictionaries

Modern data catalogs typically support two categories of dictionaries:

System-Defined Dictionaries: Built-in collections that come pre-configured with the catalog, containing common data types like ISO country codes, currency symbols, or standard industry classifications.

User-Defined Dictionaries: Custom collections created by organizations to match their specific business context and terminology. These can be created through multiple approaches:

Importing structured files (CSV with JSON definitions)
Building dictionaries through the user interface
Extracting dictionary terms directly from existing profiled data columns

Benefits for Data Management

By implementing data dictionaries, organizations can achieve:

Automated Data Classification: Systematic identification of data types without manual tagging
Improved Data Discovery: Users can find relevant datasets by searching for business terms rather than technical column names
Consistency Across Systems: Standardized identification of similar data elements across different databases and applications
Enhanced Data Governance: Better understanding of what data exists and where it's located

This automated approach transforms data catalogs from passive repositories into active discovery tools that understand the semantic meaning of organizational data.

Log into Data Catalog:

https://pdc.pentaho.lab

Username: [email protected]

Password: Welcome123!

View - System Dictionary

Pentaho Data Catalog ships with 95 built-in dictionaries, pre-configured for common data types such as ISO country codes, currency symbols, and standard industry classifications.

We're going to take a look at: Marital_Status

Click: Data Operations > Data Identification Methods.

Click on Dictionaries.

Click on the Name to sort A - Z.

Scroll to page 4 of 5.
Click on the 3 dots for Marital_Status > View

When the data is profiled, the column values (in our example, 'Marital_Status') are compared against the predefined dictionary using a Rule, which produces a match with a degree of confidence.

Once matched, the tags PII, Marital Status, and Non-Sensitive are applied.

Click on Rules

The Rules view shows the logic the dictionary uses to apply tags, as defined in its JSON file: the confidence score calculation, the conditions, and the resulting actions. These factors determine when the dictionary is applied to a dataset.

For example, in the following JSON file for Marital_Status, the dictionary rule specifies:

type is "Dictionary".
minSamples is 200.
The confidence score is the weighted sum (similarity x 0.9) + (metadataScore x 0.1).
The condition applies when the confidence score is greater than or equal to 0.7 and the column cardinality is greater than or equal to 3.
If these conditions are met, the action applies the tags PII and Sensitive: Marital Status to the dataset.

This demonstrates how the provided logic guides the application of tags to datasets based on specified criteria.

[
    {
        "type": "Dictionary",
        "minSamples": 200,
        "confidenceScore": {
            "+": [
                {
                    "*": [
                        {
                            "var": "similarity"
                        },
                        0.9
                    ]
                },
                {
                    "*": [
                        {
                            "var": "metadataScore"
                        },
                        0.1
                    ]
                }
            ]
        },
        "condition": {
            "and": [
                {
                    ">=": [
                        {
                            "var": "confidenceScore"
                        },
                        "0.7"
                    ]
                },
                {
                    ">=": [
                        {
                            "var": "columnCardinality"
                        },
                        "3"
                    ]
                }
            ]
        },
        "actions": [
            {
                "applyTags": [
                    {
                        "name": "PII"
                    },
                    {
                        "name": "Sensitive",
                        "value": "Marital Status",
                        "t": "sdd;"
                    }
                ]
            }
        ]
    }
]
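
To make the rule logic concrete, here is a minimal sketch that evaluates the confidence score and condition from the JSON above. It assumes the rule uses JsonLogic-style operators and supports only the ones that appear in this rule (var, +, *, >=, and). The evaluate and rule_matches functions and the sample input values are hypothetical, and minSamples is not modeled because its exact semantics are not described here.

```python
def evaluate(expr, variables):
    """Evaluate the small subset of JsonLogic-style operators used in the rule."""
    if isinstance(expr, dict):
        (op, args), = expr.items()  # each node has exactly one operator key
        if op == "var":
            return variables[args]
        vals = [evaluate(a, variables) for a in args]
        if op == "+":
            return sum(float(v) for v in vals)
        if op == "*":
            result = 1.0
            for v in vals:
                result *= float(v)
            return result
        if op == ">=":
            # Thresholds are stored as strings ("0.7", "3"), so coerce to float.
            return float(vals[0]) >= float(vals[1])
        if op == "and":
            return all(vals)
        raise ValueError(f"unsupported operator: {op}")
    return expr  # literal number or string


# The confidenceScore and condition parts of the Marital_Status rule.
rule = {
    "confidenceScore": {"+": [{"*": [{"var": "similarity"}, 0.9]},
                              {"*": [{"var": "metadataScore"}, 0.1]}]},
    "condition": {"and": [{">=": [{"var": "confidenceScore"}, "0.7"]},
                          {">=": [{"var": "columnCardinality"}, "3"]}]},
}


def rule_matches(rule, similarity, metadata_score, column_cardinality):
    """Return (confidence score, whether the condition holds)."""
    variables = {
        "similarity": similarity,
        "metadataScore": metadata_score,
        "columnCardinality": column_cardinality,
    }
    variables["confidenceScore"] = evaluate(rule["confidenceScore"], variables)
    return variables["confidenceScore"], evaluate(rule["condition"], variables)


# Illustrative inputs: 0.8 * 0.9 + 0.5 * 0.1 = 0.77, cardinality 4.
score, matched = rule_matches(rule, 0.8, 0.5, 4)
print(round(score, 2), matched)
```

With these inputs both conditions hold, so the tags would be applied. A column with a perfect similarity but only two distinct values would fail the cardinality condition.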

Data Patterns

A pattern analysis (or data patterns) document defines the data pattern, regular expression, column aliases, and tags that you can use to identify a column of data. You can use data patterns for a variety of purposes, such as regular expression (RegEx) generation, data identification, and data quality checking.

Let's run through an example: USA_GBR_Passport_Numbers.

Navigate to the 'Management' tile and click Patterns.
Search for: USA_GBR_Passport_Numbers.

When the data is profiled (in our example, the 'passport' column), the regexMatch produces a profilePattern that is compared, with a degree of confidence, against the predefined profilePattern. Once matched, the tags PII, GBR and USA Passport, and Sensitive are applied.

Data quality analysis fundamentally relies on one of two approaches: using a regular expression to check the data, or statistically analyzing the data itself to find common patterns and outlier patterns (which could indicate bad data).

View the data in the Patients table.

The data identification process generates roughly the top 20 most common patterns which capture the characteristics of the data. You can then use these patterns, along with their statistical frequency and supplementary information, to generate regular expression (RegEx) recommendations for your data. You can tune the RegEx to meet your specific needs, or select valid patterns, so that subsequent data quality checks will identify any data entries that are outside the accepted patterns.
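
As a rough sketch of the statistical side of this process, the following example counts how often each pattern occurs and then flags entries whose pattern is outside an accepted set. The function names, the sample values, and the precomputed pattern strings are all hypothetical and only illustrate the idea; they are not the catalog's implementation.

```python
from collections import Counter


def summarize_patterns(patterns, top_n=20):
    """Return (pattern, count, frequency) for the top_n most common patterns."""
    counts = Counter(patterns)
    total = sum(counts.values())
    return [(p, c, c / total) for p, c in counts.most_common(top_n)]


def find_outliers(values_with_patterns, accepted_patterns):
    """Return the values whose pattern is not in the accepted set."""
    accepted = set(accepted_patterns)
    return [v for v, p in values_with_patterns if p not in accepted]


# Hypothetical (value, pattern) pairs, with the patterns already computed.
observed = [
    ("KT-1734B", "AA-nnnnA"),
    ("AB-0001C", "AA-nnnnA"),
    ("ZZ-9999Q", "AA-nnnnA"),
    ("kt1734", "aannnn"),
]

summary = summarize_patterns([p for _, p in observed])
outliers = find_outliers(observed, accepted_patterns=["AA-nnnnA"])
print(summary)
print(outliers)
```

Here the dominant pattern covers three of the four entries, and the remaining entry is reported as an outlier once only the dominant pattern is accepted as valid.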

For further information: take a look at Data Patterns in Data Identification Methods.


Data Patterns in Data Identification

Data patterns play a crucial role in identifying and categorizing data within a data catalog. These patterns are essentially recurring characteristics or behaviors in data sets that can be recognized and used to automate data management.

The 'Getting Started' -> 'Identify the data' section explained how data patterns are used to profile the data.

Data Pattern Analysis reduces each data item to a simple pattern, essentially applying dimensionality reduction to each character position in the input text. The result is a string that indicates where alphabetic characters, numeric characters, symbols, and whitespace appear.

For example, KT-1734B generates the data pattern “AA-nnnnA”: two letters, followed by a dash, followed by four digits and another letter.

Case sensitivity could optionally be tracked as well. Also, the set of “significant” symbols might be user-configurable (e.g., “As a data quality engineer, for this column, a dash and an underscore are significant”).

The base process iterates over every character in the data item and performs a simple character-for-character substitution, resulting in a “data pattern” string for the item.

The pattern consists of the following characters:

Character    Description
a            lowercase alphabetic character
A            uppercase alphabetic character
n            digit 0..9
w            whitespace character (space, tab)
s            symbol character (e.g., -/|!£$%^&*()+=[]{}@#~;:,.?¬¥§¢" )
-            some other character (control, special symbol, etc.)
Others       any other symbol may be treated as “significant” (such as a dash, underscore, or colon); these are output as-is in the generated data pattern for the entry
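
The character-for-character substitution can be sketched as follows. This is an illustrative reading of the mapping above, not PDC's actual code: the function name data_pattern, the default set of significant symbols (dash, underscore, colon), and the handling of non-ASCII letters through Python's string methods are assumptions.

```python
# Symbol characters taken from the description above.
SYMBOLS = set('-/|!£$%^&*()+=[]{}@#~;:,.?¬¥§¢"')


def data_pattern(text, significant="-_:"):
    """Map each character of text to its pattern character.

    Characters listed in significant are copied to the output unchanged.
    """
    out = []
    for ch in text:
        if ch in significant:
            out.append(ch)       # significant symbol, kept as-is
        elif ch.isalpha() and ch.islower():
            out.append("a")      # lowercase letter
        elif ch.isalpha() and ch.isupper():
            out.append("A")      # uppercase letter
        elif ch in "0123456789":
            out.append("n")      # digit 0..9
        elif ch in " \t":
            out.append("w")      # whitespace
        elif ch in SYMBOLS:
            out.append("s")      # symbol character
        else:
            out.append("-")      # any other character
    return "".join(out)


print(data_pattern("KT-1734B"))                  # AA-nnnnA
print(data_pattern("KT-1734B", significant=""))  # AAsnnnnA
```

Note that in this sketch a significant dash and the "other character" marker both appear as "-" in the output, so a real implementation may want to keep the two distinguishable.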

Additional tests could be built into the algorithm to look for further characteristics. For example, date formats can be very tricky. PDC could observe that ‘nn/nn/nnnn’ is a date and then determine whether it is predominantly ‘mm/dd/yyyy’ or ‘dd/mm/yyyy’.
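
One way such a date test might work is to look only at the unambiguous values, where one of the first two fields is greater than 12 and therefore cannot be a month. The sketch below assumes that approach; the function name guess_date_order and the simple majority vote are hypothetical choices for illustration.

```python
import re

DATE_RE = re.compile(r"^(\d{2})/(\d{2})/(\d{4})$")


def guess_date_order(values):
    """Guess whether nn/nn/nnnn values are mostly mm/dd/yyyy or dd/mm/yyyy."""
    mdy = dmy = 0
    for v in values:
        m = DATE_RE.match(v.strip())
        if not m:
            continue  # not in nn/nn/nnnn form
        first, second = int(m.group(1)), int(m.group(2))
        if 12 < first <= 31 and 1 <= second <= 12:
            dmy += 1  # first field can only be a day
        elif 12 < second <= 31 and 1 <= first <= 12:
            mdy += 1  # second field can only be a day
    if mdy > dmy:
        return "mm/dd/yyyy"
    if dmy > mdy:
        return "dd/mm/yyyy"
    return "ambiguous"


print(guess_date_order(["03/25/2021", "12/31/2020", "01/02/2022"]))
print(guess_date_order(["25/03/2021", "01/02/2022"]))
```

Values such as 01/02/2022 are valid in both orders, so they do not contribute to either count. A column that contains only such values is reported as ambiguous.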

Another enhancement is detecting credit card numbers.
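
The text does not say how credit card numbers would be detected. One widely used check for candidate card numbers is the Luhn checksum, sketched below as an assumption about how such an enhancement might work. The function name luhn_valid and the accepted length range of 13 to 19 digits are choices made for this example.

```python
def luhn_valid(number):
    """Return True if number (spaces and dashes allowed) passes the Luhn check."""
    cleaned = number.replace(" ", "").replace("-", "")
    if not (cleaned.isascii() and cleaned.isdigit()):
        return False
    if not 13 <= len(cleaned) <= 19:
        return False
    total = 0
    # Walk the digits from right to left, doubling every second one.
    for i, ch in enumerate(reversed(cleaned)):
        d = int(ch)
        if i % 2 == 1:
            d *= 2
            if d > 9:
                d -= 9
        total += d
    return total % 10 == 0


print(luhn_valid("4111 1111 1111 1111"))  # a commonly used test number
print(luhn_valid("4111 1111 1111 1112"))
```

Combining a pattern such as sixteen digits with a checksum test of this kind helps reduce false positives from other long numeric identifiers.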