Data Identification

Pentaho Data Catalog uses data identification methods called Dictionaries and Data Patterns to help you identify data.

Data Dictionaries in Data Catalogs

A data dictionary in a data catalog is a specialized collection of predefined terms, values, and definitions used to automatically classify and tag data elements within an organization's datasets. Unlike traditional data dictionaries that simply define terminology, catalog data dictionaries function as intelligent matching tools that scan actual data content to identify and categorize information.

These dictionaries serve as automated reference systems that help data catalogs recognize specific types of data by comparing column values against predefined lists of known terms. For example, a country codes dictionary might contain "US," "CA," "GB" to automatically identify geography-related columns, while a product name dictionary could contain specific product identifiers to classify commercial data.
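
As a rough illustration of value-based matching, the sketch below scores a column by the fraction of its non-empty values that appear in a term list. The function name `dictionary_similarity`, the case-insensitive comparison, and the sample column are assumptions made for this example; this is not Pentaho Data Catalog's actual matching algorithm.

```python
def dictionary_similarity(column_values, dictionary_terms):
    """Return the fraction of non-empty column values found in the dictionary.

    Hypothetical helper for illustration only.
    """
    # Normalise both sides so that "us" and "US" compare equal (an assumption).
    terms = {t.strip().upper() for t in dictionary_terms}
    values = [v.strip().upper() for v in column_values if v and v.strip()]
    if not values:
        return 0.0
    hits = sum(1 for v in values if v in terms)
    return hits / len(values)


country_codes = ["US", "CA", "GB"]        # terms taken from the example above
column = ["us", "GB", "CA", "N/A", ""]    # made-up column values
score = dictionary_similarity(column, country_codes)
print(score)  # 3 of the 4 non-empty values match
```

A column that scores close to 1.0 is a strong candidate for the "country code" classification, while a low score suggests the column holds something else.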

How Data Dictionaries Enable Data Discovery

Data dictionaries in catalogs primarily support column data matching - the process of automatically identifying what type of information a data column contains based on its actual values rather than just column names or metadata. This is particularly valuable for data elements that can't be identified through pattern matching alone, such as:

  • Country or state codes

  • Product names or SKUs

  • Department codes

  • Custom business terminology

  • Industry-specific classifications
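
To see why a pattern alone is not enough for values like these, consider the minimal sketch below. The regular expression, the deliberately incomplete `STATE_CODES` set, and both helper functions are hypothetical and exist only for this example.

```python
import re

# Hypothetical pattern: any two uppercase letters "look like" a state code.
TWO_LETTERS = re.compile(r"[A-Z]{2}")

# Hypothetical, deliberately incomplete dictionary of US state codes.
STATE_CODES = {"CA", "NY", "TX", "WA"}


def matches_pattern(value):
    """True if the value has the right shape (two uppercase letters)."""
    return TWO_LETTERS.fullmatch(value) is not None


def matches_dictionary(value):
    """True only if the value is a known term."""
    return value in STATE_CODES


for value in ["CA", "ZZ"]:
    print(value, matches_pattern(value), matches_dictionary(value))
```

Both "CA" and "ZZ" satisfy the pattern, but only "CA" is in the term list. The shape of a value says nothing about whether it is a real code, which is the gap a dictionary fills.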

Types of Data Dictionaries

Modern data catalogs typically support two categories of dictionaries:

System-Defined Dictionaries: Built-in collections that come pre-configured with the catalog, containing common data types like ISO country codes, currency symbols, or standard industry classifications.

User-Defined Dictionaries: Custom collections created by organizations to match their specific business context and terminology. These can be created through multiple approaches:

  • Importing structured files (CSV with JSON definitions)

  • Building dictionaries through the user interface

  • Extracting dictionary terms directly from existing profiled data columns
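
The first and third approaches both come down to collecting a clean, de-duplicated list of terms. The sketch below shows that idea with Python's standard library. The one-term-per-row CSV layout, the function names, and the sample values are assumptions for illustration; they do not describe the file format Data Catalog expects on import.

```python
import csv
import io


def terms_from_csv(text):
    """Read one term per row from the first CSV column, skipping blanks."""
    reader = csv.reader(io.StringIO(text))
    return sorted({row[0].strip() for row in reader if row and row[0].strip()})


def terms_from_column(values):
    """Collect the distinct non-empty values of an already-profiled column."""
    return sorted({v.strip() for v in values if v and v.strip()})


# Made-up department names standing in for custom business terminology.
csv_text = "Finance\nHR\nEngineering\nHR\n"
print(terms_from_csv(csv_text))
print(terms_from_column(["HR", "Legal", "", "HR"]))
```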

Benefits for Data Management

By implementing data dictionaries, organizations can achieve:

  • Automated Data Classification: Systematic identification of data types without manual tagging

  • Improved Data Discovery: Users can find relevant datasets by searching for business terms rather than technical column names

  • Consistency Across Systems: Standardized identification of similar data elements across different databases and applications

  • Enhanced Data Governance: Better understanding of what data exists and where it's located

This automated approach transforms data catalogs from passive repositories into active discovery tools that understand the semantic meaning of organizational data.

Data Dictionary

  1. Log in to Data Catalog:

Username: [email protected]

Password: Welcome123!

View - System Dictionary

Pentaho Data Catalog ships with 95 built-in dictionaries, preconfigured for common data types such as ISO country codes, currency symbols, and standard industry classifications.

This walkthrough looks at one of them: Marital_Status

  1. Click: Data Operations > Data Identification Methods.

Data Identification Methods
  1. Click on Dictionaries.

PDC In-built Dictionaries
  1. Click the Name column header to sort A - Z.

Sort A - Z
  1. Go to page 4 of 5.

  2. Click on the 3 dots for Marital_Status > View

Dictionaries - View Marital_Status
View Dictionary

When data is profiled, column values are compared against the predefined dictionary (in our example, Marital_Status) using a Rule, which yields a degree of confidence in the match.

Once matched, the tags PII, Marital Status, and Non-Sensitive are applied.

  1. Click on Rules

The Rules tab shows the logic the dictionary uses to apply the tags listed in its JSON definition, such as the conditions and confidence scores. Based on these factors, you can apply dictionaries to datasets.

For example, in the following JSON file for Marital_Status the dictionary rule specifies:

  • the type is "Dictionary".

  • minSamples is 200.

  • the confidence score is the weighted sum (similarity × 0.9) + (metadataScore × 0.1), and the conditions require a confidence score greater than or equal to 0.7 and a column cardinality greater than or equal to 3.

  • if these conditions are met, the action is to apply the tags PII and Sensitive: Marital Status to the dataset.

This demonstrates how the provided logic guides the application of tags to datasets based on specified criteria.

[
    {
        "type": "Dictionary",
        "minSamples": 200,
        "confidenceScore": {
            "+": [
                {
                    "*": [
                        {
                            "var": "similarity"
                        },
                        0.9
                    ]
                },
                {
                    "*": [
                        {
                            "var": "metadataScore"
                        },
                        0.1
                    ]
                }
            ]
        },
        "condition": {
            "and": [
                {
                    ">=": [
                        {
                            "var": "confidenceScore"
                        },
                        "0.7"
                    ]
                },
                {
                    ">=": [
                        {
                            "var": "columnCardinality"
                        },
                        "3"
                    ]
                }
            ]
        },
        "actions": [
            {
                "applyTags": [
                    {
                        "name": "PII"
                    },
                    {
                        "name": "Sensitive",
                        "value": "Marital Status",
                        "t": "sdd;"
                    }
                ]
            }
        ]
    }
]
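
The rule above uses operators that resemble JsonLogic ("var", "+", "*", ">=", "and"). To make the arithmetic concrete, the sketch below evaluates a copy of the rule against made-up scores. The `evaluate` and `apply_rule` functions and the input values are hypothetical; this is not how Data Catalog itself executes rules. The sketch ignores `minSamples` and the `"t"` field because the text does not describe their semantics.

```python
import json

RULE = json.loads("""
{
  "type": "Dictionary",
  "minSamples": 200,
  "confidenceScore": {"+": [{"*": [{"var": "similarity"}, 0.9]},
                            {"*": [{"var": "metadataScore"}, 0.1]}]},
  "condition": {"and": [{">=": [{"var": "confidenceScore"}, "0.7"]},
                        {">=": [{"var": "columnCardinality"}, "3"]}]},
  "actions": [{"applyTags": [{"name": "PII"},
                             {"name": "Sensitive", "value": "Marital Status"}]}]
}
""")


def evaluate(expr, data):
    """Evaluate only the operators that appear in the rule above."""
    if not isinstance(expr, dict):
        return expr  # a literal such as 0.9 or "0.7"
    (op, args), = expr.items()
    if op == "var":
        return data[args]
    vals = [evaluate(a, data) for a in args]
    if op == "+":
        return sum(float(v) for v in vals)
    if op == "*":
        result = 1.0
        for v in vals:
            result *= float(v)
        return result
    if op == ">=":
        # The thresholds are stored as strings in the rule, so convert first.
        return float(vals[0]) >= float(vals[1])
    if op == "and":
        return all(vals)
    raise ValueError(f"unsupported operator: {op}")


def apply_rule(rule, similarity, metadata_score, column_cardinality):
    """Return (confidence score, tags to apply) for one column."""
    data = {
        "similarity": similarity,
        "metadataScore": metadata_score,
        "columnCardinality": column_cardinality,
    }
    data["confidenceScore"] = evaluate(rule["confidenceScore"], data)
    tags = []
    if evaluate(rule["condition"], data):
        for action in rule["actions"]:
            for tag in action.get("applyTags", []):
                tags.append((tag["name"], tag.get("value")))
    return data["confidenceScore"], tags


# Made-up inputs: 0.8 * 0.9 + 0.5 * 0.1 = 0.77, which clears the 0.7 threshold.
score, tags = apply_rule(RULE, similarity=0.8, metadata_score=0.5, column_cardinality=4)
print(round(score, 2), tags)
```

With a similarity of 0.6 and a metadataScore of 1.0 the score is 0.64, below the 0.7 threshold, so no tags are applied. The same happens when the column cardinality is below 3, whatever the score.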

To illustrate, let's apply a data identification policy to the 'patients' table. This identifies and tags various columns in the table, which helps you:

  • quickly locate relevant information, saving time and effort when exploring vast data repositories.

  • link related data across different sections of the taxonomy.

  • derive maximum value from information assets by understanding their context and purpose.

  1. Click the Data Identification tile.

Data Identification
  1. Click 'Select Methods'.

Data Identification - Select methods

Applying Data Dictionary + Pattern Analysis = Policy

  1. Select the following Data Dictionaries & Data Patterns:

Method
Data Dictionaries
Pattern Analysis

USA_SSN

Social Security Number

Country Codes

Country Names

Country Names

Country Names

DoB

Date of Birth

USA States

States in USA

Data Identification - Policy
  1. Click Start.

  2. Track the job in Workers.

Workers - Data Identification Job
  1. In Data Canvas, check that the sensitive data in the Synthea -> 'patients' table has now been identified - tagged as PII & Sensitive.

Tags - PII & Sensitive Data
