Data Identification
Dictionaries & Data Patterns ..
Data Identification
Pentaho Data Catalog uses data identification methods called Dictionaries and Data Patterns to help you identify data.

Data Dictionaries in Data Catalogs
A data dictionary in a data catalog is a specialized collection of predefined terms, values, and definitions used to automatically classify and tag data elements within an organization's datasets. Unlike traditional data dictionaries that simply define terminology, catalog data dictionaries function as intelligent matching tools that scan actual data content to identify and categorize information.
These dictionaries serve as automated reference systems that help data catalogs recognize specific types of data by comparing column values against predefined lists of known terms. For example, a country codes dictionary might contain "US," "CA," "GB" to automatically identify geography-related columns, while a product name dictionary could contain specific product identifiers to classify commercial data.

Log into Data Catalog:
Username: [email protected]
Password: Welcome123!
Types of Data Dictionaries
Modern data catalogs typically support two categories of dictionaries:
System-Defined Dictionaries: Built-in collections that come pre-configured with the catalog, containing common data types like ISO country codes, currency symbols, or standard industry classifications.
User-Defined Dictionaries: Custom collections created by organizations to match their specific business context and terminology. These can be created through multiple approaches:
Importing structured files (CSV with JSON definitions)
Building dictionaries through the user interface
Extracting dictionary terms directly from existing profiled data columns
Pentaho Data Catalog ships with 95 In-built Dictionaries, pre-configured with common data types: e.g. ISO country codes, currency symbols, or standard industry classifications.
Let's take a look at: Marital_Status
Click: Data Operations > Data Identification Methods.

Click on Dictionaries.

Click on the Name to sort A - Z.

Scroll to Page ⅘
Click on the 3 dots for Marital_Status > View


Click on Rules
[
{
"type": "Dictionary",
"minSamples": 200,
"confidenceScore": {
"+": [
{
"*": [
{
"var": "similarity"
},
0.9
]
},
{
"*": [
{
"var": "metadataScore"
},
0.1
]
}
]
},
"condition": {
"and": [
{
">=": [
{
"var": "confidenceScore"
},
"0.7"
]
},
{
">=": [
{
"var": "columnCardinality"
},
"3"
]
}
]
},
"actions": [
{
"applyTags": [
{
"name": "PII"
},
{
"name": "Sensitive",
"value": "Marital Status",
"t": "sdd;"
}
]
}
]
}
]Click the Data Identification tile.

Click 'Select Methods'.

Select the following Data Dictionaries & Data Patterns:
USA_SSN
Social Security Number
Country Codes
Country Names
Country Names
Country Names
DoB
Date of Birth
USA States
States in USA

Click Start.
Track the Job in the Workers.

In Data Canvas, check that the sensitive data in the Synthea -> 'patients' table has now been identified - tagged as PII & Sensitive.

x
x
Last updated
Was this helpful?










