Data Identification
Dictionaries & Data Patterns
Pentaho Data Catalog uses data identification methods called Dictionaries and Data Patterns to help you identify data.

Data Dictionaries in Data Catalogs
A data dictionary in a data catalog is a specialized collection of predefined terms, values, and definitions used to automatically classify and tag data elements within an organization's datasets. Unlike traditional data dictionaries that simply define terminology, catalog data dictionaries function as intelligent matching tools that scan actual data content to identify and categorize information.
These dictionaries serve as automated reference systems that help data catalogs recognize specific types of data by comparing column values against predefined lists of known terms. For example, a country codes dictionary might contain "US," "CA," "GB" to automatically identify geography-related columns, while a product name dictionary could contain specific product identifiers to classify commercial data.
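As a sketch of how this value matching can work (the function name and scoring are illustrative, not Data Catalog's actual implementation), a dictionary match can be scored as the fraction of distinct column values found in the dictionary:

```python
# Hypothetical country-codes dictionary, per the example above.
COUNTRY_CODES = {"US", "CA", "GB", "DE", "FR"}

def dictionary_similarity(column_values, dictionary):
    """Return the fraction of distinct column values found in the dictionary."""
    distinct = {str(v).strip().upper() for v in column_values}
    if not distinct:
        return 0.0
    return len(distinct & dictionary) / len(distinct)

print(dictionary_similarity(["US", "CA", "GB", "US", "CA"], COUNTRY_CODES))  # 1.0
```

A high score suggests the column holds geography-related codes even if its name is something opaque like `col_17`.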

How Data Dictionaries Enable Data Discovery
Data dictionaries in catalogs primarily support column data matching - the process of automatically identifying what type of information a data column contains based on its actual values rather than just column names or metadata. This is particularly valuable for data elements that can't be identified through pattern matching alone, such as:
Country or state codes
Product names or SKUs
Department codes
Custom business terminology
Industry-specific classifications
Benefits for Data Management
By implementing data dictionaries, organizations can achieve:
Automated Data Classification: Systematic identification of data types without manual tagging
Improved Data Discovery: Users can find relevant datasets by searching for business terms rather than technical column names
Consistency Across Systems: Standardized identification of similar data elements across different databases and applications
Enhanced Data Governance: Better understanding of what data exists and where it's located
This automated approach transforms data catalogs from passive repositories into active discovery tools that understand the semantic meaning of organizational data.
Log into Data Catalog:
Username: [email protected]
Password: Welcome123!
Types of Data Dictionaries
Modern data catalogs typically support two categories of dictionaries:
System-Defined Dictionaries: Built-in collections that come pre-configured with the catalog, containing common data types like ISO country codes, currency symbols, or standard industry classifications.
User-Defined Dictionaries: Custom collections created by organizations to match their specific business context and terminology. These can be created through multiple approaches:
Importing structured files (CSV with JSON definitions)
Building dictionaries through the user interface
Extracting dictionary terms directly from existing profiled data columns
Pentaho Data Catalog ships with 95 built-in dictionaries, pre-configured with common data types such as ISO country codes, currency symbols, and standard industry classifications.
Let's take a look at: Marital_Status
Click: Data Operations > Data Identification Methods.

Click on Dictionaries.

Click on the Name to sort A - Z.

Scroll to page 4 of 5.
Click on the 3 dots for Marital_Status > View


When the data is profiled, the values in our example column, 'Marital_Status', are compared against the predefined dictionary using a Rule, which produces a degree of confidence.
Once matched, the Tags PII, Marital Status, and Non-Sensitive are applied.
Click on Rules
The Rules view provides insight into the logic the dictionary uses to apply the tags defined in its JSON file, such as conditions and confidence scores. Based on these factors, you can apply dictionaries to datasets.
For example, in the following JSON file for Marital_Status the dictionary rule specifies:
the type is "Dictionary".
the confidence score is a weighted combination of similarity (weight 0.9) and metadataScore (weight 0.1), with conditions set to apply when the confidence score is greater than or equal to 0.7 and the column cardinality is greater than or equal to 3.
if these conditions are met, the action is to apply the tags PII and Sensitive: Marital Status to the dataset.
This demonstrates how the provided logic guides the application of tags to datasets based on specified criteria.
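The rule logic described above can be sketched in a few lines. The function and parameter names here are illustrative; the weights, thresholds, and tags come from the Marital_Status example:

```python
def evaluate_dictionary_rule(similarity, metadata_score, cardinality):
    """Sketch of the Marital_Status rule: weighted confidence plus conditions.

    The weights (0.9 / 0.1), thresholds (confidence >= 0.7, cardinality >= 3)
    and resulting tags are taken from the example; everything else is illustrative.
    """
    confidence = 0.9 * similarity + 0.1 * metadata_score
    if confidence >= 0.7 and cardinality >= 3:
        return ["PII", "Sensitive: Marital Status"], confidence
    return [], confidence

tags, confidence = evaluate_dictionary_rule(0.95, 0.8, 5)
print(tags)  # ['PII', 'Sensitive: Marital Status'] (confidence is about 0.935)
```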
Follow steps 1-6 outlined in: View - System Dictionary
Select: Edit

Dictionary UI
For now, here is a quick overview of the fields and their descriptions; we'll cover the UI in more detail when creating our own User-defined Dictionaries:

Category
Select a category or type a category name and click Add New to add it as a new category.
You can also remove an existing category.
Apply Values
Upload Dictionary
Click Upload Dictionary to upload a one-column CSV file containing the dictionary definitions, and enter a number from 0.00 to 1.00 for the confidence score.
Select Column
Click Select Column to set a profiled column of data to use for the dictionary.
Click Add Column to browse the navigation tree for a column to add for dictionary values.
Select the column you want to use. Note: The column you select must already be profiled.
Click Update.
Enter a number from 0.00 to 1.00 for the confidence score.
Column Name Regex (1)
Regex
Add a regex as a metadata hint for the column name. Enter a number from 0.00 to 1.00 for the confidence score.
Condition
AND / OR
In the Condition pane, you can click the Delete icon and remove the existing condition or click Add Condition and select an Attribute, Operator, and Value to set an additional condition.
Select either AND or OR to apply multiple conditions to evaluate and match the data.
Actions
Assign Tags
Click Assign Tags and enter a tag to assign to the data.
Assign Table Tags
Click Assign Table Tags and enter a table tag to assign to the data.
Assign Business Term
Click Assign Business Term to select a business term to assign to the data.
Browse the navigation tree for one or more business terms and select the associated checkboxes.
A number on the Add button shows the number of terms you have selected.
Click Add.
Data Patterns
A pattern analysis (or data patterns) defines the data pattern, regular expression, column alias(es) and tags that you can use to identify a column of data. You can use data patterns for a variety of purposes, such as regular expression (RegEx) generation, data identification, and data quality checking.

Data Patterns in Data Identification
Data patterns play a crucial role in identifying and categorizing data within a data catalog. These patterns are essentially recurring characteristics or behaviors in data sets that can be recognized and used to automate data management.
The 'Getting Started' -> 'Identify the data' section explained how data patterns are used to profile the data.

Pattern Analysis
Data Pattern Analysis reduces each data item to a simple pattern, essentially applying dimensional reduction to each character position in the input text. The result is a string that indicates where alphabetic characters, numeric characters, symbols, and whitespace appear.
Widget Parts #
KT1734-B
KT189A-D
KT2231-C
A part code such as KT1734-B generates a data pattern of “AAnnnn-A”, indicating two letters, followed by four digits, a dash, and another letter.
Case sensitivity could optionally be tracked as well. Also, the set of “significant” symbols might be user-configurable (i.e., “As a data quality engineer, for this column, a dash and an underscore are significant”).
The base process iterates over every character in the data item and performs a simple character-for-character substitution, resulting in a “data pattern” string for the item.
The pattern consists of the following characters:
a – lowercase alphabetic character
A – uppercase alphabetic character
n – digit (0-9)
w – whitespace character (space, tab)
s – symbol character (e.g., -/|!£$%^&*()+=[]{}@#~;:,.?¬¥§¢" )
- – some other character (control, special symbol, etc.)
Others
Any other symbol may be treated as “significant” (such as a dash, underscore, or colon). These are output as-is in the generated data pattern for the entry.
Additional tests could be built into the algorithm to look for further characteristics. For example, date formats can be tricky: PDC could observe that ‘nn/nn/nnnn’ is a date and then determine whether it is predominantly ‘mm/dd/yyyy’ or ‘dd/mm/yyyy’.
Another enhancement is detecting credit card numbers.
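One possible heuristic for the date-format enhancement (purely illustrative, not how PDC implements it) is to look for a component greater than 12, which cannot be a month:

```python
def infer_date_order(samples):
    """Guess whether nn/nn/nnnn values are mm/dd/yyyy or dd/mm/yyyy.

    Hypothetical heuristic: a component greater than 12 cannot be a month.
    """
    first_gt_12 = second_gt_12 = 0
    for sample in samples:
        first, second, _year = sample.split("/")
        if int(first) > 12:
            first_gt_12 += 1
        if int(second) > 12:
            second_gt_12 += 1
    if first_gt_12 and not second_gt_12:
        return "dd/mm/yyyy"
    if second_gt_12 and not first_gt_12:
        return "mm/dd/yyyy"
    return "ambiguous"

print(infer_date_order(["23/01/2024", "07/11/2023"]))  # dd/mm/yyyy
```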
Pattern Character Position
Let's dive a bit deeper into: Part Codes
Part numbers often begin with two or three designated letters. This observation helps in defining a more precise RegEx rule based on observed patterns.
Additionally, tracking the "largest" and "smallest" values for each character position in these patterns reveals the degree of variability per position. Each time a pattern recurs, a counter tallies its occurrence; upon identifying a new pattern, the system stores the analyzed data as a distinct “sample” for that pattern.
The first step is to generate a substitution string (for purpose of the example, not all possible characters are shown):
abcdefghijklmnopqrstuvwxyz ABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789/|"!£$%^&*()+=[]{}@#~;:,.?¬¥§¢
aaaaaaaaaaaaaaaaaaaaaaaaaawAAAAAAAAAAAAAAAAAAAAAAAAAAnnnnnnnnnn/sssssssssssssssssssssssssssss
The top row is the character lookup row; and the bottom row is the substitution to be made for each character position.
For example, “KT127-3” would generate a simple pattern “AAnnn-n”. Additionally, the largest and smallest character seen for each character position is also tracked.
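The character-for-character substitution can be sketched as follows. This is a minimal illustration of the mapping; the set of significant symbols is configurable, here defaulting to the dash:

```python
def data_pattern(value, significant="-"):
    """Generate a data pattern string via per-character substitution.

    Significant symbols (configurable) are emitted as-is; other
    characters are reduced to the a/A/n/w/s classes described above.
    """
    out = []
    for ch in value:
        if ch in significant:
            out.append(ch)        # significant symbols pass through
        elif ch.islower():
            out.append("a")
        elif ch.isupper():
            out.append("A")
        elif ch.isdigit():
            out.append("n")
        elif ch in " \t":
            out.append("w")
        else:
            out.append("s")
    return "".join(out)

print(data_pattern("KT127-3"))   # AAnnn-n
print(data_pattern("KT1734-B"))  # AAnnnn-A
```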
Consider a set of Part Codes and the associated pattern for each:
KT17341 – AAnnnnn
KL91632 – AAnnnnn
KW81234 – AAnnnnn
KW91020 – AAnnnnn
KA002021 – AAnnnnnn
Additionally, we capture the largest and smallest character seen in each character position. This allows us to potentially determine if there are fixed characters in the pattern, and to generate stricter RegEx recommendations.
AAnnnnn – Occurs 4 times
KL11020 – Lowest character seen in each position
KW97644 – Highest character seen in each position
AAnnnnnn – Occurs 1 time
KA002021 – Lowest character seen in each position
KA002021 – Highest character seen in each position
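The pattern counting and per-position lowest/highest tracking described above can be sketched as follows (the function names and the simplified pattern generator are illustrative):

```python
from collections import defaultdict

def summarize_patterns(values):
    """Count each pattern and track the lowest/highest character per position.

    Uses a simplified a/A/n substitution; whitespace and symbol classes
    are omitted for brevity.
    """
    def pattern(v):
        return "".join(
            "A" if c.isupper() else "a" if c.islower()
            else "n" if c.isdigit() else c
            for c in v)

    stats = defaultdict(lambda: {"count": 0, "low": None, "high": None})
    for v in values:
        s = stats[pattern(v)]
        s["count"] += 1
        if s["low"] is None:
            s["low"], s["high"] = list(v), list(v)
        else:
            s["low"] = [min(a, b) for a, b in zip(s["low"], v)]
            s["high"] = [max(a, b) for a, b in zip(s["high"], v)]
    return {p: (d["count"], "".join(d["low"]), "".join(d["high"]))
            for p, d in stats.items()}

codes = ["KT17341", "KL91632", "KW81234", "KW91020", "KA002021"]
print(summarize_patterns(codes))
```

Running this on the part codes above reports AAnnnnn occurring 4 times (highest characters "KW97644") and AAnnnnnn once.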
The top ~20 data patterns are captured and stored for subsequent consumption by data quality and other processes as needed.
Let's run through an example: USA_GBR_Passport_Numbers.
Navigate to the 'Management' tile and click on Patterns.
Search for: USA_GBR_Passport_Numbers.

When the data is profiled, in our example the 'passport' column, the regexMatch produces a profilePattern that is compared (with a degree of confidence) against the predefined profilePattern. Once matched, the Tags PII, GBR and USA Passport, and Sensitive are applied.

Fundamental to data quality analysis is either the use of a regular expression to check data, or statistical analysis of the data itself to find patterns and outlier patterns (which could indicate bad data).
View the data in the Patients table.

The data identification process generates roughly the top 20 most common patterns which capture the characteristics of the data. You can then use these patterns, along with their statistical frequency and supplementary information, to generate regular expression (RegEx) recommendations for your data. You can tune the RegEx to meet your specific needs, or select valid patterns, so that subsequent data quality checks will identify any data entries that are outside the accepted patterns.
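Generating a RegEx recommendation from a data pattern can be sketched as follows. This is an illustrative conversion, not Data Catalog's actual generator:

```python
import re
from itertools import groupby

def pattern_to_regex(pattern):
    """Convert a data pattern such as 'AAnnnnn' into a RegEx recommendation.

    Consecutive runs of each pattern class are collapsed into {count}
    quantifiers; significant symbols are escaped and kept literal.
    """
    classes = {"A": "[A-Z]", "a": "[a-z]", "n": "[0-9]", "w": "\\s"}
    parts = []
    for ch, run in groupby(pattern):
        count = len(list(run))
        cls = classes.get(ch, re.escape(ch))
        parts.append(cls + (f"{{{count}}}" if count > 1 else ""))
    return "^" + "".join(parts) + "$"

rx = pattern_to_regex("AAnnnnn")
print(rx)                             # ^[A-Z]{2}[0-9]{5}$
print(bool(re.match(rx, "KT17341")))  # True
print(bool(re.match(rx, "KT1734-B"))) # False -- an outlier for this pattern
```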

For further information: take a look at Data Patterns in Data Identification Methods.
Next -> 2.2 Rules
Policies
The dictionaries and data patterns are together referred to as data identification policies. There are many policies included with Data Catalog, covering categories from a wide range of business sectors, such as Finance, Education, Aviation, Law Enforcement, PCI-DSS and Data Privacy.
After running data identification, you can use the Galaxy View feature to visualize the data tagging, identify the data flow, locate your data, and view the sensitivity and security.
Select the following 'Method Names' to define your policy:

The search bar at the top of the pop-up window can be used to search policies; this becomes more helpful as the list of policies grows.
Once the Data Identification step has completed, check the results in the Data Canvas.

Note the relevant 'patients' fields have now been tagged.
To illustrate, let's apply a data identification policy to the 'patients' table. This will identify and tag various columns in the table, which helps:
to quickly locate relevant information, saving time and effort when exploring vast data repositories.
link related data across different sections of the taxonomy.
derive maximum value from information assets by understanding their context and purpose.
Click the Data Identification tile.

Click 'Select Methods'.

Applying Data Dictionary + Pattern Analysis = Policy
Select the following Data Dictionaries & Data Patterns:
USA_SSN – Social Security Number
Country Codes
Country Names
DoB – Date of Birth
USA States – States in USA

Click Start.
Track the Job in the Workers.

In Data Canvas, check that the sensitive data in the Synthea -> 'patients' table has now been identified - tagged as PII & Sensitive.

