# Data Identification Methods

{% hint style="info" %}

A Pentaho Data Catalog Data Identification policy combines Data Dictionaries and Data Patterns.
{% endhint %}

<figure><img src="/files/XvKUoZscabCMGiu9wKLT" alt=""><figcaption><p>Management - Data Identification Methods</p></figcaption></figure>

#### Accessing Your Catalog

To access your catalog, please follow these steps:

1. Open the Google Chrome web browser and click the bookmark, or navigate to: [**https://pdc.pentaho.example/**](https://pdc.pentaho.example/)
2. Enter the following email and password, then click Sign In.

<table data-header-hidden><thead><tr><th width="156"></th><th></th></tr></thead><tbody><tr><td>Username</td><td>data_steward@hv.com</td></tr><tr><td>Password</td><td>Welcome123!</td></tr></tbody></table>

{% hint style="warning" %}

#### **Security Advisory: Handling Login Credentials**

For enhanced security, it is strongly recommended that users avoid saving their login details directly in web browsers. Browsers may inadvertently autofill these credentials in unrelated fields, posing a security risk.

**Best Practice**

* **Disable Autofill:** To mitigate potential risks, users should disable the autofill functionality for login credentials in their browser settings. This preventive measure helps keep sensitive information from being unintentionally exposed or misused.
{% endhint %}

3. From the Business Rules card click Add New and select: Add Business Rule.

{% tabs %}
{% tab title="1. Dictionaries" %}
{% hint style="info" %}
Data dictionaries contain technical information about data assets, such as data sources, fields and data types. They are typically used by technical audiences such as data engineers and data analysts to understand the data. Data catalogs contain much broader and deeper data intelligence than data dictionaries do.
{% endhint %}

<figure><img src="/files/Vkg8s9yh3Fb5KWbk2m5y" alt=""><figcaption><p>Data Dictionary v Data Catalog</p></figcaption></figure>

{% endtab %}

{% tab title="2. Data Patterns" %}
{% hint style="info" %}

#### Data Patterns in Data Identification

Data patterns play a crucial role in identifying and categorizing data within a data catalog. These patterns are essentially recurring characteristics or behaviors in data sets that can be recognized and used to automate data management.

The 'Getting Started' -> 'Identify the data' section explains how data patterns are used to profile the data.
{% endhint %}

<div align="left"><figure><img src="/files/wxr6Sw0KyqXwTkrZoDsT" alt=""><figcaption><p>If our data comes from a certain probability distribution, we can reduce its size by estimating the parameters of this distribution.</p></figcaption></figure></div>

{% tabs %}
{% tab title="2.1 Data Patterns" %}
{% hint style="info" %}
Data Pattern Analysis reduces each data item to a simple pattern, essentially applying dimensionality reduction to each character position in the input text. The result is a string that indicates where alphabetic characters, numeric characters, symbols, and whitespace appear.
{% endhint %}

{% hint style="info" %}
**KT-1734B** generates a data pattern of “AA-nnnnA”, indicating two letters, followed by a dash, followed by four digits and another letter.

Case sensitivity can optionally be tracked as well. The set of “significant” symbols might also be user-configurable (e.g., “As a data quality engineer, for this column, a dash and an underscore are significant”).

The base process iterates over every character in the data item and performs a simple character-for-character substitution, resulting in a “data pattern” string for the item.
{% endhint %}

The pattern consists of the following characters:

<table><thead><tr><th width="192">Character</th><th>Description</th></tr></thead><tbody><tr><td>a</td><td>lowercase alphabetic character</td></tr><tr><td>A</td><td>uppercase alphabetic character</td></tr><tr><td>n</td><td>digit 0..9</td></tr><tr><td>w</td><td>whitespace character (space, tab)</td></tr><tr><td>s</td><td>symbol character (e.g., -/|!£$%^&#x26;*()+=[]{}@#~;:,.?¬¥§¢" )</td></tr><tr><td>-</td><td>Some other character (control, special symbol, etc.)</td></tr><tr><td>Others</td><td>Any other symbol may be treated as “significant” (such as a dash, underscore, or colon). These are output as-is in the generated data pattern for the entry.</td></tr></tbody></table>
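The character classes in the table can be sketched in a few lines of Python. This is only an illustration of the substitution idea, not PDC's implementation: the function name `data_pattern` and the parameters `significant` and `track_case` are assumptions introduced for this example.

```python
def data_pattern(value: str, significant: str = "-_", track_case: bool = True) -> str:
    """Reduce a data item to its pattern, one output character per input character.

    Hypothetical helper: `significant` and `track_case` are illustrative
    parameters, not PDC settings.
    """
    out = []
    for ch in value:
        if ch in significant:
            out.append(ch)   # significant symbols are output as-is
        elif "0" <= ch <= "9":
            out.append("n")  # digit 0..9
        elif ch.isalpha():
            # without case tracking, every letter is reported as "A"
            out.append("a" if track_case and ch.islower() else "A")
        elif ch in " \t":
            out.append("w")  # whitespace (space, tab)
        elif ch.isprintable():
            out.append("s")  # any other printable symbol
        else:
            out.append("-")  # control or other special character
    return "".join(out)


print(data_pattern("KT-1734B"))  # AA-nnnnA
```

Following the table, `-` serves both as the marker for "some other character" and as a literal dash when the dash is configured as significant, so a real implementation would probably need a way to tell the two apart.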

{% hint style="warning" %}
Additional tests could be built into the algorithm to look for certain additional characteristics. For example, date formats can be very tricky. PDC could observe that ‘nn/nn/nnnn’ is a date and could then observe whether it is predominantly ‘mm/dd/yyyy’ or ‘dd/mm/yyyy’.

Another enhancement is detecting credit card numbers.
{% endhint %}
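One way such a date test might work is to let unambiguous values vote: a field greater than 12 cannot be a month. The sketch below is an assumption about how this could be done, not a description of PDC behavior, and `infer_date_order` is a hypothetical name.

```python
from collections import Counter


def infer_date_order(values):
    """Vote on the field order of values shaped like nn/nn/nnnn."""
    votes = Counter()
    for value in values:
        parts = value.split("/")
        shape_ok = [len(p) for p in parts] == [2, 2, 4]
        if not shape_ok or not all(p.isascii() and p.isdigit() for p in parts):
            continue  # not an nn/nn/nnnn value
        first, second = int(parts[0]), int(parts[1])
        if 12 < first <= 31 and 1 <= second <= 12:
            votes["dd/mm/yyyy"] += 1  # first field cannot be a month
        elif 12 < second <= 31 and 1 <= first <= 12:
            votes["mm/dd/yyyy"] += 1  # second field cannot be a month
        # values such as 01/02/2022 fit both orders and cast no vote
    dd, mm = votes["dd/mm/yyyy"], votes["mm/dd/yyyy"]
    if dd == mm:
        return "ambiguous"
    return "dd/mm/yyyy" if dd > mm else "mm/dd/yyyy"


print(infer_date_order(["03/25/2021", "12/31/2020", "01/02/2022"]))  # mm/dd/yyyy
```

A column in which every day is 12 or lower stays ambiguous under this approach, so a real check would likely need additional evidence such as locale or column metadata.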
{% endtab %}

{% tab title="2.2 Pattern Character Position" %}
{% hint style="info" %}
Let's look at an example:

Part numbers often begin with two or three designated letters. This observation helps in defining a more precise RegEx rule based on observed patterns.

Additionally, tracking the "largest" and "smallest" values for each character position in these patterns reveals the degree of variability per position. Each time a pattern recurs, a counter tallies its occurrence; upon identifying a new pattern, the system stores the analyzed data as a distinct “sample” for that pattern.
{% endhint %}

The first step is to generate a substitution string (for the purposes of this example, not all possible characters are shown):

abcdefghijklmnopqrstuvwxyz ABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789/|"!£$%^&\*()+=\[]{}@#\~;:,.?¬¥§¢

aaaaaaaaaaaaaaaaaaaaaaaaaawAAAAAAAAAAAAAAAAAAAAAAAAAAnnnnnnnnnn/sssssssssssssssssssssssssssss

The top row is the character lookup row, and the bottom row is the substitution to be made for each character position.

For example, “KT127-3” would generate the simple pattern “AAnnn-n”. The largest and smallest characters seen for each character position are also tracked.
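The lookup row and substitution row map naturally onto Python's `str.maketrans` and `str.translate`. The sketch below rebuilds both rows from the example above; the constant and function names are illustrative assumptions. Characters that are absent from the lookup row, such as the dash, pass through unchanged, which matches the “AAnnn-n” result.

```python
import string

# Symbols from the lookup row; "/" is kept as-is, the rest map to "s".
SYMBOLS = '/|"!£$%^&*()+=[]{}@#~;:,.?¬¥§¢'

LOOKUP = string.ascii_lowercase + " " + string.ascii_uppercase + string.digits + SYMBOLS
SUBSTITUTION = "a" * 26 + "w" + "A" * 26 + "n" * 10 + "/" + "s" * (len(SYMBOLS) - 1)

# Both rows must have the same length for a one-to-one mapping.
TABLE = str.maketrans(LOOKUP, SUBSTITUTION)


def to_pattern(value: str) -> str:
    """Apply the character-for-character substitution to one data item."""
    return value.translate(TABLE)


print(to_pattern("KT127-3"))  # AAnnn-n
```

Building the table once and reusing it keeps the per-item cost to a single `translate` call.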

Consider a set of tracking numbers and the associated pattern for each:

<table><thead><tr><th width="187">Code</th><th width="214">Pattern</th></tr></thead><tbody><tr><td>KT17341</td><td><mark style="color:blue;">AAnnnnn</mark></td></tr><tr><td>KL91632</td><td><mark style="color:blue;">AAnnnnn</mark></td></tr><tr><td>KW81234</td><td><mark style="color:blue;">AAnnnnn</mark></td></tr><tr><td>KW91020</td><td><mark style="color:blue;">AAnnnnn</mark></td></tr><tr><td>KA002021</td><td><mark style="color:red;">AAnnnnnn</mark></td></tr></tbody></table>

{% hint style="info" %}
Additionally, we capture the largest and smallest character seen in each character position. This allows us to potentially determine if there are fixed characters in the pattern, and to generate stricter RegEx recommendations.
{% endhint %}

* <mark style="color:blue;">AAnnnnn – Occurs 4 times</mark>
  * KL11020 – Lowest character seen in each position
  * KW97644 – Highest character seen in each position
* <mark style="color:red;">AAnnnnnn – Occurs 1 time</mark>
  * KA002021 – Lowest character seen in each position
  * KA002021 – Highest character seen in each position
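The bookkeeping described above (an occurrence counter, a stored sample, and the lowest and highest character per position) could be sketched as follows. `PatternStats`, `simple_pattern`, and `profile` are hypothetical names for this illustration, and the pattern function is deliberately simplified to letters and digits.

```python
from dataclasses import dataclass


@dataclass
class PatternStats:
    sample: str        # first value seen for this pattern
    count: int = 0     # number of occurrences
    lowest: str = ""   # smallest character seen in each position
    highest: str = ""  # largest character seen in each position

    def observe(self, value: str) -> None:
        self.count += 1
        if self.count == 1:
            self.lowest = self.highest = value
        else:
            # values sharing a pattern share a length, so compare position by position
            self.lowest = "".join(map(min, self.lowest, value))
            self.highest = "".join(map(max, self.highest, value))


def simple_pattern(value: str) -> str:
    """Simplified substitution: digits -> n, letters -> A/a, anything else as-is."""
    return "".join(
        "n" if c.isdigit() else "A" if c.isupper() else "a" if c.islower() else c
        for c in value
    )


def profile(values):
    stats = {}
    for value in values:
        pattern = simple_pattern(value)
        if pattern not in stats:
            stats[pattern] = PatternStats(sample=value)  # new pattern: keep a sample
        stats[pattern].observe(value)
    return stats


codes = ["KT17341", "KL91632", "KW81234", "KW91020", "KA002021"]
for pattern, s in profile(codes).items():
    print(pattern, s.count, s.lowest, s.highest)
```

Only two strings per pattern are stored, so the memory cost stays constant no matter how many values are profiled.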

The top \~20 data patterns are captured and stored for subsequent consumption by data-quality and other processes as needed.
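As an illustration of how a downstream process might consume the stored patterns, the sketch below keeps the most frequent patterns and derives a stricter RegEx for each one from the per-position character range. All names (`pattern_of`, `suggest_regex`, `top_pattern_regexes`) and the `limit` parameter are assumptions for this example. For brevity it keeps every sample in memory rather than only the lowest and highest strings.

```python
import re
from collections import defaultdict


def pattern_of(value: str) -> str:
    """Simplified substitution: digits -> n, letters -> A/a, anything else as-is."""
    return "".join(
        "n" if c.isdigit() else "A" if c.isupper() else "a" if c.islower() else c
        for c in value
    )


def suggest_regex(samples):
    """Build a RegEx from the lowest/highest character in each position."""
    if not samples or len({len(s) for s in samples}) != 1:
        raise ValueError("samples must be non-empty and share one length")
    parts = []
    for chars in zip(*samples):
        lo, hi = min(chars), max(chars)
        if lo == hi:
            parts.append(re.escape(lo))  # fixed character in this position
        else:
            parts.append(f"[{re.escape(lo)}-{re.escape(hi)}]")
    return "".join(parts)


def top_pattern_regexes(values, limit=20):
    groups = defaultdict(list)
    for value in values:
        groups[pattern_of(value)].append(value)
    # most frequent patterns first; sorted() keeps first-seen order on ties
    ranked = sorted(groups.items(), key=lambda kv: len(kv[1]), reverse=True)[:limit]
    return [(p, len(s), suggest_regex(s)) for p, s in ranked]


codes = ["KT17341", "KL91632", "KW81234", "KW91020", "KA002021"]
for pattern, count, regex in top_pattern_regexes(codes):
    print(pattern, count, regex)
# AAnnnnn 4 K[L-W][1-9][1-7][0-6][2-4][0-4]
# AAnnnnnn 1 KA002021
```

A range such as `[L-W]` is wider than the set of characters actually observed, so a recommendation like this is a starting point for review rather than a final rule.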
{% endtab %}
{% endtabs %}
{% endtab %}
{% endtabs %}

