Data Identification Methods
Data Dictionaries & Patterns

Accessing Your Catalog
To access your catalog, please follow these steps:
Open the Google Chrome web browser and click the bookmark, or
navigate to: https://pdc.pentaho.example/
Enter the following email and password, then click Sign In.
Username
Password
Welcome123!
Security Advisory: Handling Login Credentials
For enhanced security, it is strongly recommended that users avoid saving their login details directly in web browsers. Browsers may inadvertently autofill these credentials in unrelated fields, posing a security risk.
Best Practice
• Disable Autofill: To mitigate potential risks, users should disable the autofill functionality for login credentials in their browser settings. This preventive measure ensures that sensitive information is not unintentionally exposed or misused.
From the Business Rules card, click Add New and select Add Business Rule.

Data Patterns in Data Identification
Data patterns play a crucial role in identifying and categorizing data within a data catalog. These patterns are essentially recurring characteristics or behaviors in data sets that can be recognized and used to automate data management.
The 'Getting Started' -> 'Identify the data' section explained how data patterns are used to profile the data.

The pattern consists of the following characters:
• a: lowercase alphabetic character
• A: uppercase alphabetic character
• n: digit 0..9
• w: whitespace character (space, tab)
• s: symbol character (e.g., -/|!£$%^&*()+=[]{}@#~;:,.?¬¥§¢" )
• -: some other character (control, special symbol, etc.)
• Others: any other symbol may be treated as “significant” (such as a dash, underscore, or colon). These are output as-is in the generated data pattern for the entry.
Additional tests could be built into the algorithm to look for further characteristics. Date formats, for example, can be tricky: PDC could observe that ‘nn/nn/nnnn’ is a date and then determine whether it is predominantly ‘mm/dd/yyyy’ or ‘dd/mm/yyyy’.
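One way such a date test might work is sketched below. This is an illustration only, not PDC's actual logic: the function name infer_date_order and the voting rule are assumptions. A value only casts a vote when one of its first two fields is greater than 12 and therefore cannot be a month.

```python
def infer_date_order(values):
    """Guess whether nn/nn/nnnn values are mostly mm/dd/yyyy or dd/mm/yyyy.

    Hypothetical helper: only unambiguous values (one field above 12) vote.
    """
    day_first = 0
    month_first = 0
    for value in values:
        parts = value.split("/")
        if len(parts) != 3 or not all(p.isdecimal() for p in parts):
            continue  # not an nn/nn/nnnn style value, so it is ignored
        first, second = int(parts[0]), int(parts[1])
        if 12 < first <= 31 and 1 <= second <= 12:
            day_first += 1      # first field cannot be a month
        elif 12 < second <= 31 and 1 <= first <= 12:
            month_first += 1    # second field cannot be a month
    if day_first == month_first:
        return "ambiguous"      # no evidence, or evenly split evidence
    return "dd/mm/yyyy" if day_first > month_first else "mm/dd/yyyy"
```

Values such as 01/02/2021 fit both layouts and are deliberately not counted, so a column made up only of such values is reported as ambiguous.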
Another enhancement is detecting credit card numbers.
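The text does not say how credit card numbers would be detected. A common technique, shown here purely as an assumption, is the Luhn checksum applied to values whose pattern is all digits (optionally separated by spaces or dashes). The function name luhn_valid and the 13 to 19 digit length range are illustrative choices.

```python
def luhn_valid(number: str) -> bool:
    """Return True if the value passes the Luhn checksum.

    Assumed rules: spaces and dashes are separators, and a candidate
    card number has between 13 and 19 digits.
    """
    compact = number.replace(" ", "").replace("-", "")
    if not compact.isdecimal() or not 13 <= len(compact) <= 19:
        return False
    total = 0
    # Walk from the rightmost digit; double every second digit.
    for index, ch in enumerate(reversed(compact)):
        digit = int(ch)
        if index % 2 == 1:
            digit *= 2
            if digit > 9:
                digit -= 9  # same as adding the two digits of the product
        total += digit
    return total % 10 == 0
```

A passing checksum only indicates that a value could be a card number; it would normally be combined with the data pattern and other evidence before a column is tagged.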
The first step is to generate a substitution string (for the purposes of this example, not all possible characters are shown):
abcdefghijklmnopqrstuvwxyz ABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789/|"!£$%^&*()+=[]{}@#~;:,.?¬¥§¢
aaaaaaaaaaaaaaaaaaaaaaaaaawAAAAAAAAAAAAAAAAAAAAAAAAAAnnnnnnnnnn/sssssssssssssssssssssssssssss
The top row is the character lookup row, and the bottom row is the substitution to be made for each character position.
For example, “KT127-3” would generate a simple pattern “AAnnn-n”. Additionally, the largest and smallest characters seen for each character position are also tracked.
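The substitution step can be sketched as a per-character lookup. This is a minimal illustration, not PDC's implementation: the names data_pattern and SIGNIFICANT are hypothetical, and the set of “significant” characters copied through unchanged (slash, dash, underscore) is an assumption based on the examples above.

```python
# Lookup classes, following the example rows above (not exhaustive).
LOWER = "abcdefghijklmnopqrstuvwxyz"
UPPER = "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
DIGITS = "0123456789"
WHITESPACE = " \t"
SYMBOLS = '|"!£$%^&*()+=[]{}@#~;:,.?¬¥§¢'
# Assumed set of "significant" characters that are output as-is.
SIGNIFICANT = "/-_"

# Build the character -> substitution table once.
_TABLE = {}
for chars, code in (
    (LOWER, "a"),
    (UPPER, "A"),
    (DIGITS, "n"),
    (WHITESPACE, "w"),
    (SYMBOLS, "s"),
):
    for ch in chars:
        _TABLE[ch] = code
for ch in SIGNIFICANT:
    _TABLE[ch] = ch


def data_pattern(value: str) -> str:
    """Return the data pattern for a value, one output character per input character."""
    # Characters missing from the table fall into the "other" class, shown as "-".
    return "".join(_TABLE.get(ch, "-") for ch in value)
```

With this table, data_pattern("KT127-3") returns "AAnnn-n", matching the example above.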
Consider a set of tracking numbers and the associated pattern for each:
Tracking number    Pattern
KT17341            AAnnnnn
KL91632            AAnnnnn
KW81234            AAnnnnn
KW91020            AAnnnnn
KA002021           AAnnnnnn
AAnnnnn – Occurs 4 times
KL11020 – Lowest character seen in each position
KW97644 – Highest character seen in each position
AAnnnnnn – Occurs 1 time
KA002021 – Lowest character seen in each position
KA002021 – Highest character seen in each position
The top ~20 data patterns are captured and stored for subsequent use by data-quality and other processes as needed.
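The grouping of the tracking numbers by pattern, together with keeping only the most frequent patterns, can be sketched as follows. This is an illustration under stated assumptions: profile and its top_n parameter are hypothetical names, and data_pattern here is a simplified classifier that handles only ASCII letters, digits, and whitespace and copies every other character through unchanged.

```python
import string
from collections import Counter


def data_pattern(value: str) -> str:
    """Simplified pattern generator (assumption: other characters pass through)."""
    out = []
    for ch in value:
        if ch in string.ascii_lowercase:
            out.append("a")
        elif ch in string.ascii_uppercase:
            out.append("A")
        elif ch in string.digits:
            out.append("n")
        elif ch in " \t":
            out.append("w")
        else:
            out.append(ch)
    return "".join(out)


def profile(values, top_n=20):
    """Count each pattern and track the lowest/highest character per position."""
    counts = Counter()
    lowest, highest = {}, {}
    for value in values:
        pattern = data_pattern(value)
        counts[pattern] += 1
        if pattern not in lowest:
            lowest[pattern] = list(value)
            highest[pattern] = list(value)
        else:
            # A pattern has the same length as its value, so zip lines up positions.
            lowest[pattern] = [min(a, b) for a, b in zip(lowest[pattern], value)]
            highest[pattern] = [max(a, b) for a, b in zip(highest[pattern], value)]
    return [
        {
            "pattern": p,
            "count": c,
            "lowest": "".join(lowest[p]),
            "highest": "".join(highest[p]),
        }
        for p, c in counts.most_common(top_n)  # keep only the most frequent patterns
    ]


TRACKING_NUMBERS = ["KT17341", "KL91632", "KW81234", "KW91020", "KA002021"]

for row in profile(TRACKING_NUMBERS):
    print(row)
```

Characters are compared by code point, which for ASCII letters and digits gives the expected alphabetical and numeric ordering.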
