Personal Data Identifier
Personal Data Identifier Dictionary (GDPR Compliance)
This comprehensive workshop demonstrates how to create data dictionaries that not only improve data discovery but also ensure regulatory compliance and support robust data governance frameworks.
Using Microsoft's AdventureWorks2022 database as our foundation, participants will learn to build data dictionaries that meet the stringent requirements of GDPR, SOX, and other regulatory frameworks while establishing best practices for ongoing metadata management.
By the end of this workshop, you will be able to:
Create comprehensive data dictionaries that support regulatory compliance requirements
Understand the relationship between data dictionaries, data lineage, and metadata management
Implement data classification schemes that meet industry-specific regulatory standards
Establish processes for dictionary maintenance that ensure ongoing compliance
Integrate dictionaries with broader data governance and security frameworks
Design audit trails and documentation that satisfy regulatory requirements
Why does this matter?
Why Regulatory Compliance Matters for Data Dictionaries
Organizations must comply with data privacy regulations that cover 75% of the world's population by the end of 2024, making data classification through dictionaries a critical compliance tool. Data dictionaries serve as the foundation for:
Data Classification: Visual labeling, metadata application, and automated data discovery to meet compliance requirements
Regulatory Reporting: Providing audit trails for data protection impact assessments (DPIAs)
Risk Management: Guarding against accidental data loss and enabling early detection of risky user behavior
Business Purpose: GDPR requires explicit identification and special handling of personal data. This dictionary automatically classifies columns containing personal identifiers to ensure proper data handling and support data subject rights.
x
x
x
x
Schema Analysis
The first tasks in our journey was to conduct a comprehensive Database Inventory for compliance.
Data governance requires a complete inventory of all data assets to ensure comprehensive coverage. This inventory forms the foundation for regulatory compliance by documenting all data sources within the organization.
We're going to
In DBeaver run the following script:
-- Comprehensive schema analysis for compliance documentation
SELECT
s.name AS schema_name,
COUNT(t.name) AS table_count,
STRING_AGG(t.name, ', ') AS table_list
FROM sys.schemas s
LEFT JOIN sys.tables t ON s.schema_id = t.schema_id
WHERE s.name NOT IN ('sys', 'INFORMATION_SCHEMA')
GROUP BY s.name
ORDER BY table_count DESC;
x
Expected Results: You'll see AdventureWorks2022 contains 5 main schemas:
Person (12 tables): Personal identifiers, addresses, contact information
Sales (24 tables): Transaction data, customer information, financial records
Production (27 tables): Product information, manufacturing data
HumanResources (6 tables): Employee data, payroll information
Purchasing (8 tables): Vendor information, procurement records
x
x
Data classification is fundamental to compliance with GDPR, HIPAA, and other regulations that require special handling of sensitive data categories. This automated classification helps identify data elements that require enhanced protection and specific dictionary treatment.
Compliance Impact: This analysis identifies approximately 45+ potentially sensitive columns across the database, including:
HIGH Sensitivity: Email addresses, phone numbers, credit card information
MEDIUM Sensitivity: Names, addresses, demographic data
LOW Sensitivity: Product codes, business identifiers
x
x
x
x
Data lineage is crucial for compliance auditing and impact analysis, helping organizations understand how changes to data affect downstream processes. This relationship analysis forms the foundation for creating dictionaries that understand data dependencies.
In DBeaver
-- Analyze relationships and dependencies for lineage mapping
WITH TableRelationships AS (
SELECT
OBJECT_SCHEMA_NAME(fk.parent_object_id) AS child_schema,
OBJECT_NAME(fk.parent_object_id) AS child_table,
COL_NAME(fkc.parent_object_id, fkc.parent_column_id) AS child_column,
OBJECT_SCHEMA_NAME(fk.referenced_object_id) AS parent_schema,
OBJECT_NAME(fk.referenced_object_id) AS parent_table,
COL_NAME(fkc.referenced_object_id, fkc.referenced_column_id) AS parent_column,
fk.name AS constraint_name
FROM sys.foreign_keys fk
JOIN sys.foreign_key_columns fkc ON fk.object_id = fkc.constraint_object_id
)
SELECT
parent_schema + '.' + parent_table AS source_table,
COUNT(DISTINCT child_schema + '.' + child_table) AS dependent_tables,
STRING_AGG(child_schema + '.' + child_table, '; ') AS dependencies
FROM TableRelationships
GROUP BY parent_schema, parent_table
HAVING COUNT(DISTINCT child_schema + '.' + child_table) > 2
ORDER BY dependent_tables DESC;
x
Expected Results: Key tables with high dependencies include:
Person.BusinessEntity
(15+ dependent tables)Person.Person
(12+ dependent tables)Production.Product
(8+ dependent tables)
These high-dependency tables require special attention in dictionary design as changes affect multiple business processes.
x
x
x
x
x
x
GDPR requires organizations to identify all personal data and understand its usage context for data protection impact assessments. The frequency analysis helps prioritize dictionary terms based on actual business usage.
In DBeaver run the following script:
-- Extract personal identifiers with business context and usage frequency
WITH PersonalDataElements AS (
SELECT DISTINCT
p.Title,
p.FirstName,
p.MiddleName,
p.LastName,
p.Suffix,
pa.AddressLine1,
pa.AddressLine2,
pa.City,
sp.StateProvinceCode,
sp.Name as StateProvinceName,
cr.CountryRegionCode,
cr.Name as CountryRegionName,
ea.EmailAddress,
pp.PhoneNumber,
pnt.Name as PhoneNumberType
FROM Person.Person p
LEFT JOIN Person.BusinessEntityAddress bea ON p.BusinessEntityID = bea.BusinessEntityID
LEFT JOIN Person.Address pa ON bea.AddressID = pa.AddressID
LEFT JOIN Person.StateProvince sp ON pa.StateProvinceID = sp.StateProvinceID
LEFT JOIN Person.CountryRegion cr ON sp.CountryRegionCode = cr.CountryRegionCode
LEFT JOIN Person.EmailAddress ea ON p.BusinessEntityID = ea.BusinessEntityID
LEFT JOIN Person.PersonPhone pp ON p.BusinessEntityID = pp.BusinessEntityID
LEFT JOIN Person.PhoneNumberType pnt ON pp.PhoneNumberTypeID = pnt.PhoneNumberTypeID
WHERE p.PersonType = 'IN' -- Individual customers only for GDPR scope
),
PersonalDataUsage AS (
-- Analyze how personal data is used across business processes
SELECT
'Email Usage' AS usage_context,
COUNT(DISTINCT ea.EmailAddress) AS unique_values,
COUNT(*) AS total_records,
'Customer communication, account verification' AS business_purpose
FROM Person.EmailAddress ea
UNION ALL
SELECT
'Phone Usage' AS usage_context,
COUNT(DISTINCT pp.PhoneNumber) AS unique_values,
COUNT(*) AS total_records,
'Customer support, delivery coordination' AS business_purpose
FROM Person.PersonPhone pp
UNION ALL
SELECT
'Address Usage' AS usage_context,
COUNT(DISTINCT CONCAT(pa.AddressLine1, pa.City)) AS unique_values,
COUNT(*) AS total_records,
'Shipping, billing, location services' AS business_purpose
FROM Person.Address pa
)
SELECT * FROM PersonalDataUsage;
-- Generate personal titles dictionary
SELECT DISTINCT
Title AS term,
'personal_title' AS category,
'GDPR_PII' AS compliance_tag,
'Personal title or honorific' AS description,
COUNT(*) OVER (PARTITION BY Title) AS frequency
FROM Person.Person
WHERE Title IS NOT NULL
ORDER BY frequency DESC;
x
x
x
x
x
x
Why this format: Each column serves a specific compliance purpose:
compliance_classification: Supports automated data handling rules
gdpr_article: Links to specific GDPR requirements for audit trails
data_subject_rights: Defines which rights apply to each data type
retention_period: Supports automated data lifecycle management
business_context: Required for data protection impact assessments (DPIAs)
x
term,category,compliance_classification,gdpr_article,data_subject_rights,retention_period,description,business_context
Mr.,personal_title,PII_LOW,Art_4_1,access_rectify_delete,indefinite,Personal title or honorific,Customer identification and communication
Ms.,personal_title,PII_LOW,Art_4_1,access_rectify_delete,indefinite,Personal title or honorific,Customer identification and communication
Mrs.,personal_title,PII_LOW,Art_4_1,access_rectify_delete,indefinite,Personal title or honorific,Customer identification and communication
Dr.,personal_title,PII_LOW,Art_4_1,access_rectify_delete,indefinite,Professional title,Customer identification and communication
Sr.,personal_title,PII_LOW,Art_4_1,access_rectify_delete,indefinite,Senior designation title,Customer identification and communication
Sra.,personal_title,PII_LOW,Art_4_1,access_rectify_delete,indefinite,Spanish female title,Customer identification and communication
email_address,contact_info,PII_HIGH,Art_4_1,access_rectify_delete_portability,7_years,Electronic contact information,Customer communication and account management
phone_number,contact_info,PII_MEDIUM,Art_4_1,access_rectify_delete,7_years,Telephone contact information,Customer support and delivery coordination
home_address,location_data,PII_HIGH,Art_4_1,access_rectify_delete,7_years,Residential address information,Shipping and billing services
work_address,location_data,PII_MEDIUM,Art_4_1,access_rectify_delete,7_years,Business address information,B2B communication and delivery
first_name,personal_name,PII_MEDIUM,Art_4_1,access_rectify_delete,indefinite,Given name of individual,Customer identification and personalization
last_name,personal_name,PII_MEDIUM,Art_4_1,access_rectify_delete,indefinite,Family name of individual,Customer identification and legal documentation
middle_name,personal_name,PII_LOW,Art_4_1,access_rectify_delete,indefinite,Middle name or initial,Complete customer identification
suffix,personal_name,PII_LOW,Art_4_1,access_rectify_delete,indefinite,Name suffix (Jr. Sr. III),Customer identification and family relationships
birth_date,sensitive_personal,PII_HIGH,Art_9,explicit_consent_required,special_category,Date of birth,Age verification and personalized services
x
x
x
GDPR compliance requires detailed documentation of data processing activities, legal bases, and data subject rights. This JSON structure provides all necessary metadata for automated compliance monitoring and reporting.
x
{
"name": "Personal Data Identifiers - GDPR Compliant",
"description": "Dictionary for identifying personal data elements subject to GDPR protection and data subject rights",
"category": "Data Privacy & Protection",
"version": "1.0",
"lastUpdated": "2025-08-06",
"regulatoryFramework": {
"primary": "GDPR",
"articles": ["Art_4_1", "Art_9", "Art_17", "Art_20"],
"applicability": "EU residents and EEA",
"dataSubjectRights": ["access", "rectification", "erasure", "portability", "restriction"]
},
"caseSensitive": false,
"matchType": "exact",
"automatedActions": {
"highSensitivity": "encrypt_and_log",
"mediumSensitivity": "log_access",
"specialCategory": "explicit_consent_required"
},
"auditRequirements": {
"logAccess": true,
"logModification": true,
"retentionTracking": true,
"consentTracking": true
},
"columns": [
{
"name": "term",
"description": "Personal data identifier or value",
"required": true
},
{
"name": "category",
"description": "Type of personal data (personal_title, contact_info, location_data, personal_name, sensitive_personal)",
"required": true
},
{
"name": "compliance_classification",
"description": "GDPR sensitivity level (PII_LOW, PII_MEDIUM, PII_HIGH)",
"required": true
},
{
"name": "gdpr_article",
"description": "Relevant GDPR article reference",
"required": true
},
{
"name": "data_subject_rights",
"description": "Applicable data subject rights for this data type",
"required": true
},
{
"name": "retention_period",
"description": "Legal retention requirement",
"required": true
},
{
"name": "description",
"description": "Business definition of the data element",
"required": true
},
{
"name": "business_context",
"description": "Business purpose and usage context for DPIA requirements",
"required": true
}
],
"complianceValidation": {
"requiredFields": ["term", "category", "compliance_classification"],
"sensitivityLevels": ["PII_LOW", "PII_MEDIUM", "PII_HIGH"],
"retentionValidation": true
}
}
x
x
x
x
x
x
x
x
This method uses Pentaho Data Catalog's import functionality to upload ZIP files containing CSV and JSON dictionary definitions.
Pentaho Data Catalog requires dictionaries to be imported as ZIP files containing both CSV data and JSON configuration to minimize the risk of errors due to manual input.
Create the following Dictionary folders:
cd
x
Navigate to Dictionaries:
In the left navigation menu, click Data Operations
On the Data Identification Methods card, click Dictionaries
Import Process:
Click Import button on the Dictionaries page
Upload each ZIP file containing the dictionary CSV and JSON
Click Continue to start the upload process
Monitor progress using View Workers to track the import
Validation:
Verify each dictionary appears in the Dictionaries list
Check the Rules tab to review the JSON configuration logic
Confirm categories and tags are properly assigned
x
Personal Data Identifiers Dictionary
This method demonstrates creating dictionaries entirely through the Pentaho Data Catalog interface by uploading CSV files and configuring JSON definitions through the UI.
The name and category are mandatory fields that help organize dictionaries for data governance. Categories assist in data discovery but are not related to Business Glossary categories
Navigate to: Data Operations > Dictionaries
Click: Add Dictionary
Configure Basic Information:
Name:
Personal Data Identifiers - GDPR Compliant
Description
Dictionary for identifying personal data elements subject to GDPR protection
Category
Select Sensitive
or create new category GDPR_Compliance
Dictionary Status
Ensure it's enabled (toggle switch)
Select Upload Dictionary method
Upload CSV File: Upload
personal_data_identifiers.csv
term
Mr.
Ms.
Mrs.
Dr.
Sr.
Sra.
first_name
last_name
middle_name
email_address
phone_number
birth_date
home_address
work_address
Set Confidence Score:
0.7
(70% confidence for similarity matching)
The confidence score represents how certain the system should be when matching data. For personal data, 0.7 provides a balance between catching sensitive data and avoiding false positives.

Set the following Column Name Regex - Metadata Hints.
Overall Column Name Confidence Score
0.4
[Ff]irst[Nn]ame|[Ff]Name|given_name
0.8
[Ll]ast[Nn]ame|[Ss]urname|family_name
0.8
[Ee]mail|[Ee]mail[Aa]ddress|contact_email
0.9
[Pp]hone|[Pp]hone[Nn]umber|[Tt]elephone
0.8
Regex patterns serve as metadata hints that help identify columns by name patterns. The confidence score formula is:
Confidence score = (Similarity × weightage) + (Metadata hint score × weightage).

Step 4: Configure Conditions Create conditions that determine when to apply the dictionary:
Primary Condition:
Attribute: Confidence Score
Operator: Greater than or equal to
Value:
0.6
Secondary Condition (OR):
Attribute: Metadata Score
Operator: Greater than or equal to
Value:
0.7
Additional Condition (AND):
Attribute: Column Cardinality
Operator: Greater than
Value:
5
(ensures column has enough data variety)
Why these conditions: This creates a flexible matching system where data can be classified either by content similarity or column name patterns, ensuring comprehensive coverage.
Step 5: Configure Actions Define what happens when conditions are met:
Add Action - Assign Tags:
Tag 1:
PII
Tag 2:
GDPR_Personal_Data
Tag 3:
Sensitive
Add Action - Assign Table Tags:
Table Tag:
Contains_Personal_Data
Add Action - Assign Business Terms:
Browse and select relevant business terms from your glossary
Examples:
Personal Identifier
,Customer Data
,Privacy Protected
Step 6: Create Dictionary Click Create Dictionary to finalize the setup.
x
x
x
x
Last updated
Was this helpful?