Personal Data Identifier Dictionary (GDPR Compliance)
This comprehensive workshop demonstrates how to create data dictionaries that not only improve data discovery but also ensure regulatory compliance and support robust data governance frameworks.
Using Microsoft's AdventureWorks2022 database as our foundation, participants will learn to build data dictionaries that meet the stringent requirements of GDPR, SOX, and other regulatory frameworks while establishing best practices for ongoing metadata management.
By the end of this workshop, you will be able to:
Create comprehensive data dictionaries that support regulatory compliance requirements
Understand the relationship between data dictionaries, data lineage, and metadata management
Implement data classification schemes that meet industry-specific regulatory standards
Establish processes for dictionary maintenance that ensure ongoing compliance
Integrate dictionaries with broader data governance and security frameworks
Design audit trails and documentation that satisfy regulatory requirements
Why Regulatory Compliance Matters for Data Dictionaries
Organizations must comply with data privacy regulations that are projected to cover the personal data of 75% of the world's population by the end of 2024, making data classification through dictionaries a critical compliance tool. Data dictionaries serve as the foundation for:
Data Classification: Visual labeling, metadata application, and automated data discovery to meet compliance requirements
Regulatory Reporting: Providing audit trails for data protection impact assessments (DPIAs)
Risk Management: Guarding against accidental data loss and enabling early detection of risky user behavior
Business Purpose: GDPR requires explicit identification and special handling of personal data. This dictionary automatically classifies columns containing personal identifiers to ensure proper data handling and support data subject rights.
The process follows a structured, four-phase approach that progressively builds capabilities from foundational pattern-based detection through to enterprise-wide privacy operations integration.
Phase 1 establishes the critical foundation by implementing dictionary-based identification using metadata hints, pattern matching, and automated tagging within structured databases. This initial phase enables organizations to quickly identify and classify the majority of obvious PII while building the governance framework, processes, and documentation required for regulatory compliance. Upon completion of Phase 1, organizations should plan for
Phase 2 (content inspection and sampling),
Phase 3 (dataflow mapping and lineage tracking), and
Phase 4 (full integration with privacy operations), which progressively enhance detection accuracy, expand coverage to unstructured data, and integrate PII management into broader enterprise systems.
Phase 1: Pattern-Based Discovery (This Workshop)
Structured database PII identification
Implements dictionary-based identification using column name patterns, regex matching, and metadata hints. Focuses on structured data within relational databases. Establishes foundational governance processes, documentation standards, and automated tagging mechanisms.
• Data dictionaries for all PII categories
• Automated tagging rules
• GDPR compliance documentation
• Initial data inventory
• Governance procedures
Phase 2: Content Inspection (Future)
Actual data value analysis
Expands beyond metadata to inspect actual data values using sampling, statistical analysis, and machine learning. Detects PII in free-text fields, comments, and unstructured content. Implements Named Entity Recognition (NER) for contextual PII identification.
• Content inspection rules
• ML-based classifiers
• False positive reduction
• Unstructured data coverage
• Enhanced accuracy metrics
Phase 3: Data Flow Mapping (Future)
End-to-end lineage tracking
Maps PII movement through ETL pipelines, APIs, reports, exports, and data integrations. Identifies downstream systems receiving PII. Tracks data transformations, aggregations, and derivations to understand complete data lifecycle.
• Complete data lineage maps
• API/interface PII exposure analysis
• ETL pipeline documentation
• Report/dashboard PII tracking
• Cross-system impact analysis
Phase 4: Privacy Operations Integration (Future)
Enterprise-wide privacy ecosystem
Integrates PII identification with data masking, access controls, consent management, and data subject request fulfillment. Implements automated breach notification scope assessment, retention policy enforcement, and continuous compliance monitoring.
• Integrated masking policies
• Automated DSR fulfillment
• Consent tracking integration
• Breach assessment automation
• Real-time compliance dashboards
GDPR - PII Dictionary
This workshop comprehensively covers Phase 1, providing all necessary resources, templates, procedures, and documentation to successfully implement pattern-based PII discovery.
Database Inventory
Objective: Create a complete inventory of all database schemas and tables to ensure comprehensive coverage.
Why This Matters: GDPR compliance requires documenting all data sources within the organization. This inventory forms the foundation for regulatory compliance.
Schema Analysis
The first task in our journey is to conduct a comprehensive database inventory for compliance.
Data governance requires a complete inventory of all data assets to ensure comprehensive coverage. This inventory forms the foundation for regulatory compliance by documenting all data sources within the organization.
For Phase 1 we're going to take the traditional route and run some SQL scripts - later we'll take a look at a 'Project' that leverages ML.
Why This Matters: This enhanced inventory not only catalogs schemas but automatically identifies PII exposure, calculates risk scores, and prioritizes compliance efforts based on actual data sensitivity.
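The workshop's inventory script is not reproduced on this page. As a hedged sketch of the approach, a query over SQL Server's catalog views can count tables and pattern-matched PII columns per schema; the LIKE patterns and risk buckets below are illustrative assumptions, not the workshop's exact script:

```sql
-- Illustrative sketch only: per-schema inventory with pattern-based PII counts.
-- The name patterns and risk buckets are assumptions, not the workshop script.
SELECT
    s.name AS schema_name,
    COUNT(DISTINCT t.name) AS table_count,
    SUM(CASE WHEN c.name LIKE '%Email%'
              OR c.name LIKE '%Phone%'
              OR c.name LIKE '%Password%'
              OR c.name LIKE '%CardNumber%' THEN 1 ELSE 0 END) AS high_risk_pii_columns,
    SUM(CASE WHEN c.name LIKE '%FirstName%'
              OR c.name LIKE '%LastName%'
              OR c.name LIKE '%Address%'
              OR c.name LIKE '%BirthDate%' THEN 1 ELSE 0 END) AS medium_risk_pii_columns
FROM sys.schemas s
JOIN sys.tables  t ON t.schema_id = s.schema_id
JOIN sys.columns c ON c.object_id = t.object_id
GROUP BY s.name
ORDER BY high_risk_pii_columns DESC;
```

Running a query of this shape in DBeaver against AdventureWorks2022 yields one row per schema, which the real script then enriches with row counts, risk scores, and priorities.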

| Schema | Tables | Est. Rows | Size (MB) | High-Risk PII Columns | Medium-Risk PII Columns | PII Risk Score | Compliance Priority | Data Category |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Person | 13 | ~1,755,295 | ~460 | 8 | 13 | 212.86 | CRITICAL | Personal Data |
| HumanResources | 6 | ~8,179 | ~6.5 | 1 | 1 | 52.5 | CRITICAL | Employee Data |
| Sales | 19 | ~2,658,063 | ~287 | 4 | 2 | 36.5 | HIGH | Transaction Data |
| Purchasing | 5 | ~155,373 | ~16 | 1 | 0 | 20.41 | HIGH | Operational Data |
| Production | 25 | ~3,417,306 | ~251 | 1 | 1 | 10.06 | MEDIUM | Operational Data |
| dbo | 3 | ~38,308 | ~51 | 0 | 0 | 0.00 | LOW | System Data |
Understanding the Output
high_risk_pii_columns: Count of columns containing emails, phones, SSN, passwords, credit cards
medium_risk_pii_columns: Count of columns with names, addresses, birth dates, postal codes
low_risk_pii_columns: Count of columns with titles, suffixes, gender, marital status
pii_risk_score: Weighted calculation (High×10 + Medium×5 + Low×2) ÷ Total Columns × 100
compliance_priority: Automatic prioritization using the following thresholds:
CRITICAL: 3+ high-risk columns OR risk score ≥ 40
HIGH: Any high-risk columns OR risk score ≥ 15
MEDIUM: Any medium-risk columns OR risk score ≥ 5
LOW: Only low-risk columns OR risk score < 5
data_category: Automatic classification based on schema naming patterns
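Expressed in T-SQL, the score formula and priority thresholds above might look like the following sketch. The sample counts for Person (2 low-risk columns, 70 total columns) are assumptions chosen only to illustrate the arithmetic:

```sql
-- Illustrative sketch: pii_risk_score and compliance_priority thresholds.
-- low_cnt = 2 and total_cols = 70 are assumed sample values for Person.
WITH counts AS (
    SELECT * FROM (VALUES ('Person', 8, 13, 2, 70))
        AS v(schema_name, high_cnt, med_cnt, low_cnt, total_cols)
)
SELECT
    schema_name,
    CAST((high_cnt * 10 + med_cnt * 5 + low_cnt * 2) * 100.0
         / total_cols AS DECIMAL(10, 2)) AS pii_risk_score,
    CASE
        WHEN high_cnt >= 3
          OR (high_cnt * 10 + med_cnt * 5 + low_cnt * 2) * 100.0 / total_cols >= 40 THEN 'CRITICAL'
        WHEN high_cnt >= 1
          OR (high_cnt * 10 + med_cnt * 5 + low_cnt * 2) * 100.0 / total_cols >= 15 THEN 'HIGH'
        WHEN med_cnt >= 1
          OR (high_cnt * 10 + med_cnt * 5 + low_cnt * 2) * 100.0 / total_cols >= 5  THEN 'MEDIUM'
        ELSE 'LOW'
    END AS compliance_priority
FROM counts;
-- With these sample counts: (8*10 + 13*5 + 2*2) * 100 / 70 = 212.86 -> CRITICAL
```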
Key Insights
Person schema is CRITICAL priority with a PII risk score of 212.86 - contains 8 high-risk and 13 medium-risk PII columns spanning 1.7M+ records. This should be your primary focus for dictionary creation.
HumanResources schema also requires immediate CRITICAL attention with a risk score of 52.5 due to employee data sensitivity.
Sales and Purchasing schemas are HIGH priority containing customer contact information and vendor data.
Production schema has MEDIUM priority with minimal PII exposure (1 high, 1 medium risk column).
dbo schema shows LOW priority with no PII columns detected - likely contains system/configuration data.
Deliverable:
Export this result set to Excel for reference throughout the project
Use the compliance_priority column to sequence your dictionary creation work
Document the pii_risk_score baseline for quarterly tracking of new PII exposure
In DBeaver run the following script:
Column Analysis
Data classification is fundamental to compliance with GDPR, HIPAA, and other regulations that require special handling of sensitive data categories. This automated classification helps identify data elements that require enhanced protection and specific dictionary treatment.
Compliance Impact: This analysis identifies more than 45 potentially sensitive columns across the database, including:
HIGH Sensitivity: Email addresses, phone numbers, credit card information
MEDIUM Sensitivity: Names, addresses, demographic data
LOW Sensitivity: Product codes, business identifiers
Why This Matters: This automated approach ensures comprehensive coverage and reduces the manual review time that column-by-column classification would otherwise require.
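The column analysis script referenced here is not reproduced on this page; a hedged sketch of the approach classifies each column by name pattern against the catalog views. The specific patterns, including the context-aware PII_LOW rule for Title/Suffix, are illustrative assumptions:

```sql
-- Illustrative sketch: column-level sensitivity classification by name pattern.
-- Patterns are coarse; expect false positives (e.g. product Name columns)
-- that Phase 2 content inspection is designed to resolve.
SELECT cls.*
FROM (
    SELECT
        s.name  AS schema_name,
        t.name  AS table_name,
        c.name  AS column_name,
        ty.name AS data_type,
        c.max_length,
        c.is_nullable,
        CASE
            WHEN c.name LIKE '%Password%' OR c.name LIKE '%SSN%'       THEN 'PII_CRITICAL'
            WHEN c.name LIKE '%Email%'    OR c.name LIKE '%BirthDate%' THEN 'PII_HIGH'
            WHEN c.name LIKE '%Name%'     OR c.name LIKE '%Phone%'     THEN 'PII_MEDIUM'
            WHEN c.name IN ('Title', 'Suffix')
                 AND s.name IN ('Person', 'HumanResources')            THEN 'PII_LOW'
        END AS sensitivity_level
    FROM sys.columns c
    JOIN sys.tables  t  ON t.object_id = c.object_id
    JOIN sys.schemas s  ON s.schema_id = t.schema_id
    JOIN sys.types   ty ON ty.user_type_id = c.user_type_id
) cls
WHERE cls.sensitivity_level IS NOT NULL
ORDER BY cls.schema_name, cls.table_name, cls.column_name;
```

The schema filter on the PII_LOW branch shows how context-aware detection keeps Production's product titles out of the results.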

Expected Results
More than 45 sensitive columns are identified across AdventureWorks2022 with automated risk assessment:
| Schema | Sensitivity Level | Columns | Example Columns | Risk Impact Score | Dictionary Priority |
| --- | --- | --- | --- | --- | --- |
| Person | PII_HIGH | 8 | EmailAddress, BirthDate, AddressLine1 | 150-250 | URGENT/HIGH |
| Person | PII_MEDIUM | 15 | FirstName, LastName, PhoneNumber, City | 80-150 | HIGH/MEDIUM |
| Person | PII_LOW | 3 | Title, Suffix | 30-50 | LOW |
| HumanResources | PII_MEDIUM | 5 | JobTitle, HireDate | 40-80 | MEDIUM |
| Sales | PII_MEDIUM | 8 | CustomerID (string), SalesPersonID | 60-120 | MEDIUM |
| Sales | PII_HIGH | 2 | CreditCardApprovalCode | 100-150 | HIGH |
Understanding the Output
schema_name, table_name, column_name: Identifies the exact location of sensitive data
data_type: SQL data type (helps validate PII detection accuracy)
max_length: Maximum field length (context for string data)
is_nullable: Whether column allows NULL values (data quality indicator)
sensitivity_level: Risk classification
PII_CRITICAL: Passwords, SSN, credit cards, medical records - requires immediate protection
PII_HIGH: Email, birth dates, full addresses, biometric data - direct identifiers
PII_MEDIUM: Names, phone numbers, city, salary - indirect identifiers when combined
PII_LOW: Titles, country, language - minimal risk, still personal data
data_category: Functional grouping for dictionary organization
contact_info - Email, phone, fax
personal_name - First, last, middle names
location_data - Addresses, GPS coordinates
sensitive_personal - Health, biometric, ethnicity (GDPR Article 9)
financial_data - Credit cards, salary, tax information
authentication - Passwords, login credentials
government_id - SSN, passport, driver's license
personal_title - Mr., Dr., Jr.
identifier - Employee ID, Customer ID
demographic - Gender, marital status, language
gdpr_article: Maps to GDPR compliance requirements
Art_9_Special_Categories - Requires explicit consent and heightened protection
Art_32_Security - Mandates technical security measures
Art_4_1_Personal_Data - General personal data definition
recommended_retention: Suggested data lifecycle
Active_Account_Only - Delete immediately when account closes
7_years - Standard business/financial records retention
10_years - Medical/health records retention
Indefinite - Core identity data with legitimate ongoing need
5_years - Default retention for other personal data
table_row_count: Number of records in the table (impact assessment)
risk_impact_score: Calculated metric combining sensitivity × data volume
Formula: Sensitivity weight (10-100) × LOG10(row_count + 1)
Higher scores indicate both sensitive data AND high volume
Used to prioritize remediation efforts
dictionary_priority: Action urgency for dictionary creation
IMMEDIATE: PII_CRITICAL with >10K rows - start today
URGENT: PII_CRITICAL or PII_HIGH with >50K rows - start this week
HIGH: PII_HIGH or PII_MEDIUM with >100K rows - start this month
MEDIUM: PII_MEDIUM standard volume - quarterly cycle
LOW: PII_LOW - routine monitoring
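The risk_impact_score formula above can be sketched in T-SQL as follows. The sensitivity weights (100/50/25/10) and the sample row counts are assumptions for illustration; the document only states that weights range from 10 to 100:

```sql
-- Illustrative sketch: risk_impact_score = sensitivity weight * LOG10(rows + 1).
-- The weights and sample row counts below are assumed, not the workshop's values.
WITH sample AS (
    SELECT * FROM (VALUES ('PII_HIGH', 19000), ('PII_MEDIUM', 121000))
        AS v(sensitivity_level, table_row_count)
)
SELECT
    sensitivity_level,
    table_row_count,
    CASE sensitivity_level
        WHEN 'PII_CRITICAL' THEN 100
        WHEN 'PII_HIGH'     THEN 50
        WHEN 'PII_MEDIUM'   THEN 25
        ELSE 10
    END * LOG10(table_row_count + 1) AS risk_impact_score
FROM sample;
```

Because LOG10 grows slowly, the score rewards sensitivity far more than raw volume, which is why a 19K-row email table can outrank a million-row operational table.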
Key Insights
Person.EmailAddress emerges as URGENT priority with PII_HIGH classification and ~19,000 records, yielding a risk impact score of approximately 206. This should be the first dictionary created.
Name columns (FirstName, LastName, MiddleName) across Person schema total 15+ columns classified as PII_MEDIUM with similar high volume, making them HIGH priority for consolidated dictionary treatment.
HumanResources schema contains employee salary and hire date information classified as PII_MEDIUM, requiring separate dictionary handling due to different legal retention requirements (7 years vs indefinite).
No PII_CRITICAL columns detected in AdventureWorks2022, which is expected for a sample database. Production databases typically contain passwords (hashed), credit card numbers, or SSN requiring immediate protection.
Contact information dominates with 23+ columns across email, phone, and address categories - these can be grouped into 3 related dictionaries: Contact Info, Location Data, and Personal Names.
Sales schema shows CustomerID as string-based (varchar), flagged as PII_MEDIUM since string IDs could be email addresses or usernames. Numeric customer IDs would not be flagged.
Title column only flagged in Person/HumanResources schemas due to context-aware detection - the same column name in Production schema (product titles) correctly excluded.
Quality Validation Checklist:
This detailed analysis provides the foundation for creating targeted, effective data dictionaries in Stage 2.
Deliverable
Export Results to CSV:
Right-click query results → Export → CSV format
Save as sensitive_columns_detail_[DATE].csv
This becomes your master PII inventory document
Create Prioritized Action Plan:
Document False Positives/Negatives:
Review results for incorrectly classified columns
Document true PII columns missed by pattern matching
Note any context-specific exceptions
Feed findings back into pattern refinement
Create Data Category Summary:
Prepare for Stage 1.3 (Data Lineage Analysis):
Use
table_namecolumn to identify high-impact tablesTables with 5+ PII columns require detailed lineage mapping
Focus on Person.Person, Person.Address, Person.EmailAddress
Generate Executive Summary:
Data lineage is crucial for compliance auditing and impact analysis, helping organizations understand how changes to data affect downstream processes. This relationship analysis forms the foundation for creating dictionaries that understand data dependencies.
In DBeaver run the following script:
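The lineage script itself is not shown on this page. One hedged sketch of the relationship analysis counts inbound foreign-key references per table via SQL Server's catalog views (this counts FK constraints, which approximates but is not identical to distinct dependent tables):

```sql
-- Illustrative sketch: tables ranked by how many foreign keys reference them.
-- COUNT(fk.object_id) tallies FK constraints, an approximation of
-- "dependent tables" (a table referencing the same target twice counts twice).
SELECT
    s.name AS schema_name,
    t.name AS table_name,
    COUNT(fk.object_id) AS dependent_table_count
FROM sys.tables t
JOIN sys.schemas s ON s.schema_id = t.schema_id
LEFT JOIN sys.foreign_keys fk ON fk.referenced_object_id = t.object_id
GROUP BY s.name, t.name
ORDER BY dependent_table_count DESC;
```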
Expected Results: Key tables with high dependencies include:
Person.BusinessEntity (15+ dependent tables)
Person.Person (12+ dependent tables)
Production.Product (8+ dependent tables)
These high-dependency tables require special attention in dictionary design as changes affect multiple business processes.
GDPR requires organizations to identify all personal data and understand its usage context for data protection impact assessments. The frequency analysis helps prioritize dictionary terms based on actual business usage.
In DBeaver run the following script:
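The frequency-analysis script is not reproduced here; a hedged sketch counts how often each personal-data column name recurs across schemas, a rough proxy for how widely a dictionary term is used in the business (the name patterns are illustrative assumptions):

```sql
-- Illustrative sketch: recurrence of personal-data column names across schemas,
-- used to prioritize dictionary terms by breadth of business usage.
SELECT
    c.name AS column_name,
    COUNT(*) AS occurrence_count,
    COUNT(DISTINCT s.name) AS schema_count
FROM sys.columns c
JOIN sys.tables  t ON t.object_id = c.object_id
JOIN sys.schemas s ON s.schema_id = t.schema_id
WHERE c.name LIKE '%Name%'  OR c.name LIKE '%Email%'
   OR c.name LIKE '%Phone%' OR c.name LIKE '%Address%'
GROUP BY c.name
ORDER BY occurrence_count DESC;
```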
Why this format: Each column serves a specific compliance purpose:
compliance_classification: Supports automated data handling rules
gdpr_article: Links to specific GDPR requirements for audit trails
data_subject_rights: Defines which rights apply to each data type
retention_period: Supports automated data lifecycle management
business_context: Required for data protection impact assessments (DPIAs)
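The dictionary CSV itself is not reproduced on this page. Rows of roughly the following shape would carry the columns described above; every value here is illustrative, not the workshop's actual file:

```csv
value,compliance_classification,gdpr_article,data_subject_rights,retention_period,business_context
EmailAddress,PII_HIGH,Art_4_1_Personal_Data,"access,rectification,erasure,portability",Active_Account_Only,Customer contact channel
FirstName,PII_MEDIUM,Art_4_1_Personal_Data,"access,rectification,erasure",5_years,Customer identification
Title,PII_LOW,Art_4_1_Personal_Data,"access,rectification",5_years,Salutation in correspondence
```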
GDPR compliance requires detailed documentation of data processing activities, legal bases, and data subject rights. This JSON structure provides all necessary metadata for automated compliance monitoring and reporting.
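The JSON definition is not reproduced on this page. A structure along these lines could hold that metadata; the field names are illustrative assumptions, not the exact Pentaho Data Catalog schema, so treat this only as a sketch of the kind of information involved:

```json
{
  "name": "Personal Data Identifiers - GDPR Compliant",
  "category": "GDPR_Compliance",
  "similarityScore": 0.7,
  "metadataHints": [
    { "pattern": "[Ee]mail|[Ee]mail[Aa]ddress|contact_email", "score": 0.9 },
    { "pattern": "[Pp]hone|[Pp]hone[Nn]umber|[Tt]elephone", "score": 0.8 }
  ],
  "actions": {
    "columnTags": ["PII", "GDPR_Personal_Data", "Sensitive"],
    "tableTags": ["Contains_Personal_Data"]
  }
}
```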
Personal Data Identifiers Dictionary
This method demonstrates creating dictionaries entirely through the Pentaho Data Catalog interface by uploading CSV files and configuring JSON definitions through the UI.
The name and category are mandatory fields that help organize dictionaries for data governance. Categories assist in data discovery but are not related to Business Glossary categories.
Navigate to: Data Operations > Dictionaries
Click: Add Dictionary
Configure Basic Information:
Name: Personal Data Identifiers - GDPR Compliant
Description: Dictionary for identifying personal data elements subject to GDPR protection
Category: Select Sensitive or create new category GDPR_Compliance
Dictionary Status: Ensure it's enabled (toggle switch)
Select Upload Dictionary method
Upload CSV File: Upload personal_data_identifiers.csv
Set Confidence Score: 0.7 (70% confidence for similarity matching)
The confidence score represents how certain the system should be when matching data. For personal data, 0.7 provides a balance between catching sensitive data and avoiding false positives.

Set the following Column Name Regex - Metadata Hints.
Overall Column Name Confidence Score: 0.4
[Ff]irst[Nn]ame|[Ff]Name|given_name → 0.8
[Ll]ast[Nn]ame|[Ss]urname|family_name → 0.8
[Ee]mail|[Ee]mail[Aa]ddress|contact_email → 0.9
[Pp]hone|[Pp]hone[Nn]umber|[Tt]elephone → 0.8
Regex patterns serve as metadata hints that help identify columns by name patterns. The confidence score formula is:
Confidence score = (Similarity × weightage) + (Metadata hint score × weightage).
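As a worked example, assume the weightages are complementary (similarity 0.6, metadata 0.4, matching the 0.4 overall column-name score above — an assumption, since the workshop does not state the similarity weight). A column matching the email regex (hint score 0.9) with 0.75 content similarity would then score:

```
confidence = (0.75 × 0.6) + (0.9 × 0.4)
           = 0.45 + 0.36
           = 0.81
```

Under these assumed weights the column clears the 0.6 confidence-score condition configured below.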

Step 4: Configure Conditions
Create conditions that determine when to apply the dictionary:
Primary Condition:
Attribute: Confidence Score
Operator: Greater than or equal to
Value: 0.6
Secondary Condition (OR):
Attribute: Metadata Score
Operator: Greater than or equal to
Value: 0.7
Additional Condition (AND):
Attribute: Column Cardinality
Operator: Greater than
Value: 5 (ensures the column has enough data variety)
Why these conditions: This creates a flexible matching system where data can be classified either by content similarity or column name patterns, ensuring comprehensive coverage.
Step 5: Configure Actions
Define what happens when conditions are met:
Add Action - Assign Tags:
Tag 1: PII
Tag 2: GDPR_Personal_Data
Tag 3: Sensitive
Add Action - Assign Table Tags:
Table Tag: Contains_Personal_Data
Add Action - Assign Business Terms:
Browse and select relevant business terms from your glossary
Examples: Personal Identifier, Customer Data, Privacy Protected
Step 6: Create Dictionary
Click Create Dictionary to finalize the setup.
This method uses Pentaho Data Catalog's import functionality to upload ZIP files containing CSV and JSON dictionary definitions.
Pentaho Data Catalog requires dictionaries to be imported as ZIP files containing both CSV data and JSON configuration to minimize the risk of errors due to manual input.
Create the following Dictionary folders:
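The exact folder list is not shown on this page. A layout along these lines (names illustrative, modeled on the personal_data_identifiers.csv used in Method 1) pairs each dictionary's CSV with its JSON definition:

```
dictionaries/
└── personal_data_identifiers/
    ├── personal_data_identifiers.csv    <- dictionary term values
    └── personal_data_identifiers.json   <- matching rules, tags, actions
```

Each folder's contents are then zipped into one archive per dictionary for the import step below.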
Navigate to Dictionaries:
In the left navigation menu, click Data Operations
On the Data Identification Methods card, click Dictionaries
Import Process:
Click Import button on the Dictionaries page
Upload each ZIP file containing the dictionary CSV and JSON
Click Continue to start the upload process
Monitor progress using View Workers to track the import
Validation:
Verify each dictionary appears in the Dictionaries list
Check the Rules tab to review the JSON configuration logic
Confirm categories and tags are properly assigned