Personal Data Identifier

This comprehensive workshop demonstrates how to create data dictionaries that not only improve data discovery but also ensure regulatory compliance and support robust data governance frameworks.

Using Microsoft's AdventureWorks2022 database as our foundation, participants will learn to build data dictionaries that meet the stringent requirements of GDPR, SOX, and other regulatory frameworks while establishing best practices for ongoing metadata management.

By the end of this workshop, you will be able to:

Create comprehensive data dictionaries that support regulatory compliance requirements
Understand the relationship between data dictionaries, data lineage, and metadata management
Implement data classification schemes that meet industry-specific regulatory standards
Establish processes for dictionary maintenance that ensure ongoing compliance
Integrate dictionaries with broader data governance and security frameworks
Design audit trails and documentation that satisfy regulatory requirements

Why does this matter?

Why Regulatory Compliance Matters for Data Dictionaries

Organizations must comply with data privacy regulations that cover 75% of the world's population by the end of 2024, making data classification through dictionaries a critical compliance tool. Data dictionaries serve as the foundation for:

Data Classification: Visual labeling, metadata application, and automated data discovery to meet compliance requirements
Regulatory Reporting: Providing audit trails for data protection impact assessments (DPIAs)
Risk Management: Guarding against accidental data loss and enabling early detection of risky user behavior

Business Purpose: GDPR requires explicit identification and special handling of personal data. This dictionary automatically classifies columns containing personal identifiers to ensure proper data handling and support data subject rights.

Schema Analysis

The first tasks in our journey was to conduct a comprehensive Database Inventory for compliance.

Data governance requires a complete inventory of all data assets to ensure comprehensive coverage. This inventory forms the foundation for regulatory compliance by documenting all data sources within the organization.

We're going to

In DBeaver run the following script:

-- Comprehensive schema analysis for compliance documentation
SELECT 
    s.name AS schema_name,
    COUNT(t.name) AS table_count,
    STRING_AGG(t.name, ', ') AS table_list
FROM sys.schemas s
LEFT JOIN sys.tables t ON s.schema_id = t.schema_id
WHERE s.name NOT IN ('sys', 'INFORMATION_SCHEMA')
GROUP BY s.name
ORDER BY table_count DESC;

Expected Results: You'll see AdventureWorks2022 contains 5 main schemas:

Person (12 tables): Personal identifiers, addresses, contact information
Sales (24 tables): Transaction data, customer information, financial records
Production (27 tables): Product information, manufacturing data
HumanResources (6 tables): Employee data, payroll information
Purchasing (8 tables): Vendor information, procurement records

Data lineage is crucial for compliance auditing and impact analysis, helping organizations understand how changes to data affect downstream processes. This relationship analysis forms the foundation for creating dictionaries that understand data dependencies.

In DBeaver

-- Analyze relationships and dependencies for lineage mapping
WITH TableRelationships AS (
    SELECT 
        OBJECT_SCHEMA_NAME(fk.parent_object_id) AS child_schema,
        OBJECT_NAME(fk.parent_object_id) AS child_table,
        COL_NAME(fkc.parent_object_id, fkc.parent_column_id) AS child_column,
        OBJECT_SCHEMA_NAME(fk.referenced_object_id) AS parent_schema,
        OBJECT_NAME(fk.referenced_object_id) AS parent_table,
        COL_NAME(fkc.referenced_object_id, fkc.referenced_column_id) AS parent_column,
        fk.name AS constraint_name
    FROM sys.foreign_keys fk
    JOIN sys.foreign_key_columns fkc ON fk.object_id = fkc.constraint_object_id
)
SELECT 
    parent_schema + '.' + parent_table AS source_table,
    COUNT(DISTINCT child_schema + '.' + child_table) AS dependent_tables,
    STRING_AGG(child_schema + '.' + child_table, '; ') AS dependencies
FROM TableRelationships
GROUP BY parent_schema, parent_table
HAVING COUNT(DISTINCT child_schema + '.' + child_table) > 2
ORDER BY dependent_tables DESC;

Expected Results: Key tables with high dependencies include:

Person.BusinessEntity (15+ dependent tables)
Person.Person (12+ dependent tables)
Production.Product (8+ dependent tables)

These high-dependency tables require special attention in dictionary design as changes affect multiple business processes.

GDPR requires organizations to identify all personal data and understand its usage context for data protection impact assessments. The frequency analysis helps prioritize dictionary terms based on actual business usage.

In DBeaver run the following script:

-- Extract personal identifiers with business context and usage frequency
WITH PersonalDataElements AS (
    SELECT DISTINCT 
        p.Title,
        p.FirstName,
        p.MiddleName,
        p.LastName,
        p.Suffix,
        pa.AddressLine1,
        pa.AddressLine2,
        pa.City,
        sp.StateProvinceCode,
        sp.Name as StateProvinceName,
        cr.CountryRegionCode,
        cr.Name as CountryRegionName,
        ea.EmailAddress,
        pp.PhoneNumber,
        pnt.Name as PhoneNumberType
    FROM Person.Person p
    LEFT JOIN Person.BusinessEntityAddress bea ON p.BusinessEntityID = bea.BusinessEntityID
    LEFT JOIN Person.Address pa ON bea.AddressID = pa.AddressID
    LEFT JOIN Person.StateProvince sp ON pa.StateProvinceID = sp.StateProvinceID
    LEFT JOIN Person.CountryRegion cr ON sp.CountryRegionCode = cr.CountryRegionCode
    LEFT JOIN Person.EmailAddress ea ON p.BusinessEntityID = ea.BusinessEntityID
    LEFT JOIN Person.PersonPhone pp ON p.BusinessEntityID = pp.BusinessEntityID
    LEFT JOIN Person.PhoneNumberType pnt ON pp.PhoneNumberTypeID = pnt.PhoneNumberTypeID
    WHERE p.PersonType = 'IN' -- Individual customers only for GDPR scope
),
PersonalDataUsage AS (
    -- Analyze how personal data is used across business processes
    SELECT 
        'Email Usage' AS usage_context,
        COUNT(DISTINCT ea.EmailAddress) AS unique_values,
        COUNT(*) AS total_records,
        'Customer communication, account verification' AS business_purpose
    FROM Person.EmailAddress ea
    UNION ALL
    SELECT 
        'Phone Usage' AS usage_context,
        COUNT(DISTINCT pp.PhoneNumber) AS unique_values,
        COUNT(*) AS total_records,
        'Customer support, delivery coordination' AS business_purpose
    FROM Person.PersonPhone pp
    UNION ALL
    SELECT 
        'Address Usage' AS usage_context,
        COUNT(DISTINCT CONCAT(pa.AddressLine1, pa.City)) AS unique_values,
        COUNT(*) AS total_records,
        'Shipping, billing, location services' AS business_purpose
    FROM Person.Address pa
)
SELECT * FROM PersonalDataUsage;

-- Generate personal titles dictionary
SELECT DISTINCT 
    Title AS term,
    'personal_title' AS category,
    'GDPR_PII' AS compliance_tag,
    'Personal title or honorific' AS description,
    COUNT(*) OVER (PARTITION BY Title) AS frequency
FROM Person.Person 
WHERE Title IS NOT NULL
ORDER BY frequency DESC;

Why this format: Each column serves a specific compliance purpose:

compliance_classification: Supports automated data handling rules
gdpr_article: Links to specific GDPR requirements for audit trails
data_subject_rights: Defines which rights apply to each data type
retention_period: Supports automated data lifecycle management
business_context: Required for data protection impact assessments (DPIAs)

term,category,compliance_classification,gdpr_article,data_subject_rights,retention_period,description,business_context
Mr.,personal_title,PII_LOW,Art_4_1,access_rectify_delete,indefinite,Personal title or honorific,Customer identification and communication
Ms.,personal_title,PII_LOW,Art_4_1,access_rectify_delete,indefinite,Personal title or honorific,Customer identification and communication
Mrs.,personal_title,PII_LOW,Art_4_1,access_rectify_delete,indefinite,Personal title or honorific,Customer identification and communication
Dr.,personal_title,PII_LOW,Art_4_1,access_rectify_delete,indefinite,Professional title,Customer identification and communication
Sr.,personal_title,PII_LOW,Art_4_1,access_rectify_delete,indefinite,Senior designation title,Customer identification and communication
Sra.,personal_title,PII_LOW,Art_4_1,access_rectify_delete,indefinite,Spanish female title,Customer identification and communication
email_address,contact_info,PII_HIGH,Art_4_1,access_rectify_delete_portability,7_years,Electronic contact information,Customer communication and account management
phone_number,contact_info,PII_MEDIUM,Art_4_1,access_rectify_delete,7_years,Telephone contact information,Customer support and delivery coordination
home_address,location_data,PII_HIGH,Art_4_1,access_rectify_delete,7_years,Residential address information,Shipping and billing services
work_address,location_data,PII_MEDIUM,Art_4_1,access_rectify_delete,7_years,Business address information,B2B communication and delivery
first_name,personal_name,PII_MEDIUM,Art_4_1,access_rectify_delete,indefinite,Given name of individual,Customer identification and personalization
last_name,personal_name,PII_MEDIUM,Art_4_1,access_rectify_delete,indefinite,Family name of individual,Customer identification and legal documentation
middle_name,personal_name,PII_LOW,Art_4_1,access_rectify_delete,indefinite,Middle name or initial,Complete customer identification
suffix,personal_name,PII_LOW,Art_4_1,access_rectify_delete,indefinite,Name suffix (Jr. Sr. III),Customer identification and family relationships
birth_date,sensitive_personal,PII_HIGH,Art_9,explicit_consent_required,special_category,Date of birth,Age verification and personalized services

GDPR compliance requires detailed documentation of data processing activities, legal bases, and data subject rights. This JSON structure provides all necessary metadata for automated compliance monitoring and reporting.

{
  "name": "Personal Data Identifiers - GDPR Compliant",
  "description": "Dictionary for identifying personal data elements subject to GDPR protection and data subject rights",
  "category": "Data Privacy & Protection",
  "version": "1.0",
  "lastUpdated": "2025-08-06",
  "regulatoryFramework": {
    "primary": "GDPR",
    "articles": ["Art_4_1", "Art_9", "Art_17", "Art_20"],
    "applicability": "EU residents and EEA",
    "dataSubjectRights": ["access", "rectification", "erasure", "portability", "restriction"]
  },
  "caseSensitive": false,
  "matchType": "exact",
  "automatedActions": {
    "highSensitivity": "encrypt_and_log",
    "mediumSensitivity": "log_access", 
    "specialCategory": "explicit_consent_required"
  },
  "auditRequirements": {
    "logAccess": true,
    "logModification": true,
    "retentionTracking": true,
    "consentTracking": true
  },
  "columns": [
    {
      "name": "term",
      "description": "Personal data identifier or value",
      "required": true
    },
    {
      "name": "category",
      "description": "Type of personal data (personal_title, contact_info, location_data, personal_name, sensitive_personal)",
      "required": true
    },
    {
      "name": "compliance_classification",
      "description": "GDPR sensitivity level (PII_LOW, PII_MEDIUM, PII_HIGH)",
      "required": true
    },
    {
      "name": "gdpr_article",
      "description": "Relevant GDPR article reference",
      "required": true
    },
    {
      "name": "data_subject_rights",
      "description": "Applicable data subject rights for this data type",
      "required": true
    },
    {
      "name": "retention_period",
      "description": "Legal retention requirement",
      "required": true
    },
    {
      "name": "description",
      "description": "Business definition of the data element",
      "required": true
    },
    {
      "name": "business_context",
      "description": "Business purpose and usage context for DPIA requirements",
      "required": true
    }
  ],
  "complianceValidation": {
    "requiredFields": ["term", "category", "compliance_classification"],
    "sensitivityLevels": ["PII_LOW", "PII_MEDIUM", "PII_HIGH"],
    "retentionValidation": true
  }
}

This method uses Pentaho Data Catalog's import functionality to upload ZIP files containing CSV and JSON dictionary definitions.

Pentaho Data Catalog requires dictionaries to be imported as ZIP files containing both CSV data and JSON configuration to minimize the risk of errors due to manual input.

Create the following Dictionary folders:

cd

Navigate to Dictionaries:
- In the left navigation menu, click Data Operations
- On the Data Identification Methods card, click Dictionaries
Import Process:
- Click Import button on the Dictionaries page
- Upload each ZIP file containing the dictionary CSV and JSON
- Click Continue to start the upload process
- Monitor progress using View Workers to track the import
Validation:
- Verify each dictionary appears in the Dictionaries list
- Check the Rules tab to review the JSON configuration logic
- Confirm categories and tags are properly assigned

Personal Data Identifiers Dictionary

This method demonstrates creating dictionaries entirely through the Pentaho Data Catalog interface by uploading CSV files and configuring JSON definitions through the UI.

The name and category are mandatory fields that help organize dictionaries for data governance. Categories assist in data discovery but are not related to Business Glossary categories

Navigate to: Data Operations > Dictionaries
Click: Add Dictionary
Configure Basic Information:

Parameter

Value

Name:

Personal Data Identifiers - GDPR Compliant

Description

Dictionary for identifying personal data elements subject to GDPR protection

Personal Data Identifier Dictionary (GDPR Compliance)

Schema Analysis

Personal Data Identifiers Dictionary