Personal Data Identifier

Personal Data Identifier Dictionary (GDPR Compliance)

Why does this matter?

Why Regulatory Compliance Matters for Data Dictionaries

By the end of 2024, data privacy regulations were expected to cover 75% of the world's population, making data classification through dictionaries a critical compliance tool. Data dictionaries serve as the foundation for:

  • Data Classification: Visual labeling, metadata application, and automated data discovery to meet compliance requirements

  • Regulatory Reporting: Providing audit trails for data protection impact assessments (DPIAs)

  • Risk Management: Guarding against accidental data loss and enabling early detection of risky user behavior

Business Purpose: GDPR requires explicit identification and special handling of personal data. This dictionary automatically classifies columns containing personal identifiers to ensure proper data handling and support data subject rights.

The process follows a structured, four-phase approach that progressively builds capabilities from foundational pattern-based detection through to enterprise-wide privacy operations integration.

Phase 1 establishes the critical foundation by implementing dictionary-based identification using metadata hints, pattern matching, and automated tagging within structured databases. This initial phase enables organizations to quickly identify and classify the majority of obvious PII while building the governance framework, processes, and documentation required for regulatory compliance. Upon completion of Phase 1, organizations should plan for Phase 2 (content inspection and sampling), Phase 3 (dataflow mapping and lineage tracking), and Phase 4 (full integration with privacy operations), which progressively enhance detection accuracy, expand coverage to unstructured data, and integrate PII management into broader enterprise systems.

Phase 1: Pattern-Based Discovery (This Workshop)

  • Focus Area: Structured database PII identification

  • Description: Implements dictionary-based identification using column name patterns, regex matching, and metadata hints. Focuses on structured data within relational databases. Establishes foundational governance processes, documentation standards, and automated tagging mechanisms.

  • Key Deliverables: Data dictionaries for all PII categories; automated tagging rules; GDPR compliance documentation; initial data inventory; governance procedures

Phase 2: Content Inspection (Future)

  • Focus Area: Actual data value analysis

  • Description: Expands beyond metadata to inspect actual data values using sampling, statistical analysis, and machine learning. Detects PII in free-text fields, comments, and unstructured content. Implements Named Entity Recognition (NER) for contextual PII identification.

  • Key Deliverables: Content inspection rules; ML-based classifiers; false positive reduction; unstructured data coverage; enhanced accuracy metrics

Phase 3: Data Flow Mapping (Future)

  • Focus Area: End-to-end lineage tracking

  • Description: Maps PII movement through ETL pipelines, APIs, reports, exports, and data integrations. Identifies downstream systems receiving PII. Tracks data transformations, aggregations, and derivations to understand the complete data lifecycle.

  • Key Deliverables: Complete data lineage maps; API/interface PII exposure analysis; ETL pipeline documentation; report/dashboard PII tracking; cross-system impact analysis

Phase 4: Privacy Operations Integration (Future)

  • Focus Area: Enterprise-wide privacy ecosystem

  • Description: Integrates PII identification with data masking, access controls, consent management, and data subject request fulfillment. Implements automated breach notification scope assessment, retention policy enforcement, and continuous compliance monitoring.

  • Key Deliverables: Integrated masking policies; automated DSR fulfillment; consent tracking integration; breach assessment automation; real-time compliance dashboards


GDPR - PII Dictionary

This workshop comprehensively covers Phase 1, providing all necessary resources, templates, procedures, and documentation to successfully implement pattern-based PII discovery.
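As a rough illustration of what pattern-based discovery means in practice, the sketch below matches column names against a small regex dictionary grouped by risk tier. The pattern lists, the `PII_PATTERNS` name, and the `classify_column` helper are hypothetical and are not the workshop's actual dictionary; the tiers simply follow the high/medium/low categories this workshop uses.

```python
import re

# Hypothetical dictionary: risk tier -> regex patterns matched against column names.
# Illustrative only; a real dictionary would be larger and reviewed by the data owner.
PII_PATTERNS = {
    "HIGH": [r"e[-_]?mail", r"phone", r"ssn|national_?id", r"password",
             r"credit_?card|card_?number"],
    "MEDIUM": [r"(first|last|middle)_?name", r"address", r"birth_?date",
               r"postal_?code"],
    "LOW": [r"^title$", r"suffix", r"gender", r"marital_?status"],
}

# Compile once, case-insensitively, so "EmailAddress" and "email_address" both match.
COMPILED = {
    tier: [re.compile(pattern, re.IGNORECASE) for pattern in patterns]
    for tier, patterns in PII_PATTERNS.items()
}


def classify_column(column_name):
    """Return the first matching tier (HIGH, then MEDIUM, then LOW), or None."""
    for tier in ("HIGH", "MEDIUM", "LOW"):
        if any(rx.search(column_name) for rx in COMPILED[tier]):
            return tier
    return None


for name in ("EmailAddress", "BirthDate", "MaritalStatus", "ProductID"):
    print(name, "->", classify_column(name))
```

Checking HIGH first means a name such as EmailAddress is tagged by its most sensitive match rather than by the generic "address" pattern. Name-only matching also produces false positives (a column like EmailPromotion would be tagged HIGH here), which is the kind of error content inspection in Phase 2 is meant to reduce.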

Database Inventory

Objective: Create a complete inventory of all database schemas and tables to ensure comprehensive coverage.

Why This Matters: GDPR compliance requires documenting all data sources within the organization. This inventory forms the foundation for regulatory compliance.

Schema Analysis

The first task in our journey is to conduct a comprehensive database inventory, because data governance depends on a complete inventory of all data assets.

For Phase 1 we're going to take the traditional route and run some SQL scripts; later we'll take a look at a 'Project' that leverages ML.

Why This Matters: GDPR compliance requires documenting all data sources within the organization. This enhanced inventory not only catalogs schemas but automatically identifies PII exposure, calculates risk scores, and prioritizes compliance efforts based on actual data sensitivity.
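The inventory step boils down to grouping column metadata by schema. The sketch below is an assumption about the general shape of such a step, not the workshop's script: it shows a query against the standard INFORMATION_SCHEMA.COLUMNS view for context only, then aggregates a few hand-typed sample rows in Python. `INVENTORY_SQL`, `SAMPLE_ROWS`, and `build_inventory` are hypothetical names.

```python
from collections import defaultdict

# Shown for context only and not executed here: a query of this shape would
# supply (schema, table, column) rows from the database's metadata views.
INVENTORY_SQL = """
SELECT TABLE_SCHEMA, TABLE_NAME, COLUMN_NAME
FROM INFORMATION_SCHEMA.COLUMNS
ORDER BY TABLE_SCHEMA, TABLE_NAME, ORDINAL_POSITION;
"""

# Hand-typed sample rows standing in for the query result (not real output).
SAMPLE_ROWS = [
    ("Person", "Person", "FirstName"),
    ("Person", "Person", "LastName"),
    ("Person", "EmailAddress", "EmailAddress"),
    ("HumanResources", "Employee", "BirthDate"),
    ("HumanResources", "Employee", "Gender"),
]


def build_inventory(rows):
    """Count distinct tables and total columns per schema."""
    tables = defaultdict(set)
    columns = defaultdict(int)
    for schema, table, _column in rows:
        tables[schema].add(table)
        columns[schema] += 1
    return {
        schema: {"tables": len(tables[schema]), "columns": columns[schema]}
        for schema in sorted(tables)
    }


for schema, counts in build_inventory(SAMPLE_ROWS).items():
    print(schema, counts)
```

The per-schema column total matters later because the risk score is expressed relative to the number of columns in each schema.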

Schema Analysis - Version 2

Expected Results

AdventureWorks2022 contains 5 main schemas plus dbo; the automated PII risk assessment for each is shown below:

| Schema | Tables | Rows | Size (MB) | High Risk PII | Medium Risk PII | PII Risk Score | Priority | Data Category |
|---|---|---|---|---|---|---|---|---|
| Person | 13 | ~1,755,295 | ~460 | 8 | 13 | 212.86 | CRITICAL | Personal Data |
| HumanResources | 6 | ~8,179 | ~6.5 | 1 | 1 | 52.5 | CRITICAL | Employee Data |
| Sales | 19 | ~2,658,063 | ~287 | 4 | 2 | 36.5 | HIGH | Transaction Data |
| Purchasing | 5 | ~155,373 | ~16 | 1 | 0 | 20.41 | HIGH | Operational Data |
| Production | 25 | ~3,417,306 | ~251 | 1 | 1 | 10.06 | MEDIUM | Operational Data |
| dbo | 3 | ~38,308 | ~51 | 0 | 0 | 0.00 | LOW | System Data |

Understanding the Output

  • high_risk_pii_columns: Count of columns containing emails, phones, SSN, passwords, credit cards

  • medium_risk_pii_columns: Count of columns with names, addresses, birth dates, postal codes

  • low_risk_pii_columns: Count of columns with titles, suffixes, gender, marital status

  • pii_risk_score: Weighted calculation (High×10 + Medium×5 + Low×2) ÷ Total Columns × 100

  • compliance_priority: Automatic prioritization using the following thresholds:

    • CRITICAL: 3+ high-risk columns OR risk score ≥ 40

    • HIGH: Any high-risk columns OR risk score ≥ 15

    • MEDIUM: Any medium-risk columns OR risk score ≥ 5

    • LOW: Only low-risk columns OR risk score < 5

  • data_category: Automatic classification based on schema naming patterns
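The score formula and the thresholds above can be written out as a short worked example. This is a minimal sketch, assuming the thresholds are evaluated top-down with the first match winning; the function names are hypothetical. For the Person row, the 8 high-risk and 13 medium-risk counts come from the results table, while the 2 low-risk columns and 70 total columns are assumed values chosen only for illustration. The workshop script itself is not shown, and applying these rules literally would rate Sales (4 high-risk columns) as CRITICAL rather than HIGH, so the real script may evaluate them differently.

```python
def pii_risk_score(high, medium, low, total_columns):
    """(High*10 + Medium*5 + Low*2) / Total Columns * 100, rounded to 2 places."""
    if total_columns == 0:
        return 0.0  # assumption: a schema with no columns scores zero
    weighted = high * 10 + medium * 5 + low * 2
    return round(weighted * 100 / total_columns, 2)


def compliance_priority(high, medium, score):
    """Apply the listed thresholds top-down; the first match wins (assumed order)."""
    if high >= 3 or score >= 40:
        return "CRITICAL"
    if high >= 1 or score >= 15:
        return "HIGH"
    if medium >= 1 or score >= 5:
        return "MEDIUM"
    return "LOW"


# Person schema: high=8 and medium=13 are from the table above;
# low=2 and total_columns=70 are ASSUMED for illustration only.
person_score = pii_risk_score(8, 13, 2, 70)
print(person_score, compliance_priority(8, 13, person_score))  # 212.86 CRITICAL
```

Because a single high-risk column contributes 10 points before the ×100 scaling, the score is not a percentage and can exceed 100, as the Person schema's 212.86 shows.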

Key Insights

  1. Person schema is CRITICAL priority with a PII risk score of 212.86: it contains 8 high-risk and 13 medium-risk PII columns spanning 1.7M+ records. This should be your primary focus for dictionary creation.

  2. HumanResources schema also requires immediate CRITICAL attention with a risk score of 52.5 due to employee data sensitivity.

  3. Sales and Purchasing schemas are HIGH priority, containing customer contact information and vendor data.

  4. Production schema has MEDIUM priority with minimal PII exposure (1 high, 1 medium risk column).

  5. dbo schema shows LOW priority with no PII columns detected; it likely contains system/configuration data.

Deliverable:

  1. Export this result set to Excel for reference throughout the project

  2. Use the compliance_priority column to sequence your dictionary creation work

  3. Document the pii_risk_score baseline for quarterly tracking of new PII exposure
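For the quarterly tracking in step 3, one possible approach is to keep the baseline scores and compare them against each later run. The sketch below is an assumption about how that comparison could look: `BASELINE` holds the scores from the expected results above, while the `score_changes` helper, the tolerance, and the "current" values (including the Staging schema) are hypothetical.

```python
# Baseline pii_risk_score per schema, copied from the expected results above.
BASELINE = {
    "Person": 212.86,
    "HumanResources": 52.5,
    "Sales": 36.5,
    "Purchasing": 20.41,
    "Production": 10.06,
    "dbo": 0.0,
}


def score_changes(baseline, current, tolerance=0.01):
    """Return {schema: (old, new)} where the score rose or the schema is new."""
    changes = {}
    for schema, new_score in current.items():
        old_score = baseline.get(schema)
        if old_score is None or new_score - old_score > tolerance:
            changes[schema] = (old_score, new_score)
    return changes


# Hypothetical next-quarter run: Sales gained PII columns, Staging is a new schema.
current = dict(BASELINE, Sales=41.0, Staging=12.5)
for schema, (old, new) in score_changes(BASELINE, current).items():
    print(schema, old, "->", new)
```

Only increases and new schemas are reported here, since the stated goal is to track new PII exposure; a drop in score would need a separate check if it matters for your audit trail.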


  1. In DBeaver run the following script:
