Personal Data Identifier Dictionary (GDPR Compliance)
This comprehensive workshop demonstrates how to create data dictionaries that not only improve data discovery but also ensure regulatory compliance and support robust data governance frameworks.
Using Microsoft's AdventureWorks2022 database as our foundation, participants will learn to build data dictionaries that meet the stringent requirements of GDPR, SOX, and other regulatory frameworks while establishing best practices for ongoing metadata management.
By the end of this workshop, you will be able to:
Create comprehensive data dictionaries that support regulatory compliance requirements
Understand the relationship between data dictionaries, data lineage, and metadata management
Implement data classification schemes that meet industry-specific regulatory standards
Establish processes for dictionary maintenance that ensure ongoing compliance
Integrate dictionaries with broader data governance and security frameworks
Design audit trails and documentation that satisfy regulatory requirements
Why Regulatory Compliance Matters for Data Dictionaries
Organizations must comply with data privacy regulations that are projected to cover the personal data of 75% of the world's population by the end of 2024, making data classification through dictionaries a critical compliance tool. Data dictionaries serve as the foundation for:
Data Classification: Visual labeling, metadata application, and automated data discovery to meet compliance requirements
Regulatory Reporting: Providing audit trails for data protection impact assessments (DPIAs)
Risk Management: Guarding against accidental data loss and enabling early detection of risky user behavior
Business Purpose: GDPR requires explicit identification and special handling of personal data. This dictionary automatically classifies columns containing personal identifiers to ensure proper data handling and support data subject rights.
The process follows a structured, four-phase approach that progressively builds capabilities from foundational pattern-based detection through to enterprise-wide privacy operations integration.
Phase 1 establishes the critical foundation by implementing dictionary-based identification using metadata hints, pattern matching, and automated tagging within structured databases. This initial phase enables organizations to quickly identify and classify the majority of obvious PII while building the governance framework, processes, and documentation required for regulatory compliance. Upon completion of Phase 1, organizations should plan for
Phase 2 (content inspection and sampling),
Phase 3 (dataflow mapping and lineage tracking), and
Phase 4 (full integration with privacy operations), which progressively enhance detection accuracy, expand coverage to unstructured data, and integrate PII management into broader enterprise systems.
Phase 1: Pattern-Based Discovery (This Workshop)
Structured database PII identification
Implements dictionary-based identification using column name patterns, regex matching, and metadata hints. Focuses on structured data within relational databases. Establishes foundational governance processes, documentation standards, and automated tagging mechanisms.
• Data dictionaries for all PII categories
• Automated tagging rules
• GDPR compliance documentation
• Initial data inventory
• Governance procedures
Phase 2: Content Inspection (Future)
Actual data value analysis
Expands beyond metadata to inspect actual data values using sampling, statistical analysis, and machine learning. Detects PII in free-text fields, comments, and unstructured content. Implements Named Entity Recognition (NER) for contextual PII identification.
• Content inspection rules
• ML-based classifiers
• False positive reduction
• Unstructured data coverage
• Enhanced accuracy metrics
Phase 3: Data Flow Mapping (Future)
End-to-end lineage tracking
Maps PII movement through ETL pipelines, APIs, reports, exports, and data integrations. Identifies downstream systems receiving PII. Tracks data transformations, aggregations, and derivations to understand complete data lifecycle.
• Complete data lineage maps
• API/interface PII exposure analysis
• ETL pipeline documentation
• Report/dashboard PII tracking
• Cross-system impact analysis
Phase 4: Privacy Operations Integration (Future)
Enterprise-wide privacy ecosystem
Integrates PII identification with data masking, access controls, consent management, and data subject request fulfillment. Implements automated breach notification scope assessment, retention policy enforcement, and continuous compliance monitoring.
• Integrated masking policies
• Automated DSR fulfillment
• Consent tracking integration
• Breach assessment automation
• Real-time compliance dashboards
GDPR - PII Dictionary
This workshop comprehensively covers Phase 1, providing all necessary resources, templates, procedures, and documentation to successfully implement pattern-based PII discovery.
Database Inventory
Objective: Create a complete inventory of all database schemas and tables to ensure comprehensive coverage.
Why This Matters: GDPR compliance requires documenting all data sources within the organization. This inventory forms the foundation for regulatory compliance.
Schema Analysis
The first task in our journey is to conduct a comprehensive database inventory for compliance.
Data governance requires a complete inventory of all data assets to ensure comprehensive coverage. This inventory forms the foundation for regulatory compliance by documenting all data sources within the organization.
For Phase 1 we're going to take the traditional route and run some SQL scripts - later we'll take a look at a 'Project' that leverages ML.
Why This Matters: This enhanced inventory not only catalogs schemas but automatically identifies PII exposure, calculates risk scores, and prioritizes compliance efforts based on actual data sensitivity.
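The workshop's inventory script is not reproduced on this page. As a hedged sketch of the approach, a query over SQL Server's catalog views can count tables and pattern-matched PII columns per schema; the LIKE patterns and risk buckets below are illustrative assumptions, not the workshop's exact script:

```sql
-- Illustrative sketch only: per-schema inventory with pattern-based PII counts.
-- The name patterns and risk buckets are assumptions, not the workshop script.
SELECT
    s.name AS schema_name,
    COUNT(DISTINCT t.name) AS table_count,
    SUM(CASE WHEN c.name LIKE '%Email%'
              OR c.name LIKE '%Phone%'
              OR c.name LIKE '%Password%'
              OR c.name LIKE '%CardNumber%' THEN 1 ELSE 0 END) AS high_risk_pii_columns,
    SUM(CASE WHEN c.name LIKE '%FirstName%'
              OR c.name LIKE '%LastName%'
              OR c.name LIKE '%Address%'
              OR c.name LIKE '%BirthDate%' THEN 1 ELSE 0 END) AS medium_risk_pii_columns
FROM sys.schemas s
JOIN sys.tables  t ON t.schema_id = s.schema_id
JOIN sys.columns c ON c.object_id = t.object_id
GROUP BY s.name
ORDER BY high_risk_pii_columns DESC;
```

Running a query of this shape in DBeaver against AdventureWorks2022 yields one row per schema, which the real script then enriches with row counts, risk scores, and priorities.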

| Schema | Tables | Est. Rows | Size (MB) | High-Risk PII Columns | Medium-Risk PII Columns | PII Risk Score | Compliance Priority | Data Category |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Person | 13 | ~1,755,295 | ~460 | 8 | 13 | 212.86 | CRITICAL | Personal Data |
| HumanResources | 6 | ~8,179 | ~6.5 | 1 | 1 | 52.5 | CRITICAL | Employee Data |
| Sales | 19 | ~2,658,063 | ~287 | 4 | 2 | 36.5 | HIGH | Transaction Data |
| Purchasing | 5 | ~155,373 | ~16 | 1 | 0 | 20.41 | HIGH | Operational Data |
| Production | 25 | ~3,417,306 | ~251 | 1 | 1 | 10.06 | MEDIUM | Operational Data |
| dbo | 3 | ~38,308 | ~51 | 0 | 0 | 0.00 | LOW | System Data |
Understanding the Output
high_risk_pii_columns: Count of columns containing emails, phones, SSN, passwords, credit cards
medium_risk_pii_columns: Count of columns with names, addresses, birth dates, postal codes
low_risk_pii_columns: Count of columns with titles, suffixes, gender, marital status
pii_risk_score: Weighted calculation (High×10 + Medium×5 + Low×2) ÷ Total Columns × 100
compliance_priority: Automatic prioritization using the following thresholds:
CRITICAL: 3+ high-risk columns OR risk score ≥ 40
HIGH: Any high-risk columns OR risk score ≥ 15
MEDIUM: Any medium-risk columns OR risk score ≥ 5
LOW: Only low-risk columns OR risk score < 5
data_category: Automatic classification based on schema naming patterns
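Expressed in T-SQL, the score formula and priority thresholds above might look like the following sketch. The sample counts for Person (2 low-risk columns, 70 total columns) are assumptions chosen only to illustrate the arithmetic:

```sql
-- Illustrative sketch: pii_risk_score and compliance_priority thresholds.
-- low_cnt = 2 and total_cols = 70 are assumed sample values for Person.
WITH counts AS (
    SELECT * FROM (VALUES ('Person', 8, 13, 2, 70))
        AS v(schema_name, high_cnt, med_cnt, low_cnt, total_cols)
)
SELECT
    schema_name,
    CAST((high_cnt * 10 + med_cnt * 5 + low_cnt * 2) * 100.0
         / total_cols AS DECIMAL(10, 2)) AS pii_risk_score,
    CASE
        WHEN high_cnt >= 3
          OR (high_cnt * 10 + med_cnt * 5 + low_cnt * 2) * 100.0 / total_cols >= 40 THEN 'CRITICAL'
        WHEN high_cnt >= 1
          OR (high_cnt * 10 + med_cnt * 5 + low_cnt * 2) * 100.0 / total_cols >= 15 THEN 'HIGH'
        WHEN med_cnt >= 1
          OR (high_cnt * 10 + med_cnt * 5 + low_cnt * 2) * 100.0 / total_cols >= 5  THEN 'MEDIUM'
        ELSE 'LOW'
    END AS compliance_priority
FROM counts;
-- With these sample counts: (8*10 + 13*5 + 2*2) * 100 / 70 = 212.86 -> CRITICAL
```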
Key Insights
Person schema is CRITICAL priority with a PII risk score of 212.86 - contains 8 high-risk and 13 medium-risk PII columns spanning 1.7M+ records. This should be your primary focus for dictionary creation.
HumanResources schema also requires immediate CRITICAL attention with a risk score of 52.5 due to employee data sensitivity.
Sales and Purchasing schemas are HIGH priority containing customer contact information and vendor data.
Production schema has MEDIUM priority with minimal PII exposure (1 high, 1 medium risk column).
dbo schema shows LOW priority with no PII columns detected - likely contains system/configuration data.
Deliverable:
Export this result set to Excel for reference throughout the project
Use the compliance_priority column to sequence your dictionary creation work
Document the pii_risk_score baseline for quarterly tracking of new PII exposure
In DBeaver run the following script:
Column Analysis
Data classification is fundamental to compliance with GDPR, HIPAA, and other regulations that require special handling of sensitive data categories. This automated classification helps identify data elements that require enhanced protection and specific dictionary treatment.
Compliance Impact: This analysis identifies more than 45 potentially sensitive columns across the database, including:
HIGH Sensitivity: Email addresses, phone numbers, credit card information
MEDIUM Sensitivity: Names, addresses, demographic data
LOW Sensitivity: Product codes, business identifiers
Why This Matters: This automated approach ensures comprehensive coverage and reduces the manual review time that column-by-column classification would otherwise require.
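The column analysis script referenced here is not reproduced on this page; a hedged sketch of the approach classifies each column by name pattern against the catalog views. The specific patterns, including the context-aware PII_LOW rule for Title/Suffix, are illustrative assumptions:

```sql
-- Illustrative sketch: column-level sensitivity classification by name pattern.
-- Patterns are coarse; expect false positives (e.g. product Name columns)
-- that Phase 2 content inspection is designed to resolve.
SELECT cls.*
FROM (
    SELECT
        s.name  AS schema_name,
        t.name  AS table_name,
        c.name  AS column_name,
        ty.name AS data_type,
        c.max_length,
        c.is_nullable,
        CASE
            WHEN c.name LIKE '%Password%' OR c.name LIKE '%SSN%'       THEN 'PII_CRITICAL'
            WHEN c.name LIKE '%Email%'    OR c.name LIKE '%BirthDate%' THEN 'PII_HIGH'
            WHEN c.name LIKE '%Name%'     OR c.name LIKE '%Phone%'     THEN 'PII_MEDIUM'
            WHEN c.name IN ('Title', 'Suffix')
                 AND s.name IN ('Person', 'HumanResources')            THEN 'PII_LOW'
        END AS sensitivity_level
    FROM sys.columns c
    JOIN sys.tables  t  ON t.object_id = c.object_id
    JOIN sys.schemas s  ON s.schema_id = t.schema_id
    JOIN sys.types   ty ON ty.user_type_id = c.user_type_id
) cls
WHERE cls.sensitivity_level IS NOT NULL
ORDER BY cls.schema_name, cls.table_name, cls.column_name;
```

The schema filter on the PII_LOW branch shows how context-aware detection keeps Production's product titles out of the results.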

Expected Results
More than 45 sensitive columns are identified across AdventureWorks2022 with automated risk assessment:
| Schema | Sensitivity Level | Columns | Example Columns | Risk Impact Score | Dictionary Priority |
| --- | --- | --- | --- | --- | --- |
| Person | PII_HIGH | 8 | EmailAddress, BirthDate, AddressLine1 | 150-250 | URGENT/HIGH |
| Person | PII_MEDIUM | 15 | FirstName, LastName, PhoneNumber, City | 80-150 | HIGH/MEDIUM |
| Person | PII_LOW | 3 | Title, Suffix | 30-50 | LOW |
| HumanResources | PII_MEDIUM | 5 | JobTitle, HireDate | 40-80 | MEDIUM |
| Sales | PII_MEDIUM | 8 | CustomerID (string), SalesPersonID | 60-120 | MEDIUM |
| Sales | PII_HIGH | 2 | CreditCardApprovalCode | 100-150 | HIGH |
Understanding the Output
schema_name, table_name, column_name: Identifies the exact location of sensitive data
data_type: SQL data type (helps validate PII detection accuracy)
max_length: Maximum field length (context for string data)
is_nullable: Whether column allows NULL values (data quality indicator)
sensitivity_level: Risk classification
PII_CRITICAL: Passwords, SSN, credit cards, medical records - requires immediate protection
PII_HIGH: Email, birth dates, full addresses, biometric data - direct identifiers
PII_MEDIUM: Names, phone numbers, city, salary - indirect identifiers when combined
PII_LOW: Titles, country, language - minimal risk, still personal data
data_category: Functional grouping for dictionary organization
contact_info - Email, phone, fax
personal_name - First, last, middle names
location_data - Addresses, GPS coordinates
sensitive_personal - Health, biometric, ethnicity (GDPR Article 9)
financial_data - Credit cards, salary, tax information
authentication - Passwords, login credentials
government_id - SSN, passport, driver's license
personal_title - Mr., Dr., Jr.
identifier - Employee ID, Customer ID
demographic - Gender, marital status, language
gdpr_article: Maps to GDPR compliance requirements
Art_9_Special_Categories - Requires explicit consent and heightened protection
Art_32_Security - Mandates technical security measures
Art_4_1_Personal_Data - General personal data definition
recommended_retention: Suggested data lifecycle
Active_Account_Only - Delete immediately when account closes
7_years - Standard business/financial records retention
10_years - Medical/health records retention
Indefinite - Core identity data with legitimate ongoing need
5_years - Default retention for other personal data
table_row_count: Number of records in the table (impact assessment)
risk_impact_score: Calculated metric combining sensitivity × data volume
Formula: Sensitivity weight (10-100) × LOG10(row_count + 1)
Higher scores indicate both sensitive data AND high volume
Used to prioritize remediation efforts
dictionary_priority: Action urgency for dictionary creation
IMMEDIATE: PII_CRITICAL with >10K rows - start today
URGENT: PII_CRITICAL or PII_HIGH with >50K rows - start this week
HIGH: PII_HIGH or PII_MEDIUM with >100K rows - start this month
MEDIUM: PII_MEDIUM standard volume - quarterly cycle
LOW: PII_LOW - routine monitoring
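The risk_impact_score formula above can be sketched in T-SQL as follows. The sensitivity weights (100/50/25/10) and the sample row counts are assumptions for illustration; the document only states that weights range from 10 to 100:

```sql
-- Illustrative sketch: risk_impact_score = sensitivity weight * LOG10(rows + 1).
-- The weights and sample row counts below are assumed, not the workshop's values.
WITH sample AS (
    SELECT * FROM (VALUES ('PII_HIGH', 19000), ('PII_MEDIUM', 121000))
        AS v(sensitivity_level, table_row_count)
)
SELECT
    sensitivity_level,
    table_row_count,
    CASE sensitivity_level
        WHEN 'PII_CRITICAL' THEN 100
        WHEN 'PII_HIGH'     THEN 50
        WHEN 'PII_MEDIUM'   THEN 25
        ELSE 10
    END * LOG10(table_row_count + 1) AS risk_impact_score
FROM sample;
```

Because LOG10 grows slowly, the score rewards sensitivity far more than raw volume, which is why a 19K-row email table can outrank a million-row operational table.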
Key Insights
Person.EmailAddress emerges as URGENT priority with PII_HIGH classification and ~19,000 records, yielding a risk impact score of approximately 206. This should be the first dictionary created.
Name columns (FirstName, LastName, MiddleName) across Person schema total 15+ columns classified as PII_MEDIUM with similar high volume, making them HIGH priority for consolidated dictionary treatment.
HumanResources schema contains employee salary and hire date information classified as PII_MEDIUM, requiring separate dictionary handling due to different legal retention requirements (7 years vs indefinite).
No PII_CRITICAL columns detected in AdventureWorks2022, which is expected for a sample database. Production databases typically contain passwords (hashed), credit card numbers, or SSN requiring immediate protection.
Contact information dominates with 23+ columns across email, phone, and address categories - these can be grouped into 3 related dictionaries: Contact Info, Location Data, and Personal Names.
Sales schema shows CustomerID as string-based (varchar), flagged as PII_MEDIUM since string IDs could be email addresses or usernames. Numeric customer IDs would not be flagged.
Title column only flagged in Person/HumanResources schemas due to context-aware detection - the same column name in Production schema (product titles) correctly excluded.
Quality Validation Checklist:
This detailed analysis provides the foundation for creating targeted, effective data dictionaries in Stage 2.
Deliverable
Export Results to CSV:
Right-click query results → Export → CSV format
Save as sensitive_columns_detail_[DATE].csv
This becomes your master PII inventory document
Create Prioritized Action Plan:
Document False Positives/Negatives:
Review results for incorrectly classified columns
Document true PII columns missed by pattern matching
Note any context-specific exceptions
Feed findings back into pattern refinement
Create Data Category Summary:
Prepare for Stage 1.3 (Data Lineage Analysis):
Use
table_namecolumn to identify high-impact tablesTables with 5+ PII columns require detailed lineage mapping
Focus on Person.Person, Person.Address, Person.EmailAddress
Generate Executive Summary:
Data lineage is crucial for compliance auditing and impact analysis, helping organizations understand how changes to data affect downstream processes. This relationship analysis forms the foundation for creating dictionaries that understand data dependencies.
In DBeaver run the following script:
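The lineage script itself is not shown on this page. One hedged sketch of the relationship analysis counts inbound foreign-key references per table via SQL Server's catalog views (this counts FK constraints, which approximates but is not identical to distinct dependent tables):

```sql
-- Illustrative sketch: tables ranked by how many foreign keys reference them.
-- COUNT(fk.object_id) tallies FK constraints, an approximation of
-- "dependent tables" (a table referencing the same target twice counts twice).
SELECT
    s.name AS schema_name,
    t.name AS table_name,
    COUNT(fk.object_id) AS dependent_table_count
FROM sys.tables t
JOIN sys.schemas s ON s.schema_id = t.schema_id
LEFT JOIN sys.foreign_keys fk ON fk.referenced_object_id = t.object_id
GROUP BY s.name, t.name
ORDER BY dependent_table_count DESC;
```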
Expected Results: Key tables with high dependencies include:
Person.BusinessEntity (15+ dependent tables)
Person.Person (12+ dependent tables)
Production.Product (8+ dependent tables)
These high-dependency tables require special attention in dictionary design as changes affect multiple business processes.
GDPR requires organizations to identify all personal data and understand its usage context for data protection impact assessments. The frequency analysis helps prioritize dictionary terms based on actual business usage.
In DBeaver run the following script:
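The frequency-analysis script is not reproduced here; a hedged sketch counts how often each personal-data column name recurs across schemas, a rough proxy for how widely a dictionary term is used in the business (the name patterns are illustrative assumptions):

```sql
-- Illustrative sketch: recurrence of personal-data column names across schemas,
-- used to prioritize dictionary terms by breadth of business usage.
SELECT
    c.name AS column_name,
    COUNT(*) AS occurrence_count,
    COUNT(DISTINCT s.name) AS schema_count
FROM sys.columns c
JOIN sys.tables  t ON t.object_id = c.object_id
JOIN sys.schemas s ON s.schema_id = t.schema_id
WHERE c.name LIKE '%Name%'  OR c.name LIKE '%Email%'
   OR c.name LIKE '%Phone%' OR c.name LIKE '%Address%'
GROUP BY c.name
ORDER BY occurrence_count DESC;
```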
Why this format: Each column serves a specific compliance purpose:
compliance_classification: Supports automated data handling rules
gdpr_article: Links to specific GDPR requirements for audit trails
data_subject_rights: Defines which rights apply to each data type
retention_period: Supports automated data lifecycle management
business_context: Required for data protection impact assessments (DPIAs)
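The dictionary CSV itself is not reproduced on this page. Rows of roughly the following shape would carry the columns described above; every value here is illustrative, not the workshop's actual file:

```csv
value,compliance_classification,gdpr_article,data_subject_rights,retention_period,business_context
EmailAddress,PII_HIGH,Art_4_1_Personal_Data,"access,rectification,erasure,portability",Active_Account_Only,Customer contact channel
FirstName,PII_MEDIUM,Art_4_1_Personal_Data,"access,rectification,erasure",5_years,Customer identification
Title,PII_LOW,Art_4_1_Personal_Data,"access,rectification",5_years,Salutation in correspondence
```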
GDPR compliance requires detailed documentation of data processing activities, legal bases, and data subject rights. This JSON structure provides all necessary metadata for automated compliance monitoring and reporting.
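The JSON definition is not reproduced on this page. A structure along these lines could hold that metadata; the field names are illustrative assumptions, not the exact Pentaho Data Catalog schema, so treat this only as a sketch of the kind of information involved:

```json
{
  "name": "Personal Data Identifiers - GDPR Compliant",
  "category": "GDPR_Compliance",
  "similarityScore": 0.7,
  "metadataHints": [
    { "pattern": "[Ee]mail|[Ee]mail[Aa]ddress|contact_email", "score": 0.9 },
    { "pattern": "[Pp]hone|[Pp]hone[Nn]umber|[Tt]elephone", "score": 0.8 }
  ],
  "actions": {
    "columnTags": ["PII", "GDPR_Personal_Data", "Sensitive"],
    "tableTags": ["Contains_Personal_Data"]
  }
}
```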
Personal Data Identifiers Dictionary
This method demonstrates creating dictionaries entirely through the Pentaho Data Catalog interface by uploading CSV files and configuring JSON definitions through the UI.
The name and category are mandatory fields that help organize dictionaries for data governance. Categories assist in data discovery but are not related to Business Glossary categories.
Navigate to: Data Operations > Dictionaries
Click: Add Dictionary
Configure Basic Information:
Name: Personal Data Identifiers - GDPR Compliant
Description: Dictionary for identifying personal data elements subject to GDPR protection
Category: Select Sensitive or create new category GDPR_Compliance
Dictionary Status: Ensure it's enabled (toggle switch)
Select Upload Dictionary method
Upload CSV File: Upload personal_data_identifiers.csv
Set Confidence Score: 0.7 (70% confidence for similarity matching)
The confidence score represents how certain the system should be when matching data. For personal data, 0.7 provides a balance between catching sensitive data and avoiding false positives.

Set the following Column Name Regex - Metadata Hints.
Overall Column Name Confidence Score: 0.4
[Ff]irst[Nn]ame|[Ff]Name|given_name → 0.8
[Ll]ast[Nn]ame|[Ss]urname|family_name → 0.8
[Ee]mail|[Ee]mail[Aa]ddress|contact_email → 0.9
[Pp]hone|[Pp]hone[Nn]umber|[Tt]elephone → 0.8
Regex patterns serve as metadata hints that help identify columns by name patterns. The confidence score formula is:
Confidence score = (Similarity × weightage) + (Metadata hint score × weightage).
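As a worked example, assume the weightages are complementary (similarity 0.6, metadata 0.4, matching the 0.4 overall column-name score above — an assumption, since the workshop does not state the similarity weight). A column matching the email regex (hint score 0.9) with 0.75 content similarity would then score:

```
confidence = (0.75 × 0.6) + (0.9 × 0.4)
           = 0.45 + 0.36
           = 0.81
```

Under these assumed weights the column clears the 0.6 confidence-score condition configured below.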

Step 4: Configure Conditions
Create conditions that determine when to apply the dictionary:
Primary Condition:
Attribute: Confidence Score
Operator: Greater than or equal to
Value: 0.6
Secondary Condition (OR):
Attribute: Metadata Score
Operator: Greater than or equal to
Value: 0.7
Additional Condition (AND):
Attribute: Column Cardinality
Operator: Greater than
Value: 5 (ensures the column has enough data variety)
Why these conditions: This creates a flexible matching system where data can be classified either by content similarity or column name patterns, ensuring comprehensive coverage.
Step 5: Configure Actions
Define what happens when conditions are met:
Add Action - Assign Tags:
Tag 1: PII
Tag 2: GDPR_Personal_Data
Tag 3: Sensitive
Add Action - Assign Table Tags:
Table Tag: Contains_Personal_Data
Add Action - Assign Business Terms:
Browse and select relevant business terms from your glossary
Examples: Personal Identifier, Customer Data, Privacy Protected
Step 6: Create Dictionary
Click Create Dictionary to finalize the setup.
This method uses Pentaho Data Catalog's import functionality to upload ZIP files containing CSV and JSON dictionary definitions.
Pentaho Data Catalog requires dictionaries to be imported as ZIP files containing both CSV data and JSON configuration to minimize the risk of errors due to manual input.
Create the following Dictionary folders:
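The exact folder list is not shown on this page. A layout along these lines (names illustrative, modeled on the personal_data_identifiers.csv used in Method 1) pairs each dictionary's CSV with its JSON definition:

```
dictionaries/
└── personal_data_identifiers/
    ├── personal_data_identifiers.csv    <- dictionary term values
    └── personal_data_identifiers.json   <- matching rules, tags, actions
```

Each folder's contents are then zipped into one archive per dictionary for the import step below.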
Navigate to Dictionaries:
In the left navigation menu, click Data Operations
On the Data Identification Methods card, click Dictionaries
Import Process:
Click Import button on the Dictionaries page
Upload each ZIP file containing the dictionary CSV and JSON
Click Continue to start the upload process
Monitor progress using View Workers to track the import
Validation:
Verify each dictionary appears in the Dictionaries list
Check the Rules tab to review the JSON configuration logic
Confirm categories and tags are properly assigned