Personal Data Identifier Dictionary (GDPR Compliance)

Why Regulatory Compliance Matters for Data Dictionaries

By the end of 2024, data privacy regulations were expected to cover the personal data of 75% of the world's population, making data classification through dictionaries a critical compliance tool. Data dictionaries serve as the foundation for:

  • Data Classification: Visual labeling, metadata application, and automated data discovery to meet compliance requirements

  • Regulatory Reporting: Providing audit trails for data protection impact assessments (DPIAs)

  • Risk Management: Guarding against accidental data loss and enabling early detection of risky user behavior

Business Purpose: GDPR requires explicit identification and special handling of personal data. This dictionary automatically classifies columns containing personal identifiers to ensure proper data handling and support data subject rights.
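
To make this concrete, here is a minimal sketch of how such a dictionary could be represented in SQL. The table and column names (#pii_dictionary, term, pii_category, risk_level, confidence) are illustrative assumptions, not part of AdventureWorks or any specific tool:

-- Minimal sketch of a PII dictionary lookup table (all names are hypothetical).
-- Each row maps a column-name pattern to a PII category, risk level, and confidence.
CREATE TABLE #pii_dictionary (
    term         NVARCHAR(128),  -- column-name pattern (T-SQL LIKE syntax)
    pii_category NVARCHAR(50),   -- e.g. EMAIL, PHONE, NAME, BIRTHDATE
    risk_level   NVARCHAR(10),   -- HIGH / MEDIUM / LOW
    confidence   DECIMAL(3,2)    -- 0.00-1.00: how reliably the pattern indicates PII
);

INSERT INTO #pii_dictionary (term, pii_category, risk_level, confidence) VALUES
    ('%Email%',     'EMAIL',     'HIGH',   0.95),
    ('%Phone%',     'PHONE',     'HIGH',   0.90),
    ('%LastName%',  'NAME',      'MEDIUM', 0.85),
    ('%BirthDate%', 'BIRTHDATE', 'MEDIUM', 0.90);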

The process follows a structured, four-phase approach that progressively builds capabilities from foundational pattern-based detection through to enterprise-wide privacy operations integration.

Phase 1 establishes the critical foundation by implementing dictionary-based identification using metadata hints, pattern matching, and automated tagging within structured databases. This initial phase enables organizations to quickly identify and classify the majority of obvious PII while building the governance framework, processes, and documentation required for regulatory compliance. Upon completion of Phase 1, organizations should plan for Phase 2 (content inspection and sampling), Phase 3 (data flow mapping and lineage tracking), and Phase 4 (full integration with privacy operations), which progressively enhance detection accuracy, expand coverage to unstructured data, and integrate PII management into broader enterprise systems.

Phase 1: Pattern-Based Discovery (This Workshop)

Focus Area: Structured database PII identification

Description: Implements dictionary-based identification using column name patterns, regex matching, and metadata hints. Focuses on structured data within relational databases. Establishes foundational governance processes, documentation standards, and automated tagging mechanisms.

Key Deliverables:
  • Data dictionaries for all PII categories
  • Automated tagging rules
  • GDPR compliance documentation
  • Initial data inventory
  • Governance procedures

Phase 2: Content Inspection (Future)

Focus Area: Actual data value analysis

Description: Expands beyond metadata to inspect actual data values using sampling, statistical analysis, and machine learning. Detects PII in free-text fields, comments, and unstructured content. Implements Named Entity Recognition (NER) for contextual PII identification.

Key Deliverables:
  • Content inspection rules
  • ML-based classifiers
  • False positive reduction
  • Unstructured data coverage
  • Enhanced accuracy metrics

Phase 3: Data Flow Mapping (Future)

Focus Area: End-to-end lineage tracking

Description: Maps PII movement through ETL pipelines, APIs, reports, exports, and data integrations. Identifies downstream systems receiving PII. Tracks data transformations, aggregations, and derivations to understand the complete data lifecycle.

Key Deliverables:
  • Complete data lineage maps
  • API/interface PII exposure analysis
  • ETL pipeline documentation
  • Report/dashboard PII tracking
  • Cross-system impact analysis

Phase 4: Privacy Operations Integration (Future)

Focus Area: Enterprise-wide privacy ecosystem

Description: Integrates PII identification with data masking, access controls, consent management, and data subject request (DSR) fulfillment. Implements automated breach notification scope assessment, retention policy enforcement, and continuous compliance monitoring.

Key Deliverables:
  • Integrated masking policies
  • Automated DSR fulfillment
  • Consent tracking integration
  • Breach assessment automation
  • Real-time compliance dashboards


GDPR - PII Dictionary

This workshop comprehensively covers Phase 1, providing all necessary resources, templates, procedures, and documentation to successfully implement pattern-based PII discovery.
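
As a preview of the Phase 1 approach before the full inventory script later on this page, the minimal sketch below matches column names in the SQL Server system catalog against a few PII naming patterns. The pattern list is an illustrative subset; the complete script uses a much richer set:

-- Minimal Phase 1 sketch: find candidate PII columns by name pattern (illustrative subset).
SELECT
    SCHEMA_NAME(t.schema_id) AS schema_name,
    t.name AS table_name,
    c.name AS column_name,
    CASE
        WHEN c.name LIKE '%Email%' THEN 'EMAIL (HIGH)'
        WHEN c.name LIKE '%Phone%' THEN 'PHONE (HIGH)'
        WHEN c.name LIKE '%FirstName%' OR c.name LIKE '%LastName%' THEN 'NAME (MEDIUM)'
        WHEN c.name LIKE '%BirthDate%' THEN 'BIRTHDATE (MEDIUM)'
    END AS pii_category
FROM sys.tables t
INNER JOIN sys.columns c ON t.object_id = c.object_id
WHERE SCHEMA_NAME(t.schema_id) NOT IN ('sys', 'INFORMATION_SCHEMA')
  AND (c.name LIKE '%Email%' OR c.name LIKE '%Phone%'
       OR c.name LIKE '%FirstName%' OR c.name LIKE '%LastName%'
       OR c.name LIKE '%BirthDate%')
ORDER BY schema_name, table_name, column_name;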


Database Inventory

Objective: Create a complete inventory of all database schemas and tables to ensure comprehensive coverage.

Why This Matters: GDPR compliance requires documenting all data sources within the organization. This inventory forms the foundation for regulatory compliance.

Schema Analysis

The first task in our journey is to conduct a comprehensive database inventory for compliance. For Phase 1 we're going to take the traditional route and run some SQL scripts; later we'll take a look at a 'Project' that leverages ML.

The enhanced inventory produced here not only catalogs schemas but also automatically identifies PII exposure, calculates risk scores, and prioritizes compliance efforts based on actual data sensitivity.

Schema Analysis - Version 2 (Compliance-Focused)

Expected Results

AdventureWorks2022 contains five main business schemas plus dbo, each with automated PII risk assessment:

Schema          | Tables | Rows       | Size (MB) | High Risk PII | Medium Risk PII | PII Risk Score | Priority | Data Category
----------------|--------|------------|-----------|---------------|-----------------|----------------|----------|------------------
Person          | 13     | ~1,755,295 | ~460      | 8             | 13              | 212.86         | CRITICAL | Personal Data
HumanResources  | 6      | ~8,179     | ~6.5      | 1             | 1               | 52.5           | CRITICAL | Employee Data
Sales           | 19     | ~2,658,063 | ~287      | 4             | 2               | 36.5           | HIGH     | Transaction Data
Purchasing      | 5      | ~155,373   | ~16       | 1             | 0               | 20.41          | HIGH     | Operational Data
Production      | 25     | ~3,417,306 | ~251      | 1             | 1               | 10.06          | MEDIUM   | Operational Data
dbo             | 3      | ~38,308    | ~51       | 0             | 0               | 0.00           | LOW      | System Data

Understanding the Output

  • high_risk_pii_columns: Count of columns containing emails, phones, SSN, passwords, credit cards

  • medium_risk_pii_columns: Count of columns with names, addresses, birth dates, postal codes

  • low_risk_pii_columns: Count of columns with titles, suffixes, gender, marital status

  • pii_risk_score: Weighted calculation (High×10 + Medium×5 + Low×2) ÷ Total Columns × 100. The score is not capped at 100; PII-dense schemas such as Person can exceed it (see the worked example after this list)

  • compliance_priority: Automatic prioritization using the following thresholds:

    • CRITICAL: 3+ high-risk columns OR risk score ≥ 40

    • HIGH: Any high-risk columns OR risk score ≥ 15

    • MEDIUM: Any medium-risk columns OR risk score ≥ 5

    • LOW: Only low-risk columns OR risk score < 5

  • data_category: Automatic classification based on schema naming patterns
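
As a worked example of the scoring formula and priority thresholds (the column counts here are made up for illustration, not taken from the AdventureWorks results above):

-- Worked example with hypothetical counts:
-- a schema with 1 high-risk, 2 medium-risk, and 1 low-risk PII column out of 80 total columns.
SELECT
    CAST((1 * 10.0 + 2 * 5.0 + 1 * 2.0) / NULLIF(80, 0) * 100 AS DECIMAL(5,2)) AS pii_risk_score;
-- = (10 + 10 + 2) / 80 * 100 = 27.50
-- compliance_priority = HIGH: at least one high-risk column exists, and the score of 27.50
-- falls between the HIGH threshold (15) and the CRITICAL threshold (40).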

Key Insights

  1. Person schema is CRITICAL priority with a PII risk score of 212.86 - it contains 8 high-risk and 13 medium-risk PII columns spanning 1.7M+ records. This should be your primary focus for dictionary creation.

  2. HumanResources schema also requires immediate CRITICAL attention with a risk score of 52.5 due to employee data sensitivity.

  3. Sales and Purchasing schemas are HIGH priority, containing customer contact information and vendor data.

  4. Production schema has MEDIUM priority with minimal PII exposure (1 high-risk, 1 medium-risk column).

  5. dbo schema shows LOW priority with no PII columns detected - it likely contains system/configuration data.

Deliverables:

  1. Export this result set to Excel for reference throughout the project

  2. Use the compliance_priority column to sequence your dictionary creation work

  3. Document the pii_risk_score baseline for quarterly tracking of new PII exposure (one way to capture this baseline is sketched below)
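
One way to capture that baseline (deliverable 3) is a small tracking table holding a dated snapshot of each schema's score after every quarterly run. The table and column names below are hypothetical, not part of AdventureWorks or the script that follows:

-- Hypothetical baseline-tracking table (names are illustrative assumptions).
CREATE TABLE dbo.pii_risk_baseline (
    captured_on         DATE         NOT NULL,
    schema_name         SYSNAME      NOT NULL,
    pii_risk_score      DECIMAL(5,2) NOT NULL,
    compliance_priority NVARCHAR(10) NOT NULL,
    CONSTRAINT PK_pii_risk_baseline PRIMARY KEY (captured_on, schema_name)
);

-- After each quarterly run of the Version 2 query, record its output, for example:
INSERT INTO dbo.pii_risk_baseline (captured_on, schema_name, pii_risk_score, compliance_priority)
VALUES (CAST(GETDATE() AS DATE), 'Person', 212.86, 'CRITICAL');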


  1. In DBeaver, run the following script:

-- ============================================================================
-- ENHANCED SCHEMA ANALYSIS FOR GDPR COMPLIANCE - COMPLETE EDITION
-- ============================================================================
-- Purpose: Comprehensive database inventory with automated PII risk indicators
-- Target: SQL Server (AdventureWorks 2022 or similar)
-- Version: 2.1 (Updated with improved priority logic and detailed comments)
-- Author: Data Governance Team
-- Last Updated: 2025-10-04
-- ============================================================================
-- 
-- WHAT THIS SCRIPT DOES:
-- - Automatically scans all database schemas for PII-related columns
-- - Calculates risk scores based on column naming patterns
-- - Assigns compliance priorities (CRITICAL/HIGH/MEDIUM/LOW)
-- - Provides data volume metrics and relationship analysis
-- - Generates export-ready reports for compliance documentation
-- 
-- PREREQUISITES:
-- - SQL Server 2017+ for the STRING_AGG function (FOR XML PATH alternatives are provided below for 2016 and earlier)
-- - Read permissions on sys schema catalog views
-- - Database with profiled/populated tables for accurate row counts
-- 
-- ============================================================================

-- ============================================================================
-- VERSION 1: BASIC ENHANCED (Quick Overview with Size Metrics)
-- ============================================================================
-- USE CASE: Quick schema overview with data volume metrics
-- OUTPUT: Schema summary with row counts, sizes, and last modified dates
-- RUNTIME: Fast (<5 seconds for most databases)
-- WHEN TO USE: Daily monitoring, quick health checks, initial assessment
-- ============================================================================

SELECT
    -- Schema name from system catalog
    s.name AS schema_name,
    
    -- Count distinct tables in each schema
    COUNT(DISTINCT t.name) AS table_count,
    
    -- Sum all rows across tables in schema
    -- Uses sys.partitions for accurate row counts from indexes
    SUM(p.rows) AS total_rows,
    
    -- Calculate total storage size in MB
    -- total_pages = 8KB pages, convert to MB by: (pages * 8) / 1024
    SUM(CAST(a.total_pages AS BIGINT) * 8) / 1024.0 AS total_size_mb,
    
    -- Count all columns across all tables in schema
    COUNT(DISTINCT c.column_id) AS total_columns,
    
    -- Get most recent modification date across all tables
    MAX(t.modify_date) AS last_modified,
    
    -- Create ordered, comma-separated list of tables
    -- Note: STRING_AGG requires SQL Server 2017+. See alternative below for older versions.
    STRING_AGG(t.name, ', ') WITHIN GROUP (ORDER BY t.name) AS table_list
    
    -- ALTERNATIVE for SQL Server 2016 and earlier:
    -- STUFF((SELECT ', ' + t2.name
    --        FROM sys.tables t2
    --        WHERE t2.schema_id = s.schema_id
    --        ORDER BY t2.name
    --        FOR XML PATH(''), TYPE).value('.', 'NVARCHAR(MAX)'), 1, 2, '') AS table_list
    
FROM sys.schemas s
-- LEFT JOIN ensures we see schemas even if they have no tables
LEFT JOIN sys.tables t ON s.schema_id = t.schema_id
-- Get row counts from partitions (index 0=heap, 1=clustered index)
LEFT JOIN sys.partitions p ON t.object_id = p.object_id AND p.index_id IN (0, 1)
-- Get storage allocation data for size calculations
LEFT JOIN sys.allocation_units a ON p.partition_id = a.container_id
-- Get column information for column counts
LEFT JOIN sys.columns c ON t.object_id = c.object_id
-- Filter out system schemas that aren't relevant for GDPR analysis
WHERE s.name NOT IN ('sys', 'INFORMATION_SCHEMA', 'guest', 'db_owner', 'db_accessadmin', 
                      'db_securityadmin', 'db_ddladmin', 'db_backupoperator', 'db_datareader', 
                      'db_datawriter', 'db_denydatareader', 'db_denydatawriter')
GROUP BY s.name
ORDER BY 
    -- Custom sorting: PII-sensitive schemas first, then by row count
    CASE 
        WHEN s.name = 'Person' THEN 1              -- Highest priority
        WHEN s.name LIKE '%HR%' OR s.name LIKE '%Human%' THEN 2
        WHEN s.name LIKE '%Customer%' OR s.name LIKE '%Sales%' THEN 3
        ELSE 4                                      -- Everything else
    END,
    total_rows DESC;  -- Within same priority level, show largest first

-- ============================================================================
-- VERSION 2: COMPLIANCE-FOCUSED (⭐ RECOMMENDED FOR GDPR ANALYSIS)
-- ============================================================================
-- USE CASE: Complete PII risk assessment with automated prioritization
-- OUTPUT: Schema-level PII detection, risk scoring, and compliance priority
-- RUNTIME: Medium (~15-30 seconds for large databases)
-- WHEN TO USE: Quarterly compliance reviews, GDPR audits, dictionary planning
-- 
-- KEY FEATURES:
--   - Automatic PII column detection across 3 risk levels (high/medium/low)
--   - Calculated PII risk score (weighted; not capped at 100 for PII-dense schemas)
--   - Smart compliance prioritization (CRITICAL/HIGH/MEDIUM/LOW)
--   - Automatic data category classification
--   - Relationship complexity analysis
-- 
-- CSV EXPORT INSTRUCTIONS:
--   Method 1 (DBeaver): Run query → Right-click results → Export → CSV
--   Method 2 (SSMS): Run query → Right-click results → Save Results As → CSV
--   Method 3 (Command Line): Use the export wrapper at the bottom of this script
-- ============================================================================

-- CTE 1: SchemaMetrics - Gather basic metrics for all tables
WITH SchemaMetrics AS (
    SELECT
        s.name AS schema_name,
        t.name AS table_name,
        
        -- Row count from partition statistics
        -- More accurate than COUNT(*) and much faster
        SUM(p.rows) AS row_count,
        
        -- Table size calculation in MB
        -- total_pages includes data + indexes + LOB storage
        SUM(CAST(a.total_pages AS BIGINT) * 8) / 1024.0 AS size_mb,
        
        -- Column count per table
        COUNT(DISTINCT c.column_id) AS column_count,
        
        -- Metadata timestamps for change tracking
        t.modify_date,
        t.create_date
    FROM sys.schemas s
    INNER JOIN sys.tables t ON s.schema_id = t.schema_id
    LEFT JOIN sys.partitions p ON t.object_id = p.object_id AND p.index_id IN (0, 1)
    LEFT JOIN sys.allocation_units a ON p.partition_id = a.container_id
    LEFT JOIN sys.columns c ON t.object_id = c.object_id
    -- Exclude system schemas that have no business/personal data
    WHERE s.name NOT IN ('sys', 'INFORMATION_SCHEMA', 'guest', 'db_owner', 'db_accessadmin', 
                          'db_securityadmin', 'db_ddladmin', 'db_backupoperator', 'db_datareader', 
                          'db_datawriter', 'db_denydatareader', 'db_denydatawriter')
    GROUP BY s.name, t.name, t.modify_date, t.create_date
),

-- CTE 2: PIIIndicators - Detect PII columns using naming pattern analysis
-- This CTE categorizes columns into risk levels based on GDPR sensitivity
PIIIndicators AS (
    SELECT 
        SCHEMA_NAME(t.schema_id) AS schema_name,
        t.name AS table_name,
        
        -- HIGH RISK COLUMNS (PII_CRITICAL)
        -- These columns typically contain highly sensitive personal data
        -- GDPR Articles 4, 9 - Special categories requiring explicit consent
        COUNT(DISTINCT CASE 
            WHEN c.name LIKE '%Email%' OR c.name LIKE '%Mail%' THEN c.column_id
                -- Email addresses: Direct contact, unique identifier
            WHEN c.name LIKE '%Phone%' OR c.name LIKE '%Mobile%' THEN c.column_id
                -- Phone numbers: Direct contact information
            WHEN c.name LIKE '%SSN%' OR c.name LIKE '%National%ID%' THEN c.column_id
                -- National identifiers: Government ID, highly sensitive
            WHEN c.name LIKE '%Password%' OR c.name LIKE '%PWD%' THEN c.column_id
                -- Credentials: Security risk if exposed
            WHEN c.name LIKE '%Credit%' OR c.name LIKE '%Card%' THEN c.column_id
                -- Financial data: Payment card information
        END) AS high_risk_columns,
        
        -- MEDIUM RISK COLUMNS (PII_MEDIUM)
        -- Personal identifiers that combined can identify individuals
        -- GDPR Article 4(1) - Personal data definition
        COUNT(DISTINCT CASE 
            WHEN c.name LIKE '%FirstName%' OR c.name LIKE '%LastName%' THEN c.column_id
                -- Names: Core personal identifiers
            WHEN c.name LIKE '%MiddleName%' THEN c.column_id
                -- Middle names: Additional personal identifier
            WHEN c.name LIKE '%Address%' OR c.name LIKE '%Street%' THEN c.column_id
                -- Physical addresses: Location data
            WHEN c.name LIKE '%City%' OR c.name LIKE '%PostalCode%' OR c.name LIKE '%Zip%' THEN c.column_id
                -- Location identifiers: Geographic personal data
            WHEN c.name LIKE '%BirthDate%' OR c.name LIKE '%DOB%' THEN c.column_id
                -- Birth dates: Age-sensitive personal information
        END) AS medium_risk_columns,
        
        -- LOW RISK COLUMNS (PII_LOW)
        -- Indirect identifiers, less sensitive but still personal data
        COUNT(DISTINCT CASE 
            WHEN c.name LIKE '%Title%' AND SCHEMA_NAME(t.schema_id) = 'Person' THEN c.column_id
                -- Titles (Mr., Mrs., Dr.): Personal honorifics
            WHEN c.name LIKE '%Suffix%' THEN c.column_id
                -- Name suffixes (Jr., Sr., III): Family relationships
            WHEN c.name LIKE '%Gender%' OR c.name LIKE '%Sex%' THEN c.column_id
                -- Gender: Demographic personal data
            WHEN c.name LIKE '%Marital%' THEN c.column_id
                -- Marital status: Personal demographic information
        END) AS low_risk_columns
    FROM sys.tables t
    INNER JOIN sys.columns c ON t.object_id = c.object_id
    WHERE SCHEMA_NAME(t.schema_id) NOT IN ('sys', 'INFORMATION_SCHEMA')
    GROUP BY SCHEMA_NAME(t.schema_id), t.name
),

-- CTE 3: RelationshipCount - Analyze foreign key relationships
-- Used to understand data lineage and impact of schema changes
RelationshipCount AS (
    SELECT
        OBJECT_SCHEMA_NAME(fk.parent_object_id) AS schema_name,
        OBJECT_NAME(fk.parent_object_id) AS table_name,
        
        -- Outbound FKs: How many other tables this table references
        COUNT(DISTINCT fk.object_id) AS outbound_fk_count,
        
        -- Inbound FKs: How many tables reference this table
        COUNT(DISTINCT fk.referenced_object_id) AS inbound_fk_count
    FROM sys.foreign_keys fk
    GROUP BY OBJECT_SCHEMA_NAME(fk.parent_object_id), OBJECT_NAME(fk.parent_object_id)
)

-- MAIN QUERY: Aggregate all metrics at schema level
SELECT
    sm.schema_name,
    COUNT(DISTINCT sm.table_name) AS table_count,
    SUM(sm.row_count) AS total_rows,
    CAST(SUM(sm.size_mb) AS DECIMAL(10,2)) AS total_size_mb,
    SUM(sm.column_count) AS total_columns,
    
    -- PII Column Counts by Risk Level
    SUM(ISNULL(pii.high_risk_columns, 0)) AS high_risk_pii_columns,
    SUM(ISNULL(pii.medium_risk_columns, 0)) AS medium_risk_pii_columns,
    SUM(ISNULL(pii.low_risk_columns, 0)) AS low_risk_pii_columns,
    
    -- COMPLIANCE RISK SCORE CALCULATION (weighted score; can exceed 100 for PII-dense schemas)
    -- Formula: (High×10 + Medium×5 + Low×2) ÷ Total Columns × 100
    -- Rationale:
    --   - High risk weighted 10x (critical impact if breached)
    --   - Medium risk weighted 5x (significant impact)
    --   - Low risk weighted 2x (minor impact but still PII)
    --   - Normalized to total columns for fair comparison across schemas
    CAST(
        (SUM(ISNULL(pii.high_risk_columns, 0)) * 10.0 +
         SUM(ISNULL(pii.medium_risk_columns, 0)) * 5.0 +
         SUM(ISNULL(pii.low_risk_columns, 0)) * 2.0) / 
        NULLIF(SUM(sm.column_count), 0) * 100  -- NULLIF prevents divide by zero
    AS DECIMAL(5,2)) AS pii_risk_score,
    
    -- COMPLIANCE PRIORITY ASSIGNMENT (UPDATED LOGIC v2.1)
    -- Uses both absolute column counts AND risk scores for balanced assessment
    -- 
    -- CRITICAL: Schemas requiring immediate attention
    --   - 3+ high-risk columns (multiple sensitive data types present)
    --   - OR risk score ≥ 40 (>40% of columns are PII-weighted)
    --   Examples: Person, HumanResources schemas
    -- 
    -- HIGH: Schemas requiring priority attention
    --   - Any high-risk columns (at least one critical PII type)
    --   - OR risk score ≥ 15 (>15% PII density)
    --   Examples: Sales, Customer schemas
    -- 
    -- MEDIUM: Schemas requiring standard monitoring
    --   - Any medium-risk columns (personal identifiers present)
    --   - OR risk score ≥ 5 (>5% PII density)
    --   Examples: Operational schemas with occasional PII
    -- 
    -- LOW: Schemas with minimal PII exposure
    --   - Only low-risk columns OR risk score < 5
    --   Examples: Product, Configuration schemas
    CASE 
        WHEN (SUM(ISNULL(pii.high_risk_columns, 0)) >= 3) 
          OR (CAST((SUM(ISNULL(pii.high_risk_columns, 0)) * 10.0 +
                    SUM(ISNULL(pii.medium_risk_columns, 0)) * 5.0 +
                    SUM(ISNULL(pii.low_risk_columns, 0)) * 2.0) / 
                   NULLIF(SUM(sm.column_count), 0) * 100 AS DECIMAL(5,2)) >= 40) 
        THEN 'CRITICAL'
        WHEN (SUM(ISNULL(pii.high_risk_columns, 0)) > 0) 
          OR (CAST((SUM(ISNULL(pii.high_risk_columns, 0)) * 10.0 +
                    SUM(ISNULL(pii.medium_risk_columns, 0)) * 5.0 +
                    SUM(ISNULL(pii.low_risk_columns, 0)) * 2.0) / 
                   NULLIF(SUM(sm.column_count), 0) * 100 AS DECIMAL(5,2)) >= 15)
        THEN 'HIGH'
        WHEN (SUM(ISNULL(pii.medium_risk_columns, 0)) > 0) 
          OR (CAST((SUM(ISNULL(pii.high_risk_columns, 0)) * 10.0 +
                    SUM(ISNULL(pii.medium_risk_columns, 0)) * 5.0 +
                    SUM(ISNULL(pii.low_risk_columns, 0)) * 2.0) / 
                   NULLIF(SUM(sm.column_count), 0) * 100 AS DECIMAL(5,2)) >= 5)
        THEN 'MEDIUM'
        ELSE 'LOW'
    END AS compliance_priority,
    
    -- DATA CATEGORY CLASSIFICATION
    -- Automatic classification based on schema naming conventions
    -- Helps organize compliance efforts by data type
    CASE 
        WHEN sm.schema_name LIKE '%Person%' OR sm.schema_name LIKE '%Customer%' 
            THEN 'Personal Data'
        WHEN sm.schema_name LIKE '%HR%' OR sm.schema_name LIKE '%Human%' 
            OR sm.schema_name LIKE '%Employee%' 
            THEN 'Employee Data'
        WHEN sm.schema_name LIKE '%Sales%' OR sm.schema_name LIKE '%Order%' 
            THEN 'Transaction Data'
        WHEN sm.schema_name LIKE '%Financial%' OR sm.schema_name LIKE '%Payment%' 
            THEN 'Financial Data'
        ELSE 'Operational Data'
    END AS data_category,
    
    -- Latest modification timestamp across all tables in schema
    MAX(sm.modify_date) AS last_modified,
    
    -- Ordered table list for reference
    -- STRING_AGG requires SQL Server 2017+
    -- For SQL Server 2016 and earlier, use the commented alternative below
    STRING_AGG(sm.table_name, ', ') WITHIN GROUP (ORDER BY sm.table_name) AS table_list
    
    -- ALTERNATIVE for SQL Server 2016 and earlier (uncomment if needed):
    -- (SELECT STUFF((SELECT ', ' + sm2.table_name
    --                FROM SchemaMetrics sm2
    --                WHERE sm2.schema_name = sm.schema_name
    --                ORDER BY sm2.table_name
    --                FOR XML PATH(''), TYPE).value('.', 'NVARCHAR(MAX)'), 1, 2, '')) AS table_list
    
FROM SchemaMetrics sm
LEFT JOIN PIIIndicators pii 
    ON sm.schema_name = pii.schema_name 
    AND sm.table_name = pii.table_name
LEFT JOIN RelationshipCount rc 
    ON sm.schema_name = rc.schema_name 
    AND sm.table_name = rc.table_name
GROUP BY sm.schema_name

-- ORDER BY: Prioritize results for action planning
-- Primary sort: Compliance priority (CRITICAL first)
-- Secondary sort: Total rows (larger datasets first within same priority)
ORDER BY 
    CASE 
        WHEN (SUM(ISNULL(pii.high_risk_columns, 0)) >= 3) 
          OR (CAST((SUM(ISNULL(pii.high_risk_columns, 0)) * 10.0 +
                    SUM(ISNULL(pii.medium_risk_columns, 0)) * 5.0 +
                    SUM(ISNULL(pii.low_risk_columns, 0)) * 2.0) / 
                   NULLIF(SUM(sm.column_count), 0) * 100 AS DECIMAL(5,2)) >= 40) 
        THEN 1  -- CRITICAL schemas first
        WHEN (SUM(ISNULL(pii.high_risk_columns, 0)) > 0) 
          OR (CAST((SUM(ISNULL(pii.high_risk_columns, 0)) * 10.0 +
                    SUM(ISNULL(pii.medium_risk_columns, 0)) * 5.0 +
                    SUM(ISNULL(pii.low_risk_columns, 0)) * 2.0) / 
                   NULLIF(SUM(sm.column_count), 0) * 100 AS DECIMAL(5,2)) >= 15)
        THEN 2  -- HIGH schemas second
        WHEN (SUM(ISNULL(pii.medium_risk_columns, 0)) > 0) 
          OR (CAST((SUM(ISNULL(pii.high_risk_columns, 0)) * 10.0 +
                    SUM(ISNULL(pii.medium_risk_columns, 0)) * 5.0 +
                    SUM(ISNULL(pii.low_risk_columns, 0)) * 2.0) / 
                   NULLIF(SUM(sm.column_count), 0) * 100 AS DECIMAL(5,2)) >= 5)
        THEN 3  -- MEDIUM schemas third
        ELSE 4  -- LOW schemas last
    END,
    total_rows DESC;

-- ============================================================================
-- CSV EXPORT WRAPPER FOR VERSION 2
-- ============================================================================
-- INSTRUCTIONS: 
-- 1. Uncomment the section below to export directly to CSV
-- 2. Update the file path to your desired location
-- 3. Ensure SQL Server has write permissions to the target directory
-- 4. Run this entire script block
-- ============================================================================

/*
-- Enable xp_cmdshell if not already enabled (requires admin rights)
-- EXEC sp_configure 'show advanced options', 1;
-- RECONFIGURE;
-- EXEC sp_configure 'xp_cmdshell', 1;
-- RECONFIGURE;

-- Export to CSV using BCP utility
DECLARE @sql NVARCHAR(MAX);
DECLARE @outputPath NVARCHAR(500) = 'C:\Exports\GDPR_Schema_Analysis.csv';  -- UPDATE THIS PATH

SET @sql = 'bcp "SELECT * FROM ' + DB_NAME() + '.dbo.vw_gdpr_compliance_export" queryout "' 
    + @outputPath + '" -c -t, -T -S ' + @@SERVERNAME;
    
EXEC xp_cmdshell @sql;
*/

-- ALTERNATIVE: Create a view for easier exports
-- CREATE VIEW vw_gdpr_compliance_export AS
-- [Paste the entire Version 2 query here]
-- Then export view results using your preferred tool

-- ============================================================================
-- PRIORITY THRESHOLDS EXPLAINED:
-- ============================================================================
-- 
-- THRESHOLD DESIGN RATIONALE:
-- These thresholds balance sensitivity with practicality
-- 
-- CRITICAL: 3+ high-risk columns OR risk score ≥ 40
--   Why: Multiple high-risk columns indicate concentrated PII exposure
--        Risk score ≥40 means >40% of schema contains weighted PII
--   Action Required: Immediate dictionary creation, data masking, audit logging
--   SLA: 1 week for initial assessment and protection
-- 
-- HIGH: Any high-risk columns OR risk score ≥ 15
--   Why: Even single high-risk columns (email, SSN) require protection
--        15% threshold catches schemas with moderate PII concentration
--   Action Required: Priority dictionary creation, access controls
--   SLA: 2-3 weeks for protection implementation
-- 
-- MEDIUM: Any medium-risk columns OR risk score ≥ 5
--   Why: Medium-risk columns (names, addresses) still qualify as personal data
--        5% threshold includes schemas with sparse but present PII
--   Action Required: Standard dictionary creation, monitoring
--   SLA: 4-6 weeks for implementation
-- 
-- LOW: Only low-risk columns OR risk score < 5
--   Why: Minimal PII exposure, primarily operational data
--   Action Required: Routine monitoring, no immediate action
--   SLA: Quarterly review cycle
-- 
-- ============================================================================

-- ============================================================================
-- VERSION 3: DETAILED TABLE-LEVEL ANALYSIS (Deep Dive)
-- ============================================================================
-- USE CASE: Detailed analysis of specific schemas
-- OUTPUT: Table-by-table breakdown with specific PII column identification
-- RUNTIME: Fast for single schema (~3-5 seconds)
-- WHEN TO USE: After Version 2 identifies high-priority schemas
-- 
-- INSTRUCTIONS: 
-- 1. Replace 'Person' with your target schema name (see the WHERE clause marked 'CHANGE SCHEMA NAME HERE' below)
-- 2. Run query to see table-level detail
-- 3. Use results to plan dictionary terms and regex patterns
-- ============================================================================

WITH TablePIIAnalysis AS (
    SELECT 
        SCHEMA_NAME(t.schema_id) AS schema_name,
        t.name AS table_name,
        t.create_date,
        t.modify_date,
        
        -- Table metrics
        SUM(p.rows) AS row_count,
        SUM(CAST(a.total_pages AS BIGINT) * 8) / 1024.0 AS size_mb,
        COUNT(DISTINCT c.column_id) AS column_count,
        
        -- PII COLUMN DETECTION WITH TYPE LABELS
        -- Creates human-readable list of PII columns with their types
        -- Format: "ColumnName (TYPE)", "ColumnName (TYPE)"
        -- This output is perfect for:
        --   - Building dictionary CSV files
        --   - Creating regex patterns
        --   - Documentation and audit reports
        STRING_AGG(
            CASE 
                -- Email/Mail columns: Direct contact, unique identifier
                WHEN c.name LIKE '%Email%' OR c.name LIKE '%Mail%' 
                    THEN c.name + ' (EMAIL)'
                    
                -- Phone columns: Contact information
                WHEN c.name LIKE '%Phone%' OR c.name LIKE '%Mobile%' OR c.name LIKE '%Fax%' 
                    THEN c.name + ' (PHONE)'
                    
                -- Credential columns: Critical security data
                WHEN c.name LIKE '%Password%' OR c.name LIKE '%PWD%' 
                    THEN c.name + ' (PASSWORD)'
                    
                -- National ID columns: Government identifiers
                WHEN c.name LIKE '%SSN%' OR c.name LIKE '%National%ID%' 
                    THEN c.name + ' (NATIONAL_ID)'
                    
                -- Financial columns: Payment information
                WHEN c.name LIKE '%Credit%Card%' 
                    THEN c.name + ' (CREDIT_CARD)'
                    
                -- Name columns: Personal identifiers
                WHEN c.name LIKE '%FirstName%' OR c.name LIKE '%LastName%' 
                    OR c.name LIKE '%MiddleName%' 
                    THEN c.name + ' (NAME)'
                    
                -- Address columns: Location data
                WHEN c.name LIKE '%Address%' OR c.name LIKE '%Street%' 
                    THEN c.name + ' (ADDRESS)'
                    
                -- Birth date columns: Age-sensitive data
                WHEN c.name LIKE '%BirthDate%' OR c.name LIKE '%DOB%' 
                    THEN c.name + ' (BIRTHDATE)'
                    
                -- Location columns: Geographic identifiers
                WHEN c.name LIKE '%City%' OR c.name LIKE '%Postal%' OR c.name LIKE '%Zip%' 
                    THEN c.name + ' (LOCATION)'
            END, 
            ', '  -- Comma separator for readability
        ) AS potential_pii_columns,
        
        -- RELATIONSHIP COMPLEXITY ANALYSIS
        -- Understanding relationships helps with:
        --   - Data lineage mapping
        --   - Impact assessment for data changes
        --   - Identifying data flow patterns
        
        -- Outbound relationships: Tables this table depends on
        (SELECT COUNT(*) 
         FROM sys.foreign_keys fk 
         WHERE fk.parent_object_id = t.object_id) AS outbound_relationships,
         
        -- Inbound relationships: Tables that depend on this table
        (SELECT COUNT(*) 
         FROM sys.foreign_keys fk 
         WHERE fk.referenced_object_id = t.object_id) AS inbound_relationships
    FROM sys.tables t
    LEFT JOIN sys.partitions p ON t.object_id = p.object_id AND p.index_id IN (0, 1)
    LEFT JOIN sys.allocation_units a ON p.partition_id = a.container_id
    LEFT JOIN sys.columns c ON t.object_id = c.object_id
    WHERE SCHEMA_NAME(t.schema_id) = 'Person'  -- ⚠️ CHANGE SCHEMA NAME HERE
    GROUP BY SCHEMA_NAME(t.schema_id), t.name, t.create_date, t.modify_date, t.object_id
)
SELECT
    schema_name,
    table_name,
    row_count,
    CAST(size_mb AS DECIMAL(10,2)) AS size_mb,
    column_count,
    potential_pii_columns,
    outbound_relationships,
    inbound_relationships,
    (outbound_relationships + inbound_relationships) AS total_relationships,
    
    -- TABLE-LEVEL PRIORITIZATION
    -- Assigns priority to individual tables within a schema
    -- Useful for sequencing dictionary creation work
    CASE 
        WHEN potential_pii_columns LIKE '%PASSWORD%' 
            OR potential_pii_columns LIKE '%CREDIT_CARD%' 
            OR potential_pii_columns LIKE '%NATIONAL_ID%' 
        THEN 'CRITICAL'
        WHEN potential_pii_columns LIKE '%EMAIL%' 
            OR potential_pii_columns LIKE '%BIRTHDATE%' 
            OR potential_pii_columns LIKE '%ADDRESS%' 
        THEN 'HIGH'
        WHEN potential_pii_columns LIKE '%NAME%' 
            OR potential_pii_columns LIKE '%PHONE%' 
        THEN 'MEDIUM'
        WHEN potential_pii_columns IS NOT NULL 
        THEN 'LOW'
        ELSE 'NONE'
    END AS pii_priority,
    
    -- Data freshness indicator
    DATEDIFF(DAY, modify_date, GETDATE()) AS days_since_modified,
    create_date,
    modify_date
FROM TablePIIAnalysis
ORDER BY 
    -- Sort by priority level first
    CASE 
        WHEN potential_pii_columns LIKE '%PASSWORD%' OR potential_pii_columns LIKE '%CREDIT_CARD%' THEN 1
        WHEN potential_pii_columns LIKE '%EMAIL%' OR potential_pii_columns LIKE '%BIRTHDATE%' THEN 2
        WHEN potential_pii_columns LIKE '%NAME%' OR potential_pii_columns LIKE '%PHONE%' THEN 3
        WHEN potential_pii_columns IS NOT NULL THEN 4
        ELSE 5
    END,
    -- Then by row count (larger tables first)
    row_count DESC;

-- ============================================================================
-- VERSION 4: EXPORT-READY FORMAT (For Documentation & Reports)
-- ============================================================================
-- USE CASE: Clean output for compliance reports and executive summaries
-- OUTPUT: Formatted data ready for Excel export or compliance documentation
-- RUNTIME: Fast (<10 seconds)
-- WHEN TO USE: Monthly/Quarterly compliance reports, board presentations
-- 
-- KEY FEATURES:
--   - Human-readable number formatting (with commas)
--   - Binary indicators (Yes/No for PII presence)
--   - Clean column names for non-technical audiences
--   - Risk scores for quantitative tracking
-- ============================================================================

WITH SchemaInventory AS (
    SELECT
        -- Clean, business-friendly column aliases
        s.name AS [Schema Name],
        COUNT(DISTINCT t.name) AS [Table Count],
        
        -- Format numbers with thousand separators for readability
        -- FORMAT function converts to string with localization
        FORMAT(SUM(p.rows), 'N0') AS [Total Rows],
        
        -- Size in MB with 2 decimal places
        CAST(SUM(CAST(a.total_pages AS BIGINT) * 8) / 1024.0 AS DECIMAL(10,2)) AS [Size (MB)],
        COUNT(DISTINCT c.column_id) AS [Total Columns],
        
        -- Calculate risk score using same formula as Version 2
        -- Stored in CTE for reuse in priority calculation
        CAST(
            (COUNT(DISTINCT CASE 
                WHEN c.name LIKE '%Email%' OR c.name LIKE '%Phone%' 
                  OR c.name LIKE '%SSN%' OR c.name LIKE '%Password%'
                  OR c.name LIKE '%Credit%' OR c.name LIKE '%National%ID%'
                THEN c.column_id END) * 10.0 +
             COUNT(DISTINCT CASE 
                WHEN c.name LIKE '%FirstName%' OR c.name LIKE '%LastName%' 
                  OR c.name LIKE '%Address%' OR c.name LIKE '%BirthDate%'
                THEN c.column_id END) * 5.0 +
             COUNT(DISTINCT CASE 
                WHEN c.name LIKE '%Title%' OR c.name LIKE '%Gender%'
                THEN c.column_id END) * 2.0) / 
            NULLIF(COUNT(DISTINCT c.column_id), 0) * 100
        AS DECIMAL(5,2)) AS risk_score,
        
        -- Business-friendly data classification
        CASE 
            WHEN s.name LIKE '%Person%' OR s.name LIKE '%Customer%' 
                THEN 'Contains Personal Data'
            WHEN s.name LIKE '%HR%' OR s.name LIKE '%Human%' OR s.name LIKE '%Employee%' 
                THEN 'Contains Employee Data'
            WHEN s.name LIKE '%Sales%' 
                THEN 'Contains Transaction Data'
            ELSE 'Operational Data'
        END AS [Data Classification],
        
        -- Binary PII indicator for quick scanning
        CASE 
            WHEN COUNT(DISTINCT CASE 
                WHEN c.name LIKE '%Email%' OR c.name LIKE '%Phone%' 
                  OR c.name LIKE '%SSN%' OR c.name LIKE '%Password%'
                THEN c.column_id END) > 0 
            THEN 'Yes'
            ELSE 'No'
        END AS [Contains PII],
        
        -- Priority calculation with updated thresholds
        CASE 
            WHEN COUNT(DISTINCT CASE 
                WHEN c.name LIKE '%Email%' OR c.name LIKE '%Phone%' 
                  OR c.name LIKE '%SSN%' OR c.name LIKE '%Password%'
                THEN c.column_id END) >= 3 
              OR CAST((COUNT(DISTINCT CASE 
                    WHEN c.name LIKE '%Email%' OR c.name LIKE '%Phone%' 
                      OR c.name LIKE '%SSN%' OR c.name LIKE '%Password%'
                    THEN c.column_id END) * 10.0) / 
                   NULLIF(COUNT(DISTINCT c.column_id), 0) * 100 AS DECIMAL(5,2)) >= 40
            THEN 'CRITICAL'
            WHEN COUNT(DISTINCT CASE 
                WHEN c.name LIKE '%Email%' OR c.name LIKE '%Phone%' 
                  OR c.name LIKE '%SSN%' OR c.name LIKE '%Password%'
                THEN c.column_id END) > 0
              OR CAST((COUNT(DISTINCT CASE 
                    WHEN c.name LIKE '%Email%' OR c.name LIKE '%Phone%' 
                      OR c.name LIKE '%SSN%' OR c.name LIKE '%Password%'
                    THEN c.column_id END) * 10.0) / 
                   NULLIF(COUNT(DISTINCT c.column_id), 0) * 100 AS DECIMAL(5,2)) >= 15
            THEN 'HIGH'
            WHEN COUNT(DISTINCT CASE 
                WHEN c.name LIKE '%FirstName%' OR c.name LIKE '%Address%'
                THEN c.column_id END) > 0
            THEN 'MEDIUM'
            ELSE 'LOW'
        END AS [Priority],
        
        -- Formatted date for reports
        FORMAT(MAX(t.modify_date), 'yyyy-MM-dd') AS [Last Modified],
        
        -- Semicolon-separated list (easier to read than commas in CSV)
        STRING_AGG(t.name, '; ') WITHIN GROUP (ORDER BY t.name) AS [Tables]
    FROM sys.schemas s
    LEFT JOIN sys.tables t ON s.schema_id = t.schema_id
    LEFT JOIN sys.partitions p ON t.object_id = p.object_id AND p.index_id IN (0, 1)
    LEFT JOIN sys.allocation_units a ON p.partition_id = a.container_id
    LEFT JOIN sys.columns c ON t.object_id = c.object_id
    WHERE s.name NOT IN ('sys', 'INFORMATION_SCHEMA', 'guest')
    GROUP BY s.name
)
SELECT 
    [Schema Name],
    [Table Count],
    [Total Rows],
    [Size (MB)],
    [Total Columns],
    [Data Classification],
    [Contains PII],
    [Priority],
    CAST(risk_score AS DECIMAL(5,2)) AS [Risk Score],
    [Last Modified],
    [Tables]
FROM SchemaInventory
ORDER BY 
    -- Sort by priority for executive summary
    CASE [Priority] 
        WHEN 'CRITICAL' THEN 1 
        WHEN 'HIGH' THEN 2 
        WHEN 'MEDIUM' THEN 3 
        ELSE 4 
    END,
    -- Then by data volume
    CAST(REPLACE([Total Rows], ',', '') AS BIGINT) DESC;

-- ============================================================================
-- COMPREHENSIVE USAGE GUIDE & RECOMMENDATIONS:
-- ============================================================================
-- 
-- WORKFLOW SEQUENCE:
-- 
-- 1. INITIAL ASSESSMENT (Week 1)
--    → Run VERSION 2 (Compliance-Focused)
--    → Identify CRITICAL and HIGH priority schemas
--    → Document baseline risk scores
--    → Present findings to stakeholders
-- 
-- 2. DEEP DIVE ANALYSIS (Week 2-3)
--    → Run VERSION 3 (Detailed) for each CRITICAL schema
--    → List specific PII columns discovered
--    → Map data relationships and dependencies
--    → Create data flow diagrams
-- 
-- 3. DICTIONARY PLANNING (Week 3-4)
--    → Use VERSION 3 results to build dictionary CSV files
--    → Extract actual values for dictionary terms
--    → Create regex patterns based on column names
--    → Define confidence scores and conditions
-- 
-- 4. COMPLIANCE REPORTING (Ongoing)
--    → Run VERSION 4 (Export-Ready) monthly/quarterly
--    → Track risk score trends over time
--    → Generate executive compliance reports
--    → Document remediation progress
-- 
-- 5. ROUTINE MONITORING (Ongoing)
--    → Run VERSION 1 (Basic Enhanced) weekly
--    → Alert on new tables in Person/HR schemas
--    → Monitor schema modification dates
--    → Track data volume growth
-- 
-- INTEGRATION RECOMMENDATIONS:
-- 
-- - Schedule VERSION 2 as SQL Agent Job (monthly)
-- - Create alerts for risk score increases >10 points (a sketch follows after the end of this script)
-- - Integrate results with data governance dashboard
-- - Link to data lineage visualization tools
-- - Feed into data catalog metadata
-- 
-- CUSTOMIZATION TIPS:
-- 
-- - Add industry-specific PII patterns (HIPAA, PCI-DSS, etc.)
-- - Adjust risk score weights based on business impact
-- - Modify priority thresholds to match org risk appetite
-- - Extend data categories for domain-specific classification
-- - Add columns for data retention policies
-- 
-- ============================================================================
-- PII DETECTION PATTERNS REFERENCE:
-- ============================================================================
-- 
-- HIGH RISK PATTERNS (Weight: 10x):
--   %Email%, %Mail%           → Email addresses (unique identifiers)
--   %Phone%, %Mobile%, %Fax%  → Phone numbers (direct contact)
--   %SSN%, %National%ID%      → Government identifiers
--   %Password%, %PWD%         → Authentication credentials
--   %Credit%, %Card%          → Payment card information
-- 
-- MEDIUM RISK PATTERNS (Weight: 5x):
--   %FirstName%, %LastName%   → Personal names
--   %MiddleName%              → Additional name identifiers
--   %Address%, %Street%       → Physical addresses
--   %City%, %PostalCode%      → Location data
--   %BirthDate%, %DOB%        → Age-sensitive information
-- 
-- LOW RISK PATTERNS (Weight: 2x):
--   %Title%                   → Honorifics (Mr., Dr., etc.)
--   %Suffix%                  → Name suffixes (Jr., Sr.)
--   %Gender%, %Sex%           → Demographic data
--   %Marital%                 → Marital status
-- 
-- TO EXTEND PATTERNS:
-- 1. Add new CASE statements in PIIIndicators CTE
-- 2. Update risk score calculation if adding new risk levels
-- 3. Document new patterns in this reference section
-- 4. Test on sample schemas before production deployment
-- 
-- EXAMPLES OF EXTENSIONS:
-- - HIPAA: %Diagnosis%, %Prescription%, %Treatment%
-- - PCI-DSS: %CVV%, %CardHolder%, %Expiry%
-- - Industry: %Contract%, %Agreement%, %NDA%
-- 
-- ============================================================================
-- TROUBLESHOOTING:
-- ============================================================================
-- 
-- Issue: Row counts showing as 0
-- Solution: Run UPDATE STATISTICS or REBUILD indexes
-- 
-- Issue: STRING_AGG function not found
-- Solution: Upgrade to SQL Server 2017+ or use FOR XML PATH alternative
-- 
-- Issue: Performance slow on large databases
-- Solution: Add WHERE clause to filter specific schemas
--           Create indexed view on sys.columns for better performance
-- 
-- Issue: Risk scores seem too high/low
-- Solution: Adjust weights in risk score formula (currently 10/5/2)
--           Review and update PII pattern matches
-- 
-- Issue: Missing expected schemas in output
-- Solution: Check the schema name against the exclusion list in the WHERE clause (s.name NOT IN (...))
--           Verify schema contains actual tables (not just empty schema)
-- 
-- ============================================================================
-- END OF SCRIPT
-- ============================================================================
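
As a sketch of the alerting recommendation in the usage guide above ("alert on risk score increases >10 points"), the query below compares each schema's two most recent snapshots in the hypothetical dbo.pii_risk_baseline table introduced earlier on this page; adjust names and thresholds to your environment:

-- Sketch: flag schemas whose latest pii_risk_score rose more than 10 points
-- compared with the previous snapshot (assumes the hypothetical dbo.pii_risk_baseline table).
WITH ranked AS (
    SELECT
        schema_name,
        captured_on,
        pii_risk_score,
        ROW_NUMBER() OVER (PARTITION BY schema_name ORDER BY captured_on DESC) AS rn
    FROM dbo.pii_risk_baseline
)
SELECT
    cur.schema_name,
    prev.pii_risk_score AS previous_score,
    cur.pii_risk_score  AS current_score,
    cur.pii_risk_score - prev.pii_risk_score AS score_increase
FROM ranked cur
INNER JOIN ranked prev
    ON prev.schema_name = cur.schema_name AND prev.rn = 2
WHERE cur.rn = 1
  AND cur.pii_risk_score - prev.pii_risk_score > 10;  -- alert threshold from the guide above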
