Personal Data Identifier Dictionary (GDPR Compliance)
This comprehensive workshop demonstrates how to create data dictionaries that not only improve data discovery but also ensure regulatory compliance and support robust data governance frameworks.
Using Microsoft's AdventureWorks2022 database as our foundation, participants will learn to build data dictionaries that meet the stringent requirements of GDPR, SOX, and other regulatory frameworks while establishing best practices for ongoing metadata management.
By the end of this workshop, you will be able to:
Create comprehensive data dictionaries that support regulatory compliance requirements
Understand the relationship between data dictionaries, data lineage, and metadata management
Implement data classification schemes that meet industry-specific regulatory standards
Establish processes for dictionary maintenance that ensure ongoing compliance
Integrate dictionaries with broader data governance and security frameworks
Design audit trails and documentation that satisfy regulatory requirements
Phase 1: Pattern-Based Discovery (This Workshop)
Structured database PII identification
Implements dictionary-based identification using column name patterns, regex matching, and metadata hints. Focuses on structured data within relational databases. Establishes foundational governance processes, documentation standards, and automated tagging mechanisms.
• Data dictionaries for all PII categories • Automated tagging rules • GDPR compliance documentation • Initial data inventory • Governance procedures
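The column-name pattern matching at the heart of Phase 1 can be sketched in a few lines of T-SQL. This is only an illustration (the patterns shown are a small, non-exhaustive subset); the full workshop scripts below add risk weighting, scoring, and prioritization:

```sql
-- Minimal sketch of pattern-based PII discovery: flag columns whose names
-- match a handful of common PII naming patterns (illustrative only)
SELECT TABLE_SCHEMA, TABLE_NAME, COLUMN_NAME
FROM INFORMATION_SCHEMA.COLUMNS
WHERE COLUMN_NAME LIKE '%Email%'
   OR COLUMN_NAME LIKE '%Phone%'
   OR COLUMN_NAME LIKE '%BirthDate%'
ORDER BY TABLE_SCHEMA, TABLE_NAME, COLUMN_NAME;
```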
Phase 2: Content Inspection (Future)
Actual data value analysis
Expands beyond metadata to inspect actual data values using sampling, statistical analysis, and machine learning. Detects PII in free-text fields, comments, and unstructured content. Implements Named Entity Recognition (NER) for contextual PII identification.
• Content inspection rules • ML-based classifiers • False positive reduction • Unstructured data coverage • Enhanced accuracy metrics
Phase 3: Data Flow Mapping (Future)
End-to-end lineage tracking
Maps PII movement through ETL pipelines, APIs, reports, exports, and data integrations. Identifies downstream systems receiving PII. Tracks data transformations, aggregations, and derivations to understand complete data lifecycle.
• Complete data lineage maps • API/interface PII exposure analysis • ETL pipeline documentation • Report/dashboard PII tracking • Cross-system impact analysis
Phase 4: Privacy Operations Integration (Future)
Enterprise-wide privacy ecosystem
Integrates PII identification with data masking, access controls, consent management, and data subject request fulfillment. Implements automated breach notification scope assessment, retention policy enforcement, and continuous compliance monitoring.
• Integrated masking policies • Automated DSR fulfillment • Consent tracking integration • Breach assessment automation • Real-time compliance dashboards
GDPR - PII Dictionary
This workshop comprehensively covers Phase 1, providing all necessary resources, templates, procedures, and documentation to successfully implement pattern-based PII discovery.
Database Inventory
Objective: Create a complete inventory of all database schemas and tables to ensure comprehensive coverage.
Why This Matters: GDPR compliance requires documenting all data sources within the organization. This inventory forms the foundation for regulatory compliance.
Schema Analysis
The first task in our journey is to conduct a comprehensive database inventory. A complete inventory of all data assets is the foundation of data governance: it documents every data source in the organization, which GDPR compliance requires.
For Phase 1 we take the traditional route and run SQL scripts; later we'll look at a project that leverages ML.
Why This Matters: This enhanced inventory not only catalogs schemas but automatically identifies PII exposure, calculates risk scores, and prioritizes compliance efforts based on actual data sensitivity.

| Schema | Tables | Total Rows | Size (MB) | High-Risk PII Cols | Medium-Risk PII Cols | PII Risk Score | Priority | Data Category |
|---|---|---|---|---|---|---|---|---|
| Person | 13 | ~1,755,295 | ~460 | 8 | 13 | 212.86 | CRITICAL | Personal Data |
| HumanResources | 6 | ~8,179 | ~6.5 | 1 | 1 | 52.5 | CRITICAL | Employee Data |
| Sales | 19 | ~2,658,063 | ~287 | 4 | 2 | 36.5 | HIGH | Transaction Data |
| Purchasing | 5 | ~155,373 | ~16 | 1 | 0 | 20.41 | HIGH | Operational Data |
| Production | 25 | ~3,417,306 | ~251 | 1 | 1 | 10.06 | MEDIUM | Operational Data |
| dbo | 3 | ~38,308 | ~51 | 0 | 0 | 0.00 | LOW | System Data |
Understanding the Output
high_risk_pii_columns: Count of columns containing emails, phones, SSN, passwords, credit cards
medium_risk_pii_columns: Count of columns with names, addresses, birth dates, postal codes
low_risk_pii_columns: Count of columns with titles, suffixes, gender, marital status
pii_risk_score: Weighted calculation (High×10 + Medium×5 + Low×2) ÷ Total Columns × 100
compliance_priority: Automatic prioritization using the following thresholds:
CRITICAL: 3+ high-risk columns OR risk score ≥ 40
HIGH: Any high-risk columns OR risk score ≥ 15
MEDIUM: Any medium-risk columns OR risk score ≥ 5
LOW: Only low-risk columns OR risk score < 5
data_category: Automatic classification based on schema naming patterns
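The weighted formula can be sanity-checked in isolation. The column counts in this snippet are hypothetical, purely to show the arithmetic:

```sql
-- Worked example of the risk score formula with hypothetical counts:
-- 2 high-risk, 3 medium-risk, 1 low-risk column out of 20 total columns
DECLARE @high INT = 2, @medium INT = 3, @low INT = 1, @total_cols INT = 20;
SELECT CAST((@high * 10.0 + @medium * 5.0 + @low * 2.0)
            / NULLIF(@total_cols, 0) * 100 AS DECIMAL(5,2)) AS pii_risk_score;
-- (20 + 15 + 2) / 20 * 100 = 185.00, which lands in CRITICAL (score >= 40)
```

Note that PII-dense schemas can score above 100, as the Person schema's 212.86 demonstrates; the score is a weighted density, not a capped percentage.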
Key Insights
Person schema is CRITICAL priority with a PII risk score of 212.86 - contains 8 high-risk and 13 medium-risk PII columns spanning 1.7M+ records. This should be your primary focus for dictionary creation.
HumanResources schema also requires immediate CRITICAL attention with a risk score of 52.5 due to employee data sensitivity.
Sales and Purchasing schemas are HIGH priority containing customer contact information and vendor data.
Production schema has MEDIUM priority with minimal PII exposure (1 high, 1 medium risk column).
dbo schema shows LOW priority with no PII columns detected - likely contains system/configuration data.
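A quick way to spot-check these insights is to look at one flagged table directly. The query below assumes the standard AdventureWorks Person.Person and Person.EmailAddress tables:

```sql
-- Spot-check: confirm the Person schema really holds the flagged PII
SELECT TOP (5) p.BusinessEntityID, p.FirstName, p.LastName, e.EmailAddress
FROM Person.Person AS p
JOIN Person.EmailAddress AS e
  ON e.BusinessEntityID = p.BusinessEntityID;
```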
Deliverables:
• Export this result set to Excel for reference throughout the project
• Use the compliance_priority column to sequence your dictionary creation work
• Document the pii_risk_score baseline for quarterly tracking of new PII exposure
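To make quarterly baseline tracking concrete, the scores can be persisted to a small history table. The table and column names here are illustrative, not part of AdventureWorks:

```sql
-- Illustrative baseline table for quarterly PII risk tracking
CREATE TABLE dbo.PiiRiskBaseline (
    snapshot_date DATE NOT NULL DEFAULT (CAST(GETDATE() AS DATE)),
    schema_name SYSNAME NOT NULL,
    pii_risk_score DECIMAL(5,2) NOT NULL,
    compliance_priority VARCHAR(10) NOT NULL
);

-- After each run of the Version 2 query, record the current scores, e.g.:
INSERT INTO dbo.PiiRiskBaseline (schema_name, pii_risk_score, compliance_priority)
VALUES ('Person', 212.86, 'CRITICAL'),
       ('HumanResources', 52.50, 'CRITICAL');
```

Comparing snapshots quarter over quarter then surfaces newly introduced PII exposure as a simple score delta per schema.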
In DBeaver run the following script:
-- ============================================================================
-- ENHANCED SCHEMA ANALYSIS FOR GDPR COMPLIANCE - COMPLETE EDITION
-- ============================================================================
-- Purpose: Comprehensive database inventory with automated PII risk indicators
-- Target: SQL Server (AdventureWorks 2022 or similar)
-- Version: 2.1 (Updated with improved priority logic and detailed comments)
-- Author: Data Governance Team
-- Last Updated: 2025-10-04
-- ============================================================================
--
-- WHAT THIS SCRIPT DOES:
-- - Automatically scans all database schemas for PII-related columns
-- - Calculates risk scores based on column naming patterns
-- - Assigns compliance priorities (CRITICAL/HIGH/MEDIUM/LOW)
-- - Provides data volume metrics and relationship analysis
-- - Generates export-ready reports for compliance documentation
--
-- PREREQUISITES:
-- - SQL Server 2017+ for STRING_AGG (a STUFF/FOR XML alternative is included for 2016 and earlier)
-- - Read permissions on sys schema catalog views
-- - Database with profiled/populated tables for accurate row counts
--
-- ============================================================================
-- ============================================================================
-- VERSION 1: BASIC ENHANCED (Quick Overview with Size Metrics)
-- ============================================================================
-- USE CASE: Quick schema overview with data volume metrics
-- OUTPUT: Schema summary with row counts, sizes, and last modified dates
-- RUNTIME: Fast (<5 seconds for most databases)
-- WHEN TO USE: Daily monitoring, quick health checks, initial assessment
-- ============================================================================
SELECT
-- Schema name from system catalog
s.name AS schema_name,
-- Count distinct tables in each schema
COUNT(DISTINCT t.name) AS table_count,
-- Sum all rows across tables in schema
-- Uses sys.partitions row metadata; the joins to allocation units and
-- columns can fan out rows, so treat these totals as approximate (~)
SUM(p.rows) AS total_rows,
-- Calculate total storage size in MB
-- total_pages = 8KB pages, convert to MB by: (pages * 8) / 1024
SUM(CAST(a.total_pages AS BIGINT) * 8) / 1024.0 AS total_size_mb,
-- Count all columns across all tables in schema
COUNT(DISTINCT c.column_id) AS total_columns,
-- Get most recent modification date across all tables
MAX(t.modify_date) AS last_modified,
-- Create ordered, comma-separated list of tables
-- Note: STRING_AGG requires SQL Server 2017+. See alternative below for older versions.
STRING_AGG(t.name, ', ') WITHIN GROUP (ORDER BY t.name) AS table_list
-- ALTERNATIVE for SQL Server 2016 and earlier:
-- STUFF((SELECT ', ' + t2.name
-- FROM sys.tables t2
-- WHERE t2.schema_id = s.schema_id
-- ORDER BY t2.name
-- FOR XML PATH(''), TYPE).value('.', 'NVARCHAR(MAX)'), 1, 2, '') AS table_list
FROM sys.schemas s
-- LEFT JOIN ensures we see schemas even if they have no tables
LEFT JOIN sys.tables t ON s.schema_id = t.schema_id
-- Get row counts from partitions (index 0=heap, 1=clustered index)
LEFT JOIN sys.partitions p ON t.object_id = p.object_id AND p.index_id IN (0, 1)
-- Get storage allocation data for size calculations
LEFT JOIN sys.allocation_units a ON p.partition_id = a.container_id
-- Get column information for column counts
LEFT JOIN sys.columns c ON t.object_id = c.object_id
-- Filter out system schemas that aren't relevant for GDPR analysis
WHERE s.name NOT IN ('sys', 'INFORMATION_SCHEMA', 'guest', 'db_owner', 'db_accessadmin',
'db_securityadmin', 'db_ddladmin', 'db_backupoperator', 'db_datareader',
'db_datawriter', 'db_denydatareader', 'db_denydatawriter')
GROUP BY s.name
ORDER BY
-- Custom sorting: PII-sensitive schemas first, then by row count
CASE
WHEN s.name = 'Person' THEN 1 -- Highest priority
WHEN s.name LIKE '%HR%' OR s.name LIKE '%Human%' THEN 2
WHEN s.name LIKE '%Customer%' OR s.name LIKE '%Sales%' THEN 3
ELSE 4 -- Everything else
END,
total_rows DESC; -- Within same priority level, show largest first
-- ============================================================================
-- VERSION 2: COMPLIANCE-FOCUSED (⭐ RECOMMENDED FOR GDPR ANALYSIS)
-- ============================================================================
-- USE CASE: Complete PII risk assessment with automated prioritization
-- OUTPUT: Schema-level PII detection, risk scoring, and compliance priority
-- RUNTIME: Medium (~15-30 seconds for large databases)
-- WHEN TO USE: Quarterly compliance reviews, GDPR audits, dictionary planning
--
-- KEY FEATURES:
-- - Automatic PII column detection across 3 risk levels (high/medium/low)
-- - Calculated PII risk score (0-100 scale)
-- - Smart compliance prioritization (CRITICAL/HIGH/MEDIUM/LOW)
-- - Automatic data category classification
-- - Relationship complexity analysis
--
-- CSV EXPORT INSTRUCTIONS:
-- Method 1 (DBeaver): Run query → Right-click results → Export → CSV
-- Method 2 (SSMS): Run query → Right-click results → Save Results As → CSV
-- Method 3 (Command Line): Use the export wrapper at the bottom of this script
-- ============================================================================
-- CTE 1: SchemaMetrics - Gather basic metrics for all tables
WITH SchemaMetrics AS (
SELECT
s.name AS schema_name,
t.name AS table_name,
-- Row count from partition statistics
-- Much faster than COUNT(*), but approximate: the allocation-unit and
-- column joins can fan out rows
SUM(p.rows) AS row_count,
-- Table size calculation in MB
-- total_pages includes data + indexes + LOB storage
SUM(CAST(a.total_pages AS BIGINT) * 8) / 1024.0 AS size_mb,
-- Column count per table
COUNT(DISTINCT c.column_id) AS column_count,
-- Metadata timestamps for change tracking
t.modify_date,
t.create_date
FROM sys.schemas s
INNER JOIN sys.tables t ON s.schema_id = t.schema_id
LEFT JOIN sys.partitions p ON t.object_id = p.object_id AND p.index_id IN (0, 1)
LEFT JOIN sys.allocation_units a ON p.partition_id = a.container_id
LEFT JOIN sys.columns c ON t.object_id = c.object_id
-- Exclude system schemas that have no business/personal data
WHERE s.name NOT IN ('sys', 'INFORMATION_SCHEMA', 'guest', 'db_owner', 'db_accessadmin',
'db_securityadmin', 'db_ddladmin', 'db_backupoperator', 'db_datareader',
'db_datawriter', 'db_denydatareader', 'db_denydatawriter')
GROUP BY s.name, t.name, t.modify_date, t.create_date
),
-- CTE 2: PIIIndicators - Detect PII columns using naming pattern analysis
-- This CTE categorizes columns into risk levels based on GDPR sensitivity
PIIIndicators AS (
SELECT
SCHEMA_NAME(t.schema_id) AS schema_name,
t.name AS table_name,
-- HIGH RISK COLUMNS (PII_CRITICAL)
-- These columns typically contain highly sensitive personal data
-- GDPR Articles 4, 9 - Special categories requiring explicit consent
COUNT(DISTINCT CASE
WHEN c.name LIKE '%Email%' OR c.name LIKE '%Mail%' THEN c.column_id
-- Email addresses: Direct contact, unique identifier
WHEN c.name LIKE '%Phone%' OR c.name LIKE '%Mobile%' THEN c.column_id
-- Phone numbers: Direct contact information
WHEN c.name LIKE '%SSN%' OR c.name LIKE '%National%ID%' THEN c.column_id
-- National identifiers: Government ID, highly sensitive
WHEN c.name LIKE '%Password%' OR c.name LIKE '%PWD%' THEN c.column_id
-- Credentials: Security risk if exposed
WHEN c.name LIKE '%Credit%' OR c.name LIKE '%Card%' THEN c.column_id
-- Financial data: Payment card information
END) AS high_risk_columns,
-- MEDIUM RISK COLUMNS (PII_MEDIUM)
-- Personal identifiers that combined can identify individuals
-- GDPR Article 4(1) - Personal data definition
COUNT(DISTINCT CASE
WHEN c.name LIKE '%FirstName%' OR c.name LIKE '%LastName%' THEN c.column_id
-- Names: Core personal identifiers
WHEN c.name LIKE '%MiddleName%' THEN c.column_id
-- Middle names: Additional personal identifier
WHEN c.name LIKE '%Address%' OR c.name LIKE '%Street%' THEN c.column_id
-- Physical addresses: Location data
WHEN c.name LIKE '%City%' OR c.name LIKE '%PostalCode%' OR c.name LIKE '%Zip%' THEN c.column_id
-- Location identifiers: Geographic personal data
WHEN c.name LIKE '%BirthDate%' OR c.name LIKE '%DOB%' THEN c.column_id
-- Birth dates: Age-sensitive personal information
END) AS medium_risk_columns,
-- LOW RISK COLUMNS (PII_LOW)
-- Indirect identifiers, less sensitive but still personal data
COUNT(DISTINCT CASE
WHEN c.name LIKE '%Title%' AND SCHEMA_NAME(t.schema_id) = 'Person' THEN c.column_id
-- Titles (Mr., Mrs., Dr.): Personal honorifics
WHEN c.name LIKE '%Suffix%' THEN c.column_id
-- Name suffixes (Jr., Sr., III): Family relationships
WHEN c.name LIKE '%Gender%' OR c.name LIKE '%Sex%' THEN c.column_id
-- Gender: Demographic personal data
WHEN c.name LIKE '%Marital%' THEN c.column_id
-- Marital status: Personal demographic information
END) AS low_risk_columns
FROM sys.tables t
INNER JOIN sys.columns c ON t.object_id = c.object_id
WHERE SCHEMA_NAME(t.schema_id) NOT IN ('sys', 'INFORMATION_SCHEMA')
GROUP BY SCHEMA_NAME(t.schema_id), t.name
),
-- CTE 3: RelationshipCount - Analyze foreign key relationships
-- Used to understand data lineage and impact of schema changes
RelationshipCount AS (
SELECT
SCHEMA_NAME(t.schema_id) AS schema_name,
t.name AS table_name,
-- Outbound FKs: FK constraints this table declares (tables it references)
(SELECT COUNT(*) FROM sys.foreign_keys fk
WHERE fk.parent_object_id = t.object_id) AS outbound_fk_count,
-- Inbound FKs: FK constraints in other tables that reference this table
(SELECT COUNT(*) FROM sys.foreign_keys fk
WHERE fk.referenced_object_id = t.object_id) AS inbound_fk_count
FROM sys.tables t
)
-- MAIN QUERY: Aggregate all metrics at schema level
SELECT
sm.schema_name,
COUNT(DISTINCT sm.table_name) AS table_count,
SUM(sm.row_count) AS total_rows,
CAST(SUM(sm.size_mb) AS DECIMAL(10,2)) AS total_size_mb,
SUM(sm.column_count) AS total_columns,
-- PII Column Counts by Risk Level
SUM(ISNULL(pii.high_risk_columns, 0)) AS high_risk_pii_columns,
SUM(ISNULL(pii.medium_risk_columns, 0)) AS medium_risk_pii_columns,
SUM(ISNULL(pii.low_risk_columns, 0)) AS low_risk_pii_columns,
-- COMPLIANCE RISK SCORE CALCULATION (0-100 scale)
-- Formula: (High×10 + Medium×5 + Low×2) ÷ Total Columns × 100
-- Rationale:
-- - High risk weighted 10x (critical impact if breached)
-- - Medium risk weighted 5x (significant impact)
-- - Low risk weighted 2x (minor impact but still PII)
-- - Normalized to total columns for fair comparison across schemas
CAST(
(SUM(ISNULL(pii.high_risk_columns, 0)) * 10.0 +
SUM(ISNULL(pii.medium_risk_columns, 0)) * 5.0 +
SUM(ISNULL(pii.low_risk_columns, 0)) * 2.0) /
NULLIF(SUM(sm.column_count), 0) * 100 -- NULLIF prevents divide by zero
AS DECIMAL(5,2)) AS pii_risk_score,
-- COMPLIANCE PRIORITY ASSIGNMENT (UPDATED LOGIC v2.1)
-- Uses both absolute column counts AND risk scores for balanced assessment
--
-- CRITICAL: Schemas requiring immediate attention
-- - 3+ high-risk columns (multiple sensitive data types present)
-- - OR risk score ≥ 40 (>40% of columns are PII-weighted)
-- Examples: Person, HumanResources schemas
--
-- HIGH: Schemas requiring priority attention
-- - Any high-risk columns (at least one critical PII type)
-- - OR risk score ≥ 15 (>15% PII density)
-- Examples: Sales, Customer schemas
--
-- MEDIUM: Schemas requiring standard monitoring
-- - Any medium-risk columns (personal identifiers present)
-- - OR risk score ≥ 5 (>5% PII density)
-- Examples: Operational schemas with occasional PII
--
-- LOW: Schemas with minimal PII exposure
-- - Only low-risk columns OR risk score < 5
-- Examples: Product, Configuration schemas
CASE
WHEN (SUM(ISNULL(pii.high_risk_columns, 0)) >= 3)
OR (CAST((SUM(ISNULL(pii.high_risk_columns, 0)) * 10.0 +
SUM(ISNULL(pii.medium_risk_columns, 0)) * 5.0 +
SUM(ISNULL(pii.low_risk_columns, 0)) * 2.0) /
NULLIF(SUM(sm.column_count), 0) * 100 AS DECIMAL(5,2)) >= 40)
THEN 'CRITICAL'
WHEN (SUM(ISNULL(pii.high_risk_columns, 0)) > 0)
OR (CAST((SUM(ISNULL(pii.high_risk_columns, 0)) * 10.0 +
SUM(ISNULL(pii.medium_risk_columns, 0)) * 5.0 +
SUM(ISNULL(pii.low_risk_columns, 0)) * 2.0) /
NULLIF(SUM(sm.column_count), 0) * 100 AS DECIMAL(5,2)) >= 15)
THEN 'HIGH'
WHEN (SUM(ISNULL(pii.medium_risk_columns, 0)) > 0)
OR (CAST((SUM(ISNULL(pii.high_risk_columns, 0)) * 10.0 +
SUM(ISNULL(pii.medium_risk_columns, 0)) * 5.0 +
SUM(ISNULL(pii.low_risk_columns, 0)) * 2.0) /
NULLIF(SUM(sm.column_count), 0) * 100 AS DECIMAL(5,2)) >= 5)
THEN 'MEDIUM'
ELSE 'LOW'
END AS compliance_priority,
-- DATA CATEGORY CLASSIFICATION
-- Automatic classification based on schema naming conventions
-- Helps organize compliance efforts by data type
CASE
WHEN sm.schema_name LIKE '%Person%' OR sm.schema_name LIKE '%Customer%'
THEN 'Personal Data'
WHEN sm.schema_name LIKE '%HR%' OR sm.schema_name LIKE '%Human%'
OR sm.schema_name LIKE '%Employee%'
THEN 'Employee Data'
WHEN sm.schema_name LIKE '%Sales%' OR sm.schema_name LIKE '%Order%'
THEN 'Transaction Data'
WHEN sm.schema_name LIKE '%Financial%' OR sm.schema_name LIKE '%Payment%'
THEN 'Financial Data'
ELSE 'Operational Data'
END AS data_category,
-- Latest modification timestamp across all tables in schema
MAX(sm.modify_date) AS last_modified,
-- Ordered table list for reference
-- STRING_AGG requires SQL Server 2017+
-- For SQL Server 2016 and earlier, use the commented alternative below
STRING_AGG(sm.table_name, ', ') WITHIN GROUP (ORDER BY sm.table_name) AS table_list
-- ALTERNATIVE for SQL Server 2016 and earlier (uncomment if needed):
-- (SELECT STUFF((SELECT ', ' + sm2.table_name
-- FROM SchemaMetrics sm2
-- WHERE sm2.schema_name = sm.schema_name
-- ORDER BY sm2.table_name
-- FOR XML PATH(''), TYPE).value('.', 'NVARCHAR(MAX)'), 1, 2, '')) AS table_list
FROM SchemaMetrics sm
LEFT JOIN PIIIndicators pii
ON sm.schema_name = pii.schema_name
AND sm.table_name = pii.table_name
LEFT JOIN RelationshipCount rc
ON sm.schema_name = rc.schema_name
AND sm.table_name = rc.table_name
GROUP BY sm.schema_name
-- ORDER BY: Prioritize results for action planning
-- Primary sort: Compliance priority (CRITICAL first)
-- Secondary sort: Total rows (larger datasets first within same priority)
ORDER BY
CASE
WHEN (SUM(ISNULL(pii.high_risk_columns, 0)) >= 3)
OR (CAST((SUM(ISNULL(pii.high_risk_columns, 0)) * 10.0 +
SUM(ISNULL(pii.medium_risk_columns, 0)) * 5.0 +
SUM(ISNULL(pii.low_risk_columns, 0)) * 2.0) /
NULLIF(SUM(sm.column_count), 0) * 100 AS DECIMAL(5,2)) >= 40)
THEN 1 -- CRITICAL schemas first
WHEN (SUM(ISNULL(pii.high_risk_columns, 0)) > 0)
OR (CAST((SUM(ISNULL(pii.high_risk_columns, 0)) * 10.0 +
SUM(ISNULL(pii.medium_risk_columns, 0)) * 5.0 +
SUM(ISNULL(pii.low_risk_columns, 0)) * 2.0) /
NULLIF(SUM(sm.column_count), 0) * 100 AS DECIMAL(5,2)) >= 15)
THEN 2 -- HIGH schemas second
WHEN (SUM(ISNULL(pii.medium_risk_columns, 0)) > 0)
OR (CAST((SUM(ISNULL(pii.high_risk_columns, 0)) * 10.0 +
SUM(ISNULL(pii.medium_risk_columns, 0)) * 5.0 +
SUM(ISNULL(pii.low_risk_columns, 0)) * 2.0) /
NULLIF(SUM(sm.column_count), 0) * 100 AS DECIMAL(5,2)) >= 5)
THEN 3 -- MEDIUM schemas third
ELSE 4 -- LOW schemas last
END,
total_rows DESC;
-- ============================================================================
-- CSV EXPORT WRAPPER FOR VERSION 2
-- ============================================================================
-- INSTRUCTIONS:
-- 1. Uncomment the section below to export directly to CSV
-- 2. Update the file path to your desired location
-- 3. Ensure SQL Server has write permissions to the target directory
-- 4. Run this entire script block
-- ============================================================================
/*
-- Enable xp_cmdshell if not already enabled (requires admin rights)
-- EXEC sp_configure 'show advanced options', 1;
-- RECONFIGURE;
-- EXEC sp_configure 'xp_cmdshell', 1;
-- RECONFIGURE;
-- Export to CSV using BCP utility
DECLARE @sql NVARCHAR(MAX);
DECLARE @outputPath NVARCHAR(500) = 'C:\Exports\GDPR_Schema_Analysis.csv'; -- UPDATE THIS PATH
-- Note: bcp's queryout mode expects a query, so wrap the view in a SELECT
SET @sql = 'bcp "SELECT * FROM ' + DB_NAME() + '.dbo.vw_gdpr_compliance_export" queryout "'
+ @outputPath + '" -c -t, -T -S ' + @@SERVERNAME;
EXEC xp_cmdshell @sql;
*/
-- ALTERNATIVE: Create a view for easier exports
-- CREATE VIEW vw_gdpr_compliance_export AS
-- [Paste the entire Version 2 query here]
-- Then export view results using your preferred tool
-- ============================================================================
-- PRIORITY THRESHOLDS EXPLAINED:
-- ============================================================================
--
-- THRESHOLD DESIGN RATIONALE:
-- These thresholds balance sensitivity with practicality
--
-- CRITICAL: 3+ high-risk columns OR risk score ≥ 40
-- Why: Multiple high-risk columns indicate concentrated PII exposure
-- Risk score ≥40 means >40% of schema contains weighted PII
-- Action Required: Immediate dictionary creation, data masking, audit logging
-- SLA: 1 week for initial assessment and protection
--
-- HIGH: Any high-risk columns OR risk score ≥ 15
-- Why: Even single high-risk columns (email, SSN) require protection
-- 15% threshold catches schemas with moderate PII concentration
-- Action Required: Priority dictionary creation, access controls
-- SLA: 2-3 weeks for protection implementation
--
-- MEDIUM: Any medium-risk columns OR risk score ≥ 5
-- Why: Medium-risk columns (names, addresses) still qualify as personal data
-- 5% threshold includes schemas with sparse but present PII
-- Action Required: Standard dictionary creation, monitoring
-- SLA: 4-6 weeks for implementation
--
-- LOW: Only low-risk columns OR risk score < 5
-- Why: Minimal PII exposure, primarily operational data
-- Action Required: Routine monitoring, no immediate action
-- SLA: Quarterly review cycle
--
-- ============================================================================
-- ============================================================================
-- VERSION 3: DETAILED TABLE-LEVEL ANALYSIS (Deep Dive)
-- ============================================================================
-- USE CASE: Detailed analysis of specific schemas
-- OUTPUT: Table-by-table breakdown with specific PII column identification
-- RUNTIME: Fast for single schema (~3-5 seconds)
-- WHEN TO USE: After Version 2 identifies high-priority schemas
--
-- INSTRUCTIONS:
-- 1. Replace 'Person' in the WHERE clause below (marked with ⚠️) with your target schema name
-- 2. Run query to see table-level detail
-- 3. Use results to plan dictionary terms and regex patterns
-- ============================================================================
WITH TablePIIAnalysis AS (
SELECT
SCHEMA_NAME(t.schema_id) AS schema_name,
t.name AS table_name,
t.create_date,
t.modify_date,
-- Table metrics
SUM(p.rows) AS row_count,
SUM(CAST(a.total_pages AS BIGINT) * 8) / 1024.0 AS size_mb,
COUNT(DISTINCT c.column_id) AS column_count,
-- PII COLUMN DETECTION WITH TYPE LABELS
-- Creates human-readable list of PII columns with their types
-- Format: "ColumnName (TYPE)", "ColumnName (TYPE)"
-- This output is perfect for:
-- - Building dictionary CSV files
-- - Creating regex patterns
-- - Documentation and audit reports
STRING_AGG(
CASE
-- Email/Mail columns: Direct contact, unique identifier
WHEN c.name LIKE '%Email%' OR c.name LIKE '%Mail%'
THEN c.name + ' (EMAIL)'
-- Phone columns: Contact information
WHEN c.name LIKE '%Phone%' OR c.name LIKE '%Mobile%' OR c.name LIKE '%Fax%'
THEN c.name + ' (PHONE)'
-- Credential columns: Critical security data
WHEN c.name LIKE '%Password%' OR c.name LIKE '%PWD%'
THEN c.name + ' (PASSWORD)'
-- National ID columns: Government identifiers
WHEN c.name LIKE '%SSN%' OR c.name LIKE '%National%ID%'
THEN c.name + ' (NATIONAL_ID)'
-- Financial columns: Payment information
WHEN c.name LIKE '%Credit%Card%'
THEN c.name + ' (CREDIT_CARD)'
-- Name columns: Personal identifiers
WHEN c.name LIKE '%FirstName%' OR c.name LIKE '%LastName%'
OR c.name LIKE '%MiddleName%'
THEN c.name + ' (NAME)'
-- Address columns: Location data
WHEN c.name LIKE '%Address%' OR c.name LIKE '%Street%'
THEN c.name + ' (ADDRESS)'
-- Birth date columns: Age-sensitive data
WHEN c.name LIKE '%BirthDate%' OR c.name LIKE '%DOB%'
THEN c.name + ' (BIRTHDATE)'
-- Location columns: Geographic identifiers
WHEN c.name LIKE '%City%' OR c.name LIKE '%Postal%' OR c.name LIKE '%Zip%'
THEN c.name + ' (LOCATION)'
END,
', ' -- Comma separator for readability
) AS potential_pii_columns,
-- RELATIONSHIP COMPLEXITY ANALYSIS
-- Understanding relationships helps with:
-- - Data lineage mapping
-- - Impact assessment for data changes
-- - Identifying data flow patterns
-- Outbound relationships: Tables this table depends on
(SELECT COUNT(*)
FROM sys.foreign_keys fk
WHERE fk.parent_object_id = t.object_id) AS outbound_relationships,
-- Inbound relationships: Tables that depend on this table
(SELECT COUNT(*)
FROM sys.foreign_keys fk
WHERE fk.referenced_object_id = t.object_id) AS inbound_relationships
FROM sys.tables t
LEFT JOIN sys.partitions p ON t.object_id = p.object_id AND p.index_id IN (0, 1)
LEFT JOIN sys.allocation_units a ON p.partition_id = a.container_id
LEFT JOIN sys.columns c ON t.object_id = c.object_id
WHERE SCHEMA_NAME(t.schema_id) = 'Person' -- ⚠️ CHANGE SCHEMA NAME HERE
GROUP BY SCHEMA_NAME(t.schema_id), t.name, t.create_date, t.modify_date, t.object_id
)
SELECT
schema_name,
table_name,
row_count,
CAST(size_mb AS DECIMAL(10,2)) AS size_mb,
column_count,
potential_pii_columns,
outbound_relationships,
inbound_relationships,
(outbound_relationships + inbound_relationships) AS total_relationships,
-- TABLE-LEVEL PRIORITIZATION
-- Assigns priority to individual tables within a schema
-- Useful for sequencing dictionary creation work
CASE
WHEN potential_pii_columns LIKE '%PASSWORD%'
OR potential_pii_columns LIKE '%CREDIT_CARD%'
OR potential_pii_columns LIKE '%NATIONAL_ID%'
THEN 'CRITICAL'
WHEN potential_pii_columns LIKE '%EMAIL%'
OR potential_pii_columns LIKE '%BIRTHDATE%'
OR potential_pii_columns LIKE '%ADDRESS%'
THEN 'HIGH'
WHEN potential_pii_columns LIKE '%NAME%'
OR potential_pii_columns LIKE '%PHONE%'
THEN 'MEDIUM'
WHEN potential_pii_columns IS NOT NULL
THEN 'LOW'
ELSE 'NONE'
END AS pii_priority,
-- Data freshness indicator
DATEDIFF(DAY, modify_date, GETDATE()) AS days_since_modified,
create_date,
modify_date
FROM TablePIIAnalysis
ORDER BY
-- Sort by priority level first
CASE
WHEN potential_pii_columns LIKE '%PASSWORD%' OR potential_pii_columns LIKE '%CREDIT_CARD%' THEN 1
WHEN potential_pii_columns LIKE '%EMAIL%' OR potential_pii_columns LIKE '%BIRTHDATE%' THEN 2
WHEN potential_pii_columns LIKE '%NAME%' OR potential_pii_columns LIKE '%PHONE%' THEN 3
WHEN potential_pii_columns IS NOT NULL THEN 4
ELSE 5
END,
-- Then by row count (larger tables first)
row_count DESC;
-- ============================================================================
-- VERSION 4: EXPORT-READY FORMAT (For Documentation & Reports)
-- ============================================================================
-- USE CASE: Clean output for compliance reports and executive summaries
-- OUTPUT: Formatted data ready for Excel export or compliance documentation
-- RUNTIME: Fast (<10 seconds)
-- WHEN TO USE: Monthly/Quarterly compliance reports, board presentations
--
-- KEY FEATURES:
-- - Human-readable number formatting (with commas)
-- - Binary indicators (Yes/No for PII presence)
-- - Clean column names for non-technical audiences
-- - Risk scores for quantitative tracking
-- ============================================================================
WITH SchemaInventory AS (
SELECT
-- Clean, business-friendly column aliases
s.name AS [Schema Name],
COUNT(DISTINCT t.name) AS [Table Count],
-- Format numbers with thousand separators for readability
-- FORMAT function converts to string with localization
FORMAT(SUM(p.rows), 'N0') AS [Total Rows],
-- Size in MB with 2 decimal places
CAST(SUM(CAST(a.total_pages AS BIGINT) * 8) / 1024.0 AS DECIMAL(10,2)) AS [Size (MB)],
COUNT(DISTINCT c.column_id) AS [Total Columns],
-- Calculate risk score using same formula as Version 2
-- Stored in CTE for reuse in priority calculation
CAST(
(COUNT(DISTINCT CASE
WHEN c.name LIKE '%Email%' OR c.name LIKE '%Phone%'
OR c.name LIKE '%SSN%' OR c.name LIKE '%Password%'
OR c.name LIKE '%Credit%' OR c.name LIKE '%National%ID%'
THEN c.column_id END) * 10.0 +
COUNT(DISTINCT CASE
WHEN c.name LIKE '%FirstName%' OR c.name LIKE '%LastName%'
OR c.name LIKE '%Address%' OR c.name LIKE '%BirthDate%'
THEN c.column_id END) * 5.0 +
COUNT(DISTINCT CASE
WHEN c.name LIKE '%Title%' OR c.name LIKE '%Gender%'
THEN c.column_id END) * 2.0) /
NULLIF(COUNT(DISTINCT c.column_id), 0) * 100
AS DECIMAL(5,2)) AS risk_score,
-- Business-friendly data classification
CASE
WHEN s.name LIKE '%Person%' OR s.name LIKE '%Customer%'
THEN 'Contains Personal Data'
WHEN s.name LIKE '%HR%' OR s.name LIKE '%Human%' OR s.name LIKE '%Employee%'
THEN 'Contains Employee Data'
WHEN s.name LIKE '%Sales%'
THEN 'Contains Transaction Data'
ELSE 'Operational Data'
END AS [Data Classification],
-- Binary PII indicator for quick scanning
CASE
WHEN COUNT(DISTINCT CASE
WHEN c.name LIKE '%Email%' OR c.name LIKE '%Phone%'
OR c.name LIKE '%SSN%' OR c.name LIKE '%Password%'
THEN c.column_id END) > 0
THEN 'Yes'
ELSE 'No'
END AS [Contains PII],
-- Priority calculation with updated thresholds
CASE
WHEN COUNT(DISTINCT CASE
WHEN c.name LIKE '%Email%' OR c.name LIKE '%Phone%'
OR c.name LIKE '%SSN%' OR c.name LIKE '%Password%'
THEN c.column_id END) >= 3
OR CAST((COUNT(DISTINCT CASE
WHEN c.name LIKE '%Email%' OR c.name LIKE '%Phone%'
OR c.name LIKE '%SSN%' OR c.name LIKE '%Password%'
THEN c.column_id END) * 10.0) /
NULLIF(COUNT(DISTINCT c.column_id), 0) * 100 AS DECIMAL(5,2)) >= 40
THEN 'CRITICAL'
WHEN COUNT(DISTINCT CASE
WHEN c.name LIKE '%Email%' OR c.name LIKE '%Phone%'
OR c.name LIKE '%SSN%' OR c.name LIKE '%Password%'
THEN c.column_id END) > 0
OR CAST((COUNT(DISTINCT CASE
WHEN c.name LIKE '%Email%' OR c.name LIKE '%Phone%'
OR c.name LIKE '%SSN%' OR c.name LIKE '%Password%'
THEN c.column_id END) * 10.0) /
NULLIF(COUNT(DISTINCT c.column_id), 0) * 100 AS DECIMAL(5,2)) >= 15
THEN 'HIGH'
WHEN COUNT(DISTINCT CASE
WHEN c.name LIKE '%FirstName%' OR c.name LIKE '%Address%'
THEN c.column_id END) > 0
THEN 'MEDIUM'
ELSE 'LOW'
END AS [Priority],
-- Formatted date for reports
FORMAT(MAX(t.modify_date), 'yyyy-MM-dd') AS [Last Modified],
-- Semicolon-separated list (easier to read than commas in CSV)
STRING_AGG(t.name, '; ') WITHIN GROUP (ORDER BY t.name) AS [Tables]
FROM sys.schemas s
LEFT JOIN sys.tables t ON s.schema_id = t.schema_id
LEFT JOIN sys.partitions p ON t.object_id = p.object_id AND p.index_id IN (0, 1)
LEFT JOIN sys.allocation_units a ON p.partition_id = a.container_id
LEFT JOIN sys.columns c ON t.object_id = c.object_id
WHERE s.name NOT IN ('sys', 'INFORMATION_SCHEMA', 'guest')
GROUP BY s.name
)
SELECT
[Schema Name],
[Table Count],
[Total Rows],
[Size (MB)],
[Total Columns],
[Data Classification],
[Contains PII],
[Priority],
CAST(risk_score AS DECIMAL(5,2)) AS [Risk Score],
[Last Modified],
[Tables]
FROM SchemaInventory
ORDER BY
-- Sort by priority for executive summary
CASE [Priority]
WHEN 'CRITICAL' THEN 1
WHEN 'HIGH' THEN 2
WHEN 'MEDIUM' THEN 3
ELSE 4
END,
-- Then by data volume
CAST(REPLACE([Total Rows], ',', '') AS BIGINT) DESC;
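-- As a worked example of the risk score formula above (assumed counts, for
-- illustration only): a schema with 20 columns, of which 2 match high-risk
-- patterns, 3 medium-risk, and 1 low-risk, scores
-- (2 * 10.0 + 3 * 5.0 + 1 * 2.0) / 20 * 100 = 185.00.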
-- ============================================================================
-- COMPREHENSIVE USAGE GUIDE & RECOMMENDATIONS:
-- ============================================================================
--
-- WORKFLOW SEQUENCE:
--
-- 1. INITIAL ASSESSMENT (Week 1)
-- → Run VERSION 2 (Compliance-Focused)
-- → Identify CRITICAL and HIGH priority schemas
-- → Document baseline risk scores
-- → Present findings to stakeholders
--
-- 2. DEEP DIVE ANALYSIS (Week 2-3)
-- → Run VERSION 3 (Detailed) for each CRITICAL schema
-- → List specific PII columns discovered
-- → Map data relationships and dependencies
-- → Create data flow diagrams
--
-- 3. DICTIONARY PLANNING (Week 3-4)
-- → Use VERSION 3 results to build dictionary CSV files
-- → Extract actual values for dictionary terms
-- → Create regex patterns based on column names
-- → Define confidence scores and conditions
--
-- 4. COMPLIANCE REPORTING (Ongoing)
-- → Run VERSION 4 (Export-Ready) monthly/quarterly
-- → Track risk score trends over time
-- → Generate executive compliance reports
-- → Document remediation progress
--
-- 5. ROUTINE MONITORING (Ongoing)
-- → Run VERSION 1 (Basic Enhanced) weekly
-- → Alert on new tables in Person/HR schemas
-- → Monitor schema modification dates
-- → Track data volume growth
--
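-- The routine monitoring alert above can be sketched as a standalone query.
-- The 7-day window and schema list are assumptions; adjust them to your alert
-- cadence, then uncomment to run:
--
-- SELECT s.name AS [Schema], t.name AS [Table], t.create_date, t.modify_date
-- FROM sys.tables t
-- JOIN sys.schemas s ON t.schema_id = s.schema_id
-- WHERE s.name IN ('Person', 'HumanResources')
--   AND t.create_date >= DATEADD(DAY, -7, GETDATE());
--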
-- INTEGRATION RECOMMENDATIONS:
--
-- - Schedule VERSION 2 as SQL Agent Job (monthly)
-- - Create alerts for risk score increases >10 points
-- - Integrate results with data governance dashboard
-- - Link to data lineage visualization tools
-- - Feed into data catalog metadata
--
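-- A minimal sketch of the SQL Agent scheduling suggestion above. The job name
-- and the wrapper procedure are placeholders (save the VERSION 2 query as a
-- stored procedure, or paste it into @command directly); uncomment to create:
--
-- EXEC msdb.dbo.sp_add_job @job_name = N'PII Risk Assessment (Monthly)';
-- EXEC msdb.dbo.sp_add_jobstep @job_name = N'PII Risk Assessment (Monthly)',
--      @step_name = N'Run compliance query',
--      @subsystem = N'TSQL',
--      @command = N'EXEC dbo.usp_PIIComplianceReport;',  -- assumed wrapper proc
--      @database_name = N'AdventureWorks2022';
-- EXEC msdb.dbo.sp_add_jobschedule @job_name = N'PII Risk Assessment (Monthly)',
--      @name = N'Monthly', @freq_type = 16,          -- monthly
--      @freq_interval = 1, @freq_recurrence_factor = 1; -- day 1, every month
-- EXEC msdb.dbo.sp_add_jobserver @job_name = N'PII Risk Assessment (Monthly)';
--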
-- CUSTOMIZATION TIPS:
--
-- - Add industry-specific PII patterns (HIPAA, PCI-DSS, etc.)
-- - Adjust risk score weights based on business impact
-- - Modify priority thresholds to match org risk appetite
-- - Extend data categories for domain-specific classification
-- - Add columns for data retention policies
--
-- ============================================================================
-- PII DETECTION PATTERNS REFERENCE:
-- ============================================================================
--
-- HIGH RISK PATTERNS (Weight: 10x):
-- %Email%, %Mail% → Email addresses (unique identifiers)
-- %Phone%, %Mobile%, %Fax% → Phone numbers (direct contact)
-- %SSN%, %National%ID% → Government identifiers
-- %Password%, %PWD% → Authentication credentials
-- %Credit%, %Card% → Payment card information
--
-- MEDIUM RISK PATTERNS (Weight: 5x):
-- %FirstName%, %LastName% → Personal names
-- %MiddleName% → Additional name identifiers
-- %Address%, %Street% → Physical addresses
-- %City%, %PostalCode% → Location data
-- %BirthDate%, %DOB% → Age-sensitive information
--
-- LOW RISK PATTERNS (Weight: 2x):
-- %Title% → Honorifics (Mr., Dr., etc.)
-- %Suffix% → Name suffixes (Jr., Sr.)
-- %Gender%, %Sex% → Demographic data
-- %Marital% → Marital status
--
-- TO EXTEND PATTERNS:
-- 1. Add new CASE statements in PIIIndicators CTE
-- 2. Update risk score calculation if adding new risk levels
-- 3. Document new patterns in this reference section
-- 4. Test on sample schemas before production deployment
--
-- EXAMPLES OF EXTENSIONS:
-- - HIPAA: %Diagnosis%, %Prescription%, %Treatment%
-- - PCI-DSS: %CVV%, %CardHolder%, %Expiry%
-- - Industry: %Contract%, %Agreement%, %NDA%
--
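-- As a sketch, a PCI-DSS extension would add a clause like the following to
-- each HIGH RISK CASE expression in the script (and to the reference above):
--
--   OR c.name LIKE '%CVV%' OR c.name LIKE '%CardHolder%' OR c.name LIKE '%Expiry%'
--
-- Before deploying, preview which columns a new pattern would actually match:
--
-- SELECT s.name AS [Schema], t.name AS [Table], c.name AS [Column]
-- FROM sys.columns c
-- JOIN sys.tables t ON c.object_id = t.object_id
-- JOIN sys.schemas s ON t.schema_id = s.schema_id
-- WHERE c.name LIKE '%CVV%' OR c.name LIKE '%CardHolder%' OR c.name LIKE '%Expiry%';
--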
-- ============================================================================
-- TROUBLESHOOTING:
-- ============================================================================
--
-- Issue: Row counts showing as 0
-- Solution: Run UPDATE STATISTICS or REBUILD indexes
--
-- Issue: STRING_AGG function not found
-- Solution: Upgrade to SQL Server 2017+ or use FOR XML PATH alternative
--
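-- A pre-2017 replacement for the STRING_AGG line (same semicolon-separated,
-- name-ordered output; note that table names containing XML-reserved
-- characters would need the TYPE directive and .value() to avoid entitization):
--
-- STUFF((SELECT '; ' + t2.name
--        FROM sys.tables t2
--        WHERE t2.schema_id = s.schema_id
--        ORDER BY t2.name
--        FOR XML PATH('')), 1, 2, '') AS [Tables]
--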
-- Issue: Performance slow on large databases
-- Solution: Add a WHERE clause to filter to specific schemas
--           Persist a snapshot of sys.columns into an indexed user table
--           (indexed views cannot reference system catalog views)
--
-- Issue: Risk scores seem too high/low
-- Solution: Adjust weights in risk score formula (currently 10/5/2)
-- Review and update PII pattern matches
--
-- Issue: Missing expected schemas in output
-- Solution: Check the schema name against the exclusion list in the WHERE
--           clause ('sys', 'INFORMATION_SCHEMA', 'guest')
--           Verify the schema contains actual tables (not just an empty schema)
--
-- ============================================================================
-- END OF SCRIPT
-- ============================================================================