Database Inventory
Overview
In this foundational hands-on workshop, you'll learn how to conduct a comprehensive database inventory that serves as the cornerstone of any effective data governance program. We'll guide you through the systematic process of cataloging and documenting all data assets across your organization, emphasizing that you cannot govern what you don't know exists.
By the end of this workshop, you will be able to:
Conduct systematic database discovery and cataloging
Document data assets for comprehensive organizational visibility
Assess data-related risks and identify compliance gaps
Establish clear data ownership and stewardship assignments
Map governance policies to specific datasets and systems
Create baseline documentation that supports regulatory compliance requirements
Workshop Attendees:
Database Administrator (Technical Lead)
Data Governance Officer
Business Process Owners for Each Schema
Information Security Representative
Compliance/Legal Representative (if regulated industry)
This workshop will provide you with the essential foundation needed for effective data asset management, risk assessment, ownership assignment, policy application, and compliance mapping across your data landscape.
The workshops that follow are designed to give you an overview of the key principles involved in analyzing your database inventory.
Data discovery in large data warehouses now leverages several families of machine-learning (ML) algorithms, each suited to a different aspect of understanding and extracting insights from massive datasets. Here are the key approaches:
Clustering
K-means, DBSCAN, and hierarchical clustering help identify natural groupings in data without predefined labels. These are particularly useful for customer segmentation, anomaly detection, and finding similar data patterns across different tables or datasets.
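To make the clustering idea concrete, here is a minimal one-dimensional k-means sketch in plain Python. It is illustrative only (real discovery work would use a library such as scikit-learn); the sample values and the naive evenly-spaced initialization are assumptions for the demo:

```python
# Minimal 1-D k-means sketch (illustrative only; real workloads would use
# a library implementation such as scikit-learn's KMeans).
def kmeans_1d(points, k=2, iterations=10):
    mn, mx = min(points), max(points)
    # Naive initialization: spread k centroids evenly between min and max.
    centroids = [mn + i * (mx - mn) / (k - 1) for i in range(k)]
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centroid.
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: abs(p - centroids[i]))
            clusters[nearest].append(p)
        # Update step: move each centroid to the mean of its cluster.
        centroids = [sum(c) / len(c) if c else centroids[i]
                     for i, c in enumerate(clusters)]
    return sorted(round(c, 6) for c in centroids)

# Two obvious groups, around 1.0 and 10.0:
print(kmeans_1d([1.0, 1.1, 0.9, 10.0, 10.2, 9.8]))  # → [1.0, 10.0]
```

The same assign-then-update loop generalizes to many dimensions; libraries add smarter initialization (k-means++) and convergence checks.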
Supervised Learning
Random Forests, Gradient Boosting (XGBoost, LightGBM), and decision trees are employed to understand relationships between variables and predict outcomes. They're especially valuable for identifying which features are most important in determining specific business outcomes.
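As a hedged illustration of feature importance, the sketch below scores two binary features by the Gini impurity reduction of a single decision-stump split. The tiny dataset and its column meanings are invented for the example; a real pipeline would read `feature_importances_` from a trained Random Forest:

```python
# Minimal feature-importance sketch: score each binary feature by how much a
# single split on it reduces Gini impurity (illustrative only).
def gini(labels):
    if not labels:
        return 0.0
    p = sum(labels) / len(labels)              # fraction of positive labels
    return 1.0 - p * p - (1 - p) * (1 - p)

def split_gain(rows, feature, label_index=-1):
    left = [r[label_index] for r in rows if r[feature] == 0]
    right = [r[label_index] for r in rows if r[feature] == 1]
    parent = [r[label_index] for r in rows]
    weighted = (len(left) * gini(left) + len(right) * gini(right)) / len(rows)
    return gini(parent) - weighted

# Invented columns: (is_weekend, noise, churned); is_weekend predicts exactly.
rows = [(1, 0, 1), (1, 1, 1), (0, 0, 0), (0, 1, 0)]
print(split_gain(rows, 0))  # informative feature → 0.5
print(split_gain(rows, 1))  # noise feature → 0.0
```

Tree ensembles essentially accumulate this per-split gain across many trees to rank features.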
Dimensionality Reduction
PCA (Principal Component Analysis), t-SNE, and autoencoders help visualize and understand high-dimensional data by reducing it to manageable dimensions while preserving important patterns. This is crucial when dealing with hundreds or thousands of variables.
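For intuition, here is a minimal two-feature PCA sketch: it builds the 2x2 covariance matrix and solves its eigenvalues in closed form. The sample columns are assumptions; with hundreds of variables you would use a library implementation instead:

```python
import math

# Minimal PCA sketch for two features: eigenvalues of the 2x2 covariance
# matrix [[a, b], [b, c]] via the quadratic formula (illustrative only).
def explained_variance_ratio(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    a = sum((x - mx) ** 2 for x in xs) / (n - 1)                     # var(x)
    c = sum((y - my) ** 2 for y in ys) / (n - 1)                     # var(y)
    b = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / (n - 1)   # cov(x,y)
    disc = math.sqrt((a - c) ** 2 + 4 * b * b)
    lam1, lam2 = (a + c + disc) / 2, (a + c - disc) / 2
    # Share of total variance captured by the first principal component.
    return lam1 / (lam1 + lam2)

# ys is exactly 2 * xs, so one component explains (to floating-point
# precision) all of the variance:
print(explained_variance_ratio([1, 2, 3, 4], [2, 4, 6, 8]))  # ≈ 1.0
```

A high first-component ratio like this signals redundant columns, which is exactly what dimensionality reduction exploits at scale.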
Association Rule Mining
Apriori and FP-Growth algorithms discover interesting relationships between variables in large databases; they are commonly used for market basket analysis and finding co-occurring patterns in transactional data.
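The core Apriori idea, counting item pairs that co-occur above a minimum support threshold, can be sketched in a few lines of plain Python. The baskets below are invented sample data:

```python
from itertools import combinations
from collections import Counter

# Minimal association-rule sketch: keep item pairs whose co-occurrence count
# meets a minimum support (the first level of Apriori, illustrative only).
def frequent_pairs(transactions, min_support=0.5):
    counts = Counter()
    for basket in transactions:
        for pair in combinations(sorted(basket), 2):
            counts[pair] += 1
    threshold = min_support * len(transactions)
    return {pair for pair, n in counts.items() if n >= threshold}

baskets = [{'bread', 'milk'}, {'bread', 'butter'},
           {'bread', 'butter', 'milk'}, {'milk'}]
# At 50% support, two pairs qualify (set order may vary when printed):
print(frequent_pairs(baskets))  # {('bread', 'butter'), ('bread', 'milk')}
```

Full Apriori repeats this pruning at triple size and beyond, using the fact that a frequent itemset can only contain frequent subsets.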
Anomaly Detection
Isolation Forests, One-Class SVM, and Local Outlier Factor (LOF) identify unusual patterns that don't conform to expected behavior, which is critical for fraud detection, quality control, and data quality assessment.
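As a simple stand-in for those algorithms, the sketch below flags values by z-score, a statistical cousin of the same idea. The order-quantity values are invented; note that the mean and standard deviation are themselves outlier-sensitive, which is one reason robust methods such as Isolation Forest exist:

```python
from statistics import mean, stdev

# Minimal anomaly-detection sketch: flag values whose z-score exceeds a
# threshold (illustrative only; production systems would use Isolation
# Forest, LOF, or at least a robust median/MAD variant).
def zscore_outliers(values, threshold=2.0):
    m, s = mean(values), stdev(values)
    return [v for v in values if abs(v - m) / s > threshold]

# A run of order-quantity values with one suspicious spike:
print(zscore_outliers([10, 11, 9, 10, 12, 11, 10, 95]))  # → [95]
```

The same pattern applied per column is a quick first pass at data-quality screening before heavier ML methods.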
Deep Learning
Deep Neural Networks, LSTMs for time-series data, and Graph Neural Networks capture complex relationships in interconnected data. These are particularly powerful for unstructured data such as text fields and logs, or when dealing with complex temporal patterns.
Natural Language Processing
Topic modeling (LDA), Named Entity Recognition, and Transformer models process text fields in databases to extract meaningful information from unstructured text columns, comments, or documentation.
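For a minimal flavor of text mining on database columns, the sketch below extracts e-mail addresses from a free-text comment field with a regular expression. The comments are invented and the pattern is deliberately simple, not RFC-complete; real entity extraction would use an NLP library:

```python
import re

# Minimal text-mining sketch: pull e-mail addresses out of a free-text
# comment column (illustrative only; the pattern is simplified on purpose).
EMAIL = re.compile(r'[\w.+-]+@[\w-]+\.[\w.-]+')

comments = [
    "Customer asked us to reply to jane.doe@example.com by Friday.",
    "Escalated to support; no contact details provided.",
    "Duplicate of ticket raised by ops-team@example.org yesterday.",
]
found = [m for text in comments for m in EMAIL.findall(text)]
print(found)  # → ['jane.doe@example.com', 'ops-team@example.org']
```

Even this crude pass is useful in governance work: it reveals PII hiding in comment columns that name-based column scans would miss.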
Statistical Methods
Kolmogorov-Smirnov tests, Benford's Law analysis, and correlation analysis help understand data distributions, detect data quality issues, and identify statistical relationships between columns.
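Benford's Law predicts that the leading digit d of naturally occurring numbers appears with probability log10(1 + 1/d). A minimal sketch, using an invented exponentially growing series as stand-in data:

```python
import math
from collections import Counter

# Minimal Benford's-Law sketch: compare observed leading-digit frequencies
# against the expected log distribution (illustrative only; meaningful
# checks need large, naturally-occurring datasets).
def benford_expected(digit):
    return math.log10(1 + 1 / digit)       # P(leading digit = digit)

def leading_digit_freq(values):
    digits = [int(str(abs(v))[0]) for v in values if v >= 1]
    counts = Counter(digits)
    return {d: counts.get(d, 0) / len(digits) for d in range(1, 10)}

# Exponential growth tends to follow Benford's Law closely:
values = [round(1.17 ** i, 2) for i in range(1, 100)]
observed = leading_digit_freq(values)
print(round(benford_expected(1), 3))  # → 0.301
print(round(observed[1], 3))          # close to 0.301 for this series
```

A large gap between observed and expected frequencies in, say, an expense column is a classic red flag for fabricated or truncated data.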
Inventory Analysis
Let's jump straight into the deep end!
There are plenty of tools that will help you scan and analyze your data warehouse (DW). Once connected to the AdventureWorks2022 database, we're going to run an inventory query that delivers:
Automated PII detection and risk assessment
Usage pattern analysis for optimization
Compliance mapping to meet regulatory requirements
Operational flags for maintenance & security
Data Quality indicators & sizing metrics
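The PII detection in that query is purely name-based. Here is the same heuristic sketched in plain Python so you can see its logic at a glance; the substring patterns and risk thresholds mirror the SQL script later in this workshop, and the sample column names are assumptions:

```python
import re

# Name-based PII heuristic mirroring the inventory query's LIKE patterns:
# substring matches on column names, then a bucketed risk level.
# Heuristic only -- manual review is still required.
PII_PATTERNS = re.compile(
    r'ssn|social|email|phone|address|birth|salary|wage|credit|account'
    r'|passport|license', re.IGNORECASE)

def pii_risk(column_names):
    hits = sum(1 for name in column_names if PII_PATTERNS.search(name))
    if hits == 0:
        return hits, 'NO_PII_DETECTED'
    if hits <= 2:
        return hits, 'LOW_PII_RISK'
    if hits <= 5:
        return hits, 'MEDIUM_PII_RISK'
    return hits, 'HIGH_PII_RISK'

print(pii_risk(['EmailAddress', 'PhoneNumber', 'FirstName', 'AddressLine1']))
# → (3, 'MEDIUM_PII_RISK')
```

Notice that FirstName is not flagged: name-pattern lists always have gaps, which is why the script's own comments insist on manual review.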
DBeaver
DBeaver Community is a free cross-platform database tool for developers, database administrators, analysts, and everyone who works with data. It supports all popular SQL databases, including MySQL, MariaDB, PostgreSQL, SQLite, the Apache family, and more.
From the drop-down list, select SQL Server

Enter the connection details:
Host: localhost / pdc.pentaho.lab
Port: 1433 (the SQL Server default)
Database / Schema: AdventureWorks2022
Username: sa
Password: StrongPassword123
Show All Schemas: ✅ ON
Trust Server Certificate: ✅ ON

Click Test Connection.
If prompted, download the database driver.

Successful connection!

Expand the Schemas.

Inventory Analysis
We're going to create a temporary table. The advantage of this approach is that multiple workshop participants can each create their own #DataGovernanceInventory table simultaneously without conflicts, since each session gets its own copy.
Execute the following script, which conducts a comprehensive analysis of the AdventureWorks2022 database and holds the results in a temporary table.
-- ============================================================================
-- ENHANCED GOVERNANCE-FOCUSED INVENTORY QUERY v2.0
-- ============================================================================
-- Purpose: Comprehensive database inventory with governance, compliance,
-- security, and operational metadata for enterprise data management
--
-- Features:
-- • Automated PII detection and risk assessment
-- • Usage pattern analysis for optimization
-- • Compliance mapping with regulatory requirements
-- • Operational flags for maintenance and security
-- • Data quality indicators and sizing metrics
--
-- Target Audience: Data governance teams, DBAs, compliance officers, security
-- Database: AdventureWorks2022 (adaptable to any SQL Server database)
-- ============================================================================
-- ============================================================================
-- SECTION 1: DATABASE CONNECTION AND ENVIRONMENT VERIFICATION
-- ============================================================================
-- Connect to target database (modify as needed)
USE AdventureWorks2022;
-- Verify connection and capture environment information
SELECT
@@SERVERNAME AS server_name, -- SQL Server instance name
DB_NAME() AS database_name, -- Current database name
@@VERSION AS sql_version, -- SQL Server version info
GETDATE() AS [current_time], -- Current timestamp (bracketed to avoid reserved keyword)
USER_NAME() AS [current_user], -- Current database user (bracketed to avoid reserved keyword)
SYSTEM_USER AS login_name; -- Current system login name
-- Display connection confirmation messages
PRINT 'Connected to database: ' + DB_NAME();
PRINT 'Starting governance inventory analysis at: ' + CAST(GETDATE() AS VARCHAR(50));
-- ============================================================================
-- SECTION 2: TEMPORARY TABLE CREATION FOR RESULTS STORAGE
-- ============================================================================
-- Drop existing temp table if it exists (for re-runs)
IF OBJECT_ID('tempdb..#DataGovernanceInventory') IS NOT NULL
DROP TABLE #DataGovernanceInventory;
-- Create comprehensive temporary table to store inventory results
CREATE TABLE #DataGovernanceInventory (
-- ========================================================================
-- CORE IDENTIFICATION FIELDS
-- ========================================================================
inventory_id INT IDENTITY(1,1) PRIMARY KEY, -- Unique identifier for each record
schema_name NVARCHAR(128) NOT NULL, -- Database schema name
table_name NVARCHAR(128) NOT NULL, -- Table name within schema
full_table_name NVARCHAR(257) NOT NULL, -- Fully qualified table name
-- ========================================================================
-- STRUCTURAL ANALYSIS METRICS
-- ========================================================================
total_columns INT NOT NULL, -- Total number of columns
required_columns INT NOT NULL, -- Count of NOT NULL columns
identity_columns INT NOT NULL, -- Count of identity/auto-increment columns
computed_columns INT NOT NULL, -- Count of computed/calculated columns
-- ========================================================================
-- STORAGE AND SIZE METRICS
-- ========================================================================
estimated_rows BIGINT NOT NULL, -- Approximate row count
total_size_mb DECIMAL(10,2) NOT NULL, -- Total allocated space in MB
used_size_mb DECIMAL(10,2) NOT NULL, -- Actually used space in MB
data_size_mb DECIMAL(10,2) NOT NULL, -- Pure data size in MB
-- ========================================================================
-- TEMPORAL METADATA
-- ========================================================================
create_date DATETIME NOT NULL, -- Table creation timestamp
modify_date DATETIME NOT NULL, -- Last structural modification
days_since_modified INT, -- Age of last modification
-- ========================================================================
-- USAGE PATTERN ANALYTICS
-- ========================================================================
total_read_operations BIGINT, -- Total SELECT operations since restart
total_write_operations BIGINT, -- Total DML operations since restart
days_since_last_access INT, -- Days since last read operation
-- ========================================================================
-- DATA CLASSIFICATION AND QUALITY
-- ========================================================================
size_category VARCHAR(20) NOT NULL, -- Table size classification
potential_pii_columns INT NOT NULL, -- Count of suspected PII columns
pii_risk_level VARCHAR(20) NOT NULL, -- PII risk assessment
-- ========================================================================
-- BUSINESS CONTEXT AND PURPOSE
-- ========================================================================
business_purpose NVARCHAR(500) NOT NULL, -- Business function mapping
data_sensitivity VARCHAR(50) NOT NULL, -- Data sensitivity classification
compliance_scope NVARCHAR(500) NOT NULL, -- Applicable regulations
-- ========================================================================
-- OPERATIONAL MANAGEMENT
-- ========================================================================
operational_flag VARCHAR(100) NOT NULL, -- Operational recommendations/alerts
-- ========================================================================
-- EXTENDED GOVERNANCE FIELDS (FOR FUTURE ENHANCEMENT)
-- ========================================================================
data_owner NVARCHAR(100) NULL, -- Business data owner
data_steward NVARCHAR(100) NULL, -- Technical data steward
retention_period_years INT NULL, -- Data retention requirement in years
backup_frequency VARCHAR(50) NULL, -- Required backup frequency
archive_eligible BIT DEFAULT 0, -- Whether table is archival candidate
-- ========================================================================
-- AUDIT AND TRACKING
-- ========================================================================
analysis_timestamp DATETIME DEFAULT GETDATE(), -- When this analysis was performed
analysis_user NVARCHAR(128) DEFAULT SYSTEM_USER -- Who performed the analysis
);
PRINT 'Temporary table #DataGovernanceInventory created successfully';
-- ============================================================================
-- SECTION 3: EXECUTE COMPREHENSIVE INVENTORY QUERY AND POPULATE TABLE
-- ============================================================================
PRINT 'Starting comprehensive data inventory analysis...';
-- ============================================================================
-- CTE 1: TABLE SIZE METRICS
-- ============================================================================
-- Purpose: Pre-calculate accurate storage metrics to avoid complex GROUP BY
-- Benefits: Better performance, cleaner main query, separate data/used/total sizes
WITH TableSizeMetrics AS (
SELECT
t.object_id,
-- Total allocated space including indexes and unused space
SUM(a.total_pages) * 8 / 1024.0 AS size_mb,
-- Actually used space (excludes free space in pages)
SUM(a.used_pages) * 8 / 1024.0 AS used_size_mb,
-- Pure data size (excludes indexes and free space)
SUM(a.data_pages) * 8 / 1024.0 AS data_size_mb
FROM sys.tables t
JOIN sys.partitions p ON t.object_id = p.object_id
JOIN sys.allocation_units a ON p.partition_id = a.container_id
WHERE p.index_id IN (0,1) -- Heap tables (0) and clustered indexes (1) only
GROUP BY t.object_id
),
-- ============================================================================
-- CTE 2: PII DETECTION ENGINE
-- ============================================================================
-- Purpose: Automatically identify potential PII columns based on naming patterns
-- Compliance Impact: Critical for GDPR, CCPA, and other privacy regulations
-- Note: This is heuristic-based; manual review recommended for accuracy
PIIDetection AS (
SELECT
t.object_id,
-- Count columns that match common PII naming patterns
COUNT(CASE WHEN c.name LIKE '%ssn%' OR c.name LIKE '%social%' -- Social Security
OR c.name LIKE '%email%' OR c.name LIKE '%phone%' -- Contact info
OR c.name LIKE '%address%' OR c.name LIKE '%birth%' -- Personal details
OR c.name LIKE '%salary%' OR c.name LIKE '%wage%' -- Financial info
OR c.name LIKE '%credit%' OR c.name LIKE '%account%' -- Account data
OR c.name LIKE '%passport%' OR c.name LIKE '%license%' -- ID documents
THEN 1 END) AS potential_pii_columns
FROM sys.tables t
JOIN sys.columns c ON t.object_id = c.object_id
GROUP BY t.object_id
),
-- ============================================================================
-- CTE 3: TABLE USAGE STATISTICS
-- ============================================================================
-- Purpose: Capture table access patterns for operational optimization
-- Benefits: Identify unused tables, performance bottlenecks, archival candidates
-- Note: Statistics reset on SQL Server restart
TableStats AS (
SELECT
t.object_id,
-- Last access timestamps for different operation types
s.last_user_seek, -- Last index seek operation
s.last_user_scan, -- Last table/index scan operation
s.last_user_lookup, -- Last key lookup operation
-- Aggregate read operations (seeks + scans + lookups)
s.user_seeks + s.user_scans + s.user_lookups AS total_reads,
-- Write operations (INSERT, UPDATE, DELETE)
s.user_updates
FROM sys.tables t
LEFT JOIN sys.dm_db_index_usage_stats s ON t.object_id = s.object_id AND s.index_id IN (0,1)
)
-- ============================================================================
-- INSERT GOVERNANCE INVENTORY RESULTS INTO TEMPORARY TABLE
-- ============================================================================
-- Note: Extended governance fields (data_owner, data_steward, etc.) will be NULL initially
-- Note: analysis_timestamp and analysis_user fields will auto-populate with DEFAULT values
INSERT INTO #DataGovernanceInventory (
-- Core identification fields
schema_name, table_name, full_table_name,
-- Structural analysis metrics
total_columns, required_columns, identity_columns, computed_columns,
-- Storage and size metrics
estimated_rows, total_size_mb, used_size_mb, data_size_mb,
-- Temporal metadata
create_date, modify_date, days_since_modified,
-- Usage pattern analytics
total_read_operations, total_write_operations, days_since_last_access,
-- Data classification and quality
size_category, potential_pii_columns, pii_risk_level,
-- Business context and governance
business_purpose, data_sensitivity, compliance_scope, operational_flag
)
-- ============================================================================
-- MAIN QUERY: COMPREHENSIVE INVENTORY WITH GOVERNANCE METADATA
-- ============================================================================
SELECT
-- ========================================================================
-- SECTION 1: CORE TABLE IDENTIFICATION
-- ========================================================================
s.name AS schema_name, -- Schema containing the table
t.name AS table_name, -- Table name within schema
SCHEMA_NAME(t.schema_id) + '.' + t.name AS full_table_name, -- Fully qualified name for referencing
-- ========================================================================
-- SECTION 2: ENHANCED COLUMN ANALYSIS
-- ========================================================================
-- Provides detailed breakdown of column types for data modeling insights
COUNT(c.column_id) AS total_columns, -- Total number of columns in the table
SUM(CASE WHEN c.is_nullable = 0 THEN 1 ELSE 0 END) AS required_columns, -- Count of NOT NULL columns (data quality indicator)
SUM(CASE WHEN c.is_identity = 1 THEN 1 ELSE 0 END) AS identity_columns, -- Count of auto-increment/identity columns
SUM(CASE WHEN c.is_computed = 1 THEN 1 ELSE 0 END) AS computed_columns, -- Count of calculated/computed columns
-- ========================================================================
-- SECTION 3: STORAGE METRICS (ENHANCED ACCURACY)
-- ========================================================================
-- More accurate storage calculations using pre-computed CTE results
COALESCE(p.rows, 0) AS estimated_rows, -- Estimated row count (may not be exact, updated by statistics)
COALESCE(tsm.size_mb, 0) AS total_size_mb, -- Total allocated space in MB (includes indexes and free space)
COALESCE(tsm.used_size_mb, 0) AS used_size_mb, -- Actually used space in MB (excludes free space in pages)
COALESCE(tsm.data_size_mb, 0) AS data_size_mb, -- Pure data size in MB (excludes indexes and free space)
-- ========================================================================
-- SECTION 4: TEMPORAL METADATA FOR CHANGE TRACKING
-- ========================================================================
t.create_date, -- When the table was originally created
t.modify_date, -- Last time the table structure was modified
DATEDIFF(day, t.modify_date, GETDATE()) AS days_since_modified, -- Age of last structural change
-- ========================================================================
-- SECTION 5: USAGE PATTERNS FOR OPERATIONAL INSIGHTS
-- ========================================================================
-- Critical for identifying unused tables and optimization opportunities
-- Note: Statistics reset on SQL Server restart, so interpret with caution for new servers
COALESCE(ts.total_reads, 0) AS total_read_operations, -- Total SELECT operations since last restart
COALESCE(ts.user_updates, 0) AS total_write_operations, -- Total INSERT/UPDATE/DELETE operations
-- Calculate days since last access (any type of read operation)
CASE
WHEN ts.last_user_seek IS NOT NULL OR ts.last_user_scan IS NOT NULL OR ts.last_user_lookup IS NOT NULL
THEN DATEDIFF(day, COALESCE(ts.last_user_seek, ts.last_user_scan, ts.last_user_lookup), GETDATE())
ELSE NULL -- NULL means no recorded access (could be new table or no activity since restart)
END AS days_since_last_access,
-- ========================================================================
-- SECTION 6: DATA QUALITY INDICATORS
-- ========================================================================
-- Categorize tables by size for different governance approaches
CASE
WHEN COALESCE(p.rows, 0) = 0 THEN 'EMPTY' -- No data - potential candidate for removal
WHEN COALESCE(p.rows, 0) < 1000 THEN 'SMALL' -- Reference/lookup tables, low impact
WHEN COALESCE(p.rows, 0) < 100000 THEN 'MEDIUM' -- Standard operational tables
WHEN COALESCE(p.rows, 0) < 1000000 THEN 'LARGE' -- High-volume transactional tables
ELSE 'VERY_LARGE' -- Enterprise-scale data requiring special attention
END AS size_category,
-- ========================================================================
-- SECTION 7: PII RISK ASSESSMENT FOR PRIVACY COMPLIANCE
-- ========================================================================
-- Automated PII detection helps prioritize privacy protection efforts
-- Note: This is heuristic-based detection - manual review recommended for accuracy
COALESCE(pii.potential_pii_columns, 0) AS potential_pii_columns, -- Count of columns with PII-like names
-- Risk-based classification for privacy governance prioritization
CASE
WHEN COALESCE(pii.potential_pii_columns, 0) = 0 THEN 'NO_PII_DETECTED' -- Likely safe from privacy regulations
WHEN COALESCE(pii.potential_pii_columns, 0) <= 2 THEN 'LOW_PII_RISK' -- Minimal privacy exposure
WHEN COALESCE(pii.potential_pii_columns, 0) <= 5 THEN 'MEDIUM_PII_RISK' -- Moderate privacy exposure
ELSE 'HIGH_PII_RISK' -- High privacy exposure - priority for review
END AS pii_risk_level,
-- ========================================================================
-- SECTION 8: ENHANCED BUSINESS CONTEXT MAPPING
-- ========================================================================
-- Maps technical schemas to business functions for stakeholder communication
-- This helps non-technical stakeholders understand data purpose and importance
CASE
WHEN s.name = 'Person' THEN 'Customer and Employee Identity Management'
WHEN s.name = 'Sales' THEN 'Revenue Operations and Customer Transactions'
WHEN s.name = 'Production' THEN 'Manufacturing and Inventory Management'
WHEN s.name = 'Purchasing' THEN 'Vendor Relations and Procurement'
WHEN s.name = 'HumanResources' THEN 'Employee Lifecycle Management'
-- Enhanced detection for system and operational schemas
WHEN s.name LIKE '%Log%' OR s.name LIKE '%Audit%' THEN 'Audit and Logging'
WHEN s.name LIKE '%Config%' OR s.name LIKE '%Setting%' THEN 'System Configuration'
ELSE 'System/Reference Data' -- Default for unrecognized schemas
END AS business_purpose,
-- ========================================================================
-- SECTION 9: DATA SENSITIVITY CLASSIFICATION (WITH PII AWARENESS)
-- ========================================================================
-- Enhanced sensitivity classification that considers both schema context and PII content
-- This drives security controls, access restrictions, and compliance requirements
CASE
-- High sensitivity: Known sensitive schemas OR detected high PII content
WHEN s.name IN ('Person', 'HumanResources') OR COALESCE(pii.potential_pii_columns, 0) > 3
THEN 'HIGH - Personal/Sensitive Data'
-- Medium sensitivity: Business-critical data OR any PII detected
WHEN s.name = 'Sales' OR COALESCE(pii.potential_pii_columns, 0) > 0
THEN 'MEDIUM - Business Sensitive'
-- Medium operational: Core business operations data
WHEN s.name IN ('Production', 'Purchasing')
THEN 'MEDIUM - Operational Data'
-- Low sensitivity: Reference data and system tables
ELSE 'LOW - Reference Data'
END AS data_sensitivity,
-- ========================================================================
-- SECTION 10: EXPANDED COMPLIANCE MAPPING
-- ========================================================================
-- Maps data to relevant regulatory frameworks for compliance planning
-- Helps legal and compliance teams understand regulatory exposure
CASE
-- High PII content triggers comprehensive privacy regulations
WHEN s.name = 'Person' OR COALESCE(pii.potential_pii_columns, 0) > 2
THEN 'GDPR, CCPA, PIPEDA, PII Protection'
-- HR data has specific employment law and privacy requirements
WHEN s.name = 'HumanResources'
THEN 'Employment Law, GDPR, CCPA, PII Protection'
-- Financial data requires SOX compliance and potentially PCI-DSS
WHEN s.name = 'Sales'
THEN 'SOX, Financial Reporting, PCI-DSS (if payment data)'
-- Audit trails have specific retention and integrity requirements
WHEN s.name LIKE '%Log%' OR s.name LIKE '%Audit%'
THEN 'SOX, Audit Requirements, Data Retention'
-- Default corporate governance for other data types
ELSE 'Standard Corporate Governance'
END AS compliance_scope,
-- ========================================================================
-- SECTION 11: OPERATIONAL RECOMMENDATIONS ENGINE
-- ========================================================================
-- Automated flags to guide operational decisions and risk management
-- These flags help DBAs and data stewards prioritize their work
CASE
-- Flag empty tables that may be obsolete and safe to remove
WHEN COALESCE(p.rows, 0) = 0 AND DATEDIFF(day, t.create_date, GETDATE()) > 90
THEN 'REVIEW - Empty table older than 90 days'
-- Flag unused tables that may be candidates for archival to save storage costs
WHEN COALESCE(ts.total_reads, 0) = 0 AND DATEDIFF(day, t.create_date, GETDATE()) > 180
THEN 'REVIEW - No reads in 180+ days, consider archival'
-- Security concern: PII detected in unexpected schemas (potential data leakage)
WHEN COALESCE(pii.potential_pii_columns, 0) > 0 AND s.name NOT IN ('Person', 'HumanResources')
THEN 'SECURITY - Unexpected PII in non-HR/Person schema'
-- Performance concern: Large tables consuming resources but rarely accessed
WHEN COALESCE(tsm.size_mb, 0) > 1000 AND COALESCE(ts.total_reads, 0) < 100
THEN 'PERFORMANCE - Large table with low usage'
-- No issues detected - table appears to be operating normally
ELSE 'NORMAL'
END AS operational_flag
-- ============================================================================
-- FROM CLAUSE: SYSTEM CATALOG JOINS FOR COMPREHENSIVE METADATA
-- ============================================================================
FROM sys.schemas s
-- Core table information - every user table belongs to a schema
JOIN sys.tables t ON s.schema_id = t.schema_id
-- Column details for structural analysis - every table has at least one column
JOIN sys.columns c ON t.object_id = c.object_id
-- Row count estimates (LEFT JOIN handles edge cases gracefully)
-- Note: Some tables may not have partition info if they're brand new
LEFT JOIN sys.partitions p ON t.object_id = p.object_id AND p.index_id IN (0,1)
-- Pre-calculated size metrics from CTE (LEFT JOIN handles tables with no allocation units)
LEFT JOIN TableSizeMetrics tsm ON t.object_id = tsm.object_id
-- PII detection results from CTE (LEFT JOIN ensures all tables included even if no PII detected)
LEFT JOIN PIIDetection pii ON t.object_id = pii.object_id
-- Usage statistics from CTE (LEFT JOIN handles tables with no recorded usage)
LEFT JOIN TableStats ts ON t.object_id = ts.object_id
-- ============================================================================
-- FILTERS: SCOPE AND DATA QUALITY CONTROLS
-- ============================================================================
WHERE t.type = 'U' -- User tables only - excludes views ('V'), system tables, functions ('FN', 'IF'), etc.
-- ============================================================================
-- GROUPING: REQUIRED FOR AGGREGATE FUNCTIONS
-- ============================================================================
-- Group by all non-aggregated columns to enable COUNT() and SUM() functions
-- This is required because we're counting columns and summing identity/computed column flags
GROUP BY
-- Core identifiers
s.name, t.name, t.schema_id, t.object_id,
-- Partition data (already aggregated at table level)
p.rows,
-- Temporal fields (single values per table)
t.create_date, t.modify_date,
-- Pre-calculated metrics from CTEs (already aggregated)
tsm.size_mb, tsm.used_size_mb, tsm.data_size_mb,
-- PII detection results (already aggregated)
pii.potential_pii_columns,
-- Usage statistics (already aggregated)
ts.total_reads, ts.user_updates, ts.last_user_seek, ts.last_user_scan, ts.last_user_lookup
-- ============================================================================
-- ORDERING: RISK-BASED PRIORITIZATION FOR GOVERNANCE
-- ============================================================================
-- Sort results to show highest-risk and most important data first
-- This helps governance teams focus on the most critical items
ORDER BY
-- Primary sort: PII risk level (highest risk first)
CASE
WHEN COALESCE(pii.potential_pii_columns, 0) > 3 THEN 1 -- High PII risk tables get top priority
WHEN s.name IN ('Person', 'HumanResources') THEN 2 -- Known sensitive schemas come second
ELSE 3 -- Other data comes last
END,
-- Secondary sort: Alphabetical by schema within priority groups
s.name,
-- Tertiary sort: Largest tables first within each schema (size often correlates with importance)
COALESCE(p.rows, 0) DESC;
-- ============================================================================
-- SECTION 4: RESULTS VERIFICATION AND SUMMARY REPORTING
-- ============================================================================
PRINT 'Governance inventory analysis completed successfully!';
-- Verify insertion success and provide record count (without using variables to avoid batch issues)
SELECT 'Total tables analyzed: ' + CAST(COUNT(*) AS VARCHAR(10)) AS record_count_message
FROM #DataGovernanceInventory;
-- ========================================================================
-- EXECUTIVE SUMMARY: HIGH-LEVEL GOVERNANCE METRICS
-- ========================================================================
-- Provides key metrics for executive reporting and governance dashboards
SELECT
'GOVERNANCE INVENTORY SUMMARY - ' + CAST(GETDATE() AS VARCHAR(20)) AS report_section,
COUNT(*) AS total_tables_analyzed, -- Total number of user tables in database
SUM(estimated_rows) AS total_estimated_rows, -- Combined row count across all tables
CAST(SUM(total_size_mb) AS DECIMAL(10,2)) AS total_database_size_mb, -- Total storage consumption
-- Risk indicators for executive attention
COUNT(CASE WHEN pii_risk_level IN ('HIGH_PII_RISK', 'MEDIUM_PII_RISK') THEN 1 END) AS tables_with_privacy_risk,
COUNT(CASE WHEN operational_flag <> 'NORMAL' THEN 1 END) AS tables_requiring_attention,
-- Compliance scope overview
COUNT(DISTINCT compliance_scope) AS different_compliance_frameworks
FROM #DataGovernanceInventory;
-- ========================================================================
-- DATA SENSITIVITY DISTRIBUTION ANALYSIS
-- ========================================================================
-- Shows how data is distributed across sensitivity levels for resource allocation
SELECT
'DATA SENSITIVITY DISTRIBUTION' AS report_section,
data_sensitivity,
COUNT(*) AS table_count, -- Number of tables at this sensitivity level
CAST(SUM(total_size_mb) AS DECIMAL(10,2)) AS total_size_mb, -- Storage consumed by this sensitivity level
CAST(AVG(estimated_rows) AS BIGINT) AS avg_rows_per_table, -- Average table size in this category
-- Calculate percentage of total tables
CAST(COUNT(*) * 100.0 / (SELECT COUNT(*) FROM #DataGovernanceInventory) AS DECIMAL(5,1)) AS percent_of_tables
FROM #DataGovernanceInventory
GROUP BY data_sensitivity
ORDER BY
-- Sort by sensitivity priority for governance focus
CASE data_sensitivity
WHEN 'HIGH - Personal/Sensitive Data' THEN 1
WHEN 'MEDIUM - Business Sensitive' THEN 2
WHEN 'MEDIUM - Operational Data' THEN 3
ELSE 4
END;
-- ========================================================================
-- HIGH PRIORITY ITEMS REQUIRING IMMEDIATE GOVERNANCE ACTION
-- ========================================================================
-- Identifies tables that need immediate attention from governance, security, or operations teams
SELECT
'HIGH PRIORITY ITEMS FOR IMMEDIATE REVIEW' AS report_section,
schema_name,
table_name,
pii_risk_level, -- Privacy risk assessment
data_sensitivity, -- Business sensitivity classification
operational_flag, -- Specific operational concern
estimated_rows, -- Size context for impact assessment
CAST(total_size_mb AS DECIMAL(10,2)) AS size_mb, -- Storage impact
days_since_last_access, -- Usage pattern indicator
compliance_scope -- Regulatory implications
FROM #DataGovernanceInventory
WHERE
-- Include tables with privacy risks
pii_risk_level IN ('HIGH_PII_RISK', 'MEDIUM_PII_RISK')
-- Or tables with operational issues
OR operational_flag <> 'NORMAL'
ORDER BY
-- Prioritize by risk level
CASE pii_risk_level
WHEN 'HIGH_PII_RISK' THEN 1
WHEN 'MEDIUM_PII_RISK' THEN 2
ELSE 3
END,
-- Then by operational concern
CASE operational_flag
WHEN 'SECURITY - Unexpected PII in non-HR/Person schema' THEN 1
WHEN 'PERFORMANCE - Large table with low usage' THEN 2
WHEN 'REVIEW - No reads in 180+ days, consider archival' THEN 3
ELSE 4
END,
-- Finally by size (larger tables get more attention)
total_size_mb DESC;
-- ========================================================================
-- COMPLETION STATUS AND NEXT STEPS
-- ========================================================================
-- Provides clear indication of completion and guidance for next actions
SELECT
'ANALYSIS COMPLETED' AS status,
'Analysis completed at: ' + CAST(GETDATE() AS VARCHAR(50)) AS completion_timestamp,
'Results stored in: #DataGovernanceInventory (session scope)' AS temp_table_location,
'Records available until session ends or server restart' AS data_retention_note,
'Use additional queries below for deeper analysis' AS next_steps;
-- ============================================================================
-- USAGE NOTES AND RECOMMENDATIONS
-- ============================================================================
-- 1. Run during off-peak hours for best performance on large databases
-- 2. Consider creating indexed views for frequently accessed governance reports
-- 3. Review PII detection results manually - heuristic matching has limitations
-- 4. Usage statistics reset on SQL Server restart - consider baseline measurements
-- 5. For regulatory compliance, supplement with data flow analysis and classification
-- 6. Operational flags provide starting points - always validate before taking action
-- ============================================================================
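On point 4 above: the "days since last access" figures only reach back to the last instance restart, because the underlying usage counters (typically sourced from sys.dm_db_index_usage_stats) are cleared when SQL Server restarts. A small sketch for recording that baseline alongside your results:

```sql
-- Record how far back the usage statistics actually reach.
-- Any "days since last access" figure is only meaningful from this date.
SELECT
sqlserver_start_time AS stats_valid_from,
DATEDIFF(DAY, sqlserver_start_time, GETDATE()) AS days_of_usage_data
FROM sys.dm_os_sys_info;
```

Storing this timestamp with each inventory snapshot makes it obvious when a "never accessed" flag simply reflects a recent restart.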
-- ============================================================================
-- SECTION 5: OPTIONAL ADVANCED ANALYSIS QUERIES
-- ============================================================================
-- These queries provide additional insights for specific governance use cases
-- Uncomment and modify as needed for your specific requirements
/*
-- ========================================================================
-- SCHEMA-LEVEL SUMMARY FOR DEPARTMENTAL REPORTING
-- ========================================================================
-- Useful for understanding data distribution across business areas
-- Helps with resource allocation and departmental data governance planning
SELECT
'SCHEMA-LEVEL SUMMARY' AS report_type,
schema_name,
COUNT(*) as table_count, -- Number of tables in this schema
SUM(estimated_rows) as total_rows, -- Combined rows across all tables
CAST(SUM(total_size_mb) AS DECIMAL(10,2)) as total_size_mb, -- Storage consumption by schema
CAST(AVG(potential_pii_columns) AS DECIMAL(3,1)) as avg_pii_columns_per_table, -- PII density
-- Calculate percentage of database size consumed by this schema
CAST(SUM(total_size_mb) * 100.0 / (SELECT SUM(total_size_mb) FROM #DataGovernanceInventory) AS DECIMAL(5,1)) AS percent_of_database_size
FROM #DataGovernanceInventory
GROUP BY schema_name
ORDER BY total_size_mb DESC;
-- ========================================================================
-- COMPLIANCE FRAMEWORK IMPACT ANALYSIS
-- ========================================================================
-- Shows which regulatory frameworks affect how much data
-- Critical for compliance budgeting and risk assessment
SELECT
'COMPLIANCE IMPACT ANALYSIS' AS report_type,
compliance_scope,
COUNT(*) as affected_tables, -- Number of tables under this compliance scope
CAST(SUM(total_size_mb) AS DECIMAL(10,2)) as total_data_size_mb, -- Amount of data subject to these regulations
SUM(estimated_rows) as total_affected_rows, -- Number of records under regulatory scope
-- Show percentage of database under this compliance framework
CAST(COUNT(*) * 100.0 / (SELECT COUNT(*) FROM #DataGovernanceInventory) AS DECIMAL(5,1)) AS percent_of_tables
FROM #DataGovernanceInventory
GROUP BY compliance_scope
ORDER BY affected_tables DESC;
-- ========================================================================
-- OPERATIONAL EFFICIENCY OPPORTUNITIES
-- ========================================================================
-- Identifies specific optimization opportunities with potential impact
-- Helps prioritize operational improvements and cost savings
SELECT
'OPERATIONAL OPTIMIZATION OPPORTUNITIES' AS report_type,
operational_flag,
COUNT(*) as table_count,
CAST(SUM(total_size_mb) AS DECIMAL(10,2)) as potential_storage_impact_mb, -- Storage that could be reclaimed
SUM(estimated_rows) as affected_rows,
-- Show tables that could be archived or optimized
STRING_AGG(schema_name + '.' + table_name, ', ') WITHIN GROUP (ORDER BY total_size_mb DESC) as example_tables
FROM #DataGovernanceInventory
WHERE operational_flag <> 'NORMAL' -- Only show tables with actionable recommendations
GROUP BY operational_flag
ORDER BY potential_storage_impact_mb DESC;
-- ========================================================================
-- PII RISK HEAT MAP BY BUSINESS AREA
-- ========================================================================
-- Cross-tabulates business purpose with PII risk for targeted remediation
-- Helps security teams focus on the highest-risk business areas
SELECT
'PII RISK BY BUSINESS AREA' AS report_type,
business_purpose,
SUM(CASE WHEN pii_risk_level = 'HIGH_PII_RISK' THEN 1 ELSE 0 END) as high_risk_tables,
SUM(CASE WHEN pii_risk_level = 'MEDIUM_PII_RISK' THEN 1 ELSE 0 END) as medium_risk_tables,
SUM(CASE WHEN pii_risk_level = 'LOW_PII_RISK' THEN 1 ELSE 0 END) as low_risk_tables,
SUM(CASE WHEN pii_risk_level = 'NO_PII_DETECTED' THEN 1 ELSE 0 END) as no_pii_detected,
COUNT(*) as total_tables_in_area,
-- Calculate risk percentage for this business area
CAST((SUM(CASE WHEN pii_risk_level IN ('HIGH_PII_RISK', 'MEDIUM_PII_RISK') THEN 1 ELSE 0 END) * 100.0 / COUNT(*)) AS DECIMAL(5,1)) as pii_risk_percentage
FROM #DataGovernanceInventory
GROUP BY business_purpose
ORDER BY high_risk_tables DESC, medium_risk_tables DESC;
-- ========================================================================
-- DATA LIFECYCLE MANAGEMENT CANDIDATES
-- ========================================================================
-- Identifies tables for archival, retention policy application, or cleanup
-- Supports data lifecycle management and storage cost optimization
SELECT
'DATA LIFECYCLE MANAGEMENT CANDIDATES' AS report_type,
schema_name + '.' + table_name as full_table_name,
size_category,
estimated_rows,
CAST(total_size_mb AS DECIMAL(10,2)) as size_mb,
days_since_last_access,
days_since_modified,
operational_flag,
-- Suggest lifecycle action based on usage patterns
CASE
WHEN days_since_last_access > 365 AND total_size_mb > 100 THEN 'ARCHIVE - Large unused table'
WHEN days_since_last_access > 180 AND estimated_rows = 0 THEN 'DELETE - Empty unused table'
WHEN days_since_modified > 730 AND days_since_last_access > 90 THEN 'ARCHIVE - Stale data'
WHEN total_size_mb > 1000 AND days_since_last_access > 30 THEN 'REVIEW - Large infrequently used'
ELSE 'RETAIN - Active or recent'
END as lifecycle_recommendation
FROM #DataGovernanceInventory
WHERE
-- Focus on tables that could benefit from lifecycle management
days_since_last_access > 30 OR estimated_rows = 0 OR total_size_mb > 500
ORDER BY
CASE
WHEN days_since_last_access > 365 AND total_size_mb > 100 THEN 1
WHEN days_since_last_access > 180 AND estimated_rows = 0 THEN 2
ELSE 3
END,
total_size_mb DESC;
-- ========================================================================
-- EXPORT TO PERMANENT TABLE (OPTIONAL)
-- ========================================================================
-- Uncomment and modify the table name to save results permanently
-- Useful for historical tracking and trend analysis over time
-- Note: Modify the table name to include date/time for version control
/*
SELECT
GETDATE() as snapshot_date, -- Add timestamp for historical tracking
*
INTO dbo.DataGovernanceInventory_20250806 -- Modify date as needed
FROM #DataGovernanceInventory;
PRINT 'Results exported to permanent table: dbo.DataGovernanceInventory_20250806';
*/
-- ========================================================================
-- CUSTOM ANALYSIS TEMPLATE
-- ========================================================================
-- Template for additional custom analysis based on specific business needs
/*
SELECT
'CUSTOM ANALYSIS' AS report_type,
-- Add your custom fields here
schema_name,
table_name,
business_purpose,
data_sensitivity,
-- Add custom calculations or groupings as needed
COUNT(*) OVER (PARTITION BY schema_name) as tables_in_schema,
RANK() OVER (ORDER BY total_size_mb DESC) as size_rank
FROM #DataGovernanceInventory
WHERE
-- Add your custom filters here
data_sensitivity LIKE '%HIGH%' -- Example: focus on high sensitivity data
ORDER BY
-- Add your custom sorting here
total_size_mb DESC;
*/
Analysis Results
OK, there's a lot to get through, so let's review the summary highlights.
The script creates a single temp table (#DataGovernanceInventory) and returns six result sets, shown as tabs in DBeaver:
Results 1 - Basic connection details
Results 1 (2) - Total tables analyzed
Results 1 (3) - Governance summary
Results 1 (4) - Data sensitivity distribution
Results 1 (5) - High priority items for immediate review
Results 1 (6) - Analysis completion details

Take a look at:
Results 1 (5) - High Priority Items for Immediate Review
Let's dive into the main part of the report, bearing in mind that this data is representative!
CRITICAL:
High-risk PII exposure: all Person and HumanResources schema tables are flagged as HIGH_PII_RISK with HIGH - Personal/Sensitive Data sensitivity.
Key Problem Areas:
Person.Person (19,972 rows) - Contains personal identity data
Person.Password (19,972 rows) - 🚨 CRITICAL: Password data should be heavily encrypted
Person.EmailAddress (19,972 rows) - Email addresses are prime PII
HumanResources.Employee (290 rows) - Employee personal data
HumanResources.JobCandidate (13 rows) - Candidate personal information
Major Classification Error: Sales.CreditCard table shows "NO_PII_DETECTED" despite obviously containing payment card data. This is a serious compliance risk for PCI-DSS requirements.
KEY CONCERNS:
Inconsistent logic: rows 20-41 show NO_PII_DETECTED but still carry MEDIUM sensitivity levels.
Examples of Misclassification:
Sales.SalesOrderHeader (31,465 rows) - Large table marked as no PII but medium sensitivity
Sales.Customer (19,820 rows) - Customer table with no PII detected (likely wrong)
Sales.CreditCard (19,118 rows) - 🚨 CRITICAL: Credit card table showing no PII!
Security & Compliance Risks:
Password table exists with 19K+ records - needs immediate encryption audit
Credit card data not properly classified as PII - major compliance risk
Large volume exposure: Person schema alone has ~20K records of personal data
Data Governance Gaps:
PII detection logic is missing financial data patterns (credit cards, customer IDs)
Inconsistent classification between PII risk and sensitivity levels
No operational flags visible beyond "REVIE" (the flag text appears truncated in the results grid)
Compliance Exposure:
All high-risk items fall under GDPR, CCPA, PIPEDA - significant regulatory exposure
Employment Law compliance needed for HR data
PCI-DSS implications for credit card data not properly flagged
So it looks like there are a couple of issues to address:
IMMEDIATE NEXT STEPS:
Phase 1: Urgent Security Actions (This Week)
Audit Password Storage: Verify Person.Password table encryption immediately
Review Credit Card Data: Investigate why Sales.CreditCard shows NO_PII_DETECTED
Access Control Review: Restrict access to all HIGH_PII_RISK tables
Data Encryption Audit: Ensure all personal data is encrypted at rest
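For the password-storage audit, a quick first pass might look like the sketch below. It is not a full audit: in AdventureWorks the Person.Password table stores PasswordHash and PasswordSalt columns, and sys.dm_database_encryption_keys reports Transparent Data Encryption (TDE) status.

```sql
-- 1. Is the database encrypted at rest (TDE)?
--    encryption_state = 3 means encrypted; no row means no TDE key exists.
SELECT DB_NAME(database_id) AS database_name, encryption_state
FROM sys.dm_database_encryption_keys
WHERE database_id = DB_ID('AdventureWorks2022');

-- 2. Do the stored values look like salted hashes rather than plain text?
--    Uniform lengths and base64-style content suggest hashing.
SELECT TOP (5)
PasswordHash,
PasswordSalt,
LEN(PasswordHash) AS hash_length
FROM Person.Password;
```

Remember that hashing is not encryption: even if the values are hashed, verify the algorithm and salting practice meet your security policy.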
Phase 2: Classification Fixes (Next 2 Weeks)
Enhance PII Detection: Add patterns for customer IDs, credit cards, account numbers
Manual Review: Validate all tables showing NO_PII but MEDIUM sensitivity
Schema Analysis: Deep dive into Sales schema for hidden PII
Update Classifications: Fix inconsistent risk vs sensitivity mappings
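As a starting point for the detection fix, a sweep like the one below surfaces financial columns the current name patterns miss (this is how Sales.CreditCard slipped through). The LIKE patterns are illustrative and should be tuned to your own naming conventions:

```sql
-- Find payment-card and account columns the current heuristics overlook.
SELECT
s.name AS schema_name,
t.name AS table_name,
c.name AS column_name
FROM sys.schemas s
JOIN sys.tables t ON s.schema_id = t.schema_id
JOIN sys.columns c ON t.object_id = c.object_id
WHERE c.name LIKE '%card%' -- catches CardNumber, CardType
OR c.name LIKE '%account%number%'
OR c.name LIKE '%iban%'
OR t.name LIKE '%CreditCard%' -- use table-name context, not just columns
ORDER BY s.name, t.name, c.name;
```

Any hits here should then be folded back into the classification CASE expressions as HIGH_PII_RISK patterns.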
Phase 3: Governance Framework (Next Month)
Data Stewardship: Assign data owners for each schema
Access Policies: Implement role-based access for PII data
Monitoring Setup: Deploy data access monitoring for high-risk tables
Compliance Documentation: Document data flows for regulatory compliance
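For the monitoring step, SQL Server Audit can track reads and writes against specific high-risk tables. The sketch below is illustrative only: the audit and specification names and the file path are placeholders, and creating a server audit requires elevated permissions (ALTER ANY SERVER AUDIT).

```sql
-- Create a server-level audit target (run in master; adjust the path).
CREATE SERVER AUDIT PII_Access_Audit
TO FILE (FILEPATH = 'C:\Audits\');
ALTER SERVER AUDIT PII_Access_Audit WITH (STATE = ON);

-- In AdventureWorks2022: capture all access to the highest-risk table.
CREATE DATABASE AUDIT SPECIFICATION PII_Access_Spec
FOR SERVER AUDIT PII_Access_Audit
ADD (SELECT, INSERT, UPDATE, DELETE ON Person.Password BY public)
WITH (STATE = ON);
```

Start with one or two of the HIGH_PII_RISK tables identified above; auditing everything at once generates noise that nobody reviews.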
PII Assessment
The Inventory Analysis has given us a great overview.
However, where to start?
The Inventory Analysis identified:
High-risk PII exposure: all Person and HumanResources schema tables are flagged as HIGH_PII_RISK with HIGH - Personal/Sensitive Data sensitivity.
That analysis was conducted at table level; we now need to drill down further:
Column-by-column PII detection with detailed categorization
Risk-based classification with specific recommendations
Compliance framework mapping for regulatory requirements
Data Catalog output for Governance tools
Execute the following script in DBeaver; it conducts a PII risk assessment at column level:
-- ============================================================================
-- COLUMN-LEVEL PII DETECTION AND GOVERNANCE ANALYSIS SCRIPT
-- ============================================================================
-- Purpose: Analyze database at the individual column level to identify PII,
-- classify data sensitivity, and provide actionable governance recommendations
--
-- Features:
-- • Column-by-column PII detection with detailed categorization
-- • Risk-based classification with specific recommendations
-- • Compliance framework mapping for regulatory requirements
-- • Data catalog-ready output for governance tools
-- • Granular analysis beyond table-level assessment
--
-- Target Audience: Data governance teams, privacy officers, compliance teams
-- Database: AdventureWorks2022 (adaptable to any SQL Server database)
-- ============================================================================
-- ============================================================================
-- SECTION 1: DATABASE CONNECTION AND SETUP
-- ============================================================================
-- Connect to target database (modify as needed)
USE AdventureWorks2022;
-- ============================================================================
-- SECTION 2: TEMPORARY TABLE FOR COLUMN-LEVEL ANALYSIS
-- ============================================================================
-- Drop existing temp table if it exists (for re-runs)
IF OBJECT_ID('tempdb..#ColumnLevelPII') IS NOT NULL
DROP TABLE #ColumnLevelPII;
-- Create comprehensive temporary table to store column-level PII analysis
CREATE TABLE #ColumnLevelPII (
-- ========================================================================
-- CORE IDENTIFICATION FIELDS
-- ========================================================================
analysis_id INT IDENTITY(1,1) PRIMARY KEY, -- Unique identifier for each column record
schema_name NVARCHAR(128) NOT NULL, -- Database schema name
table_name NVARCHAR(128) NOT NULL, -- Table name within schema
column_name NVARCHAR(128) NOT NULL, -- Individual column name
full_column_name NVARCHAR(386) NOT NULL, -- Fully qualified column name (schema.table.column: 128 + 1 + 128 + 1 + 128 = 386)
-- ========================================================================
-- TECHNICAL COLUMN METADATA
-- ========================================================================
data_type NVARCHAR(128) NOT NULL, -- Complete data type with length/precision
max_length INT, -- Maximum length for string types
is_nullable BIT NOT NULL, -- Whether column allows NULL values
is_identity BIT NOT NULL, -- Whether column is auto-increment/identity
is_computed BIT NOT NULL, -- Whether column is computed/calculated
-- ========================================================================
-- PII CLASSIFICATION AND RISK ASSESSMENT
-- ========================================================================
pii_category VARCHAR(50), -- Specific type of PII (e.g., PERSONAL_NAME, CONTACT_INFO)
pii_risk_level VARCHAR(20) NOT NULL, -- Risk level assessment (HIGH/MEDIUM/LOW/NO_PII)
data_sensitivity VARCHAR(50) NOT NULL, -- Business sensitivity classification
-- ========================================================================
-- GOVERNANCE AND COMPLIANCE
-- ========================================================================
compliance_scope NVARCHAR(500) NOT NULL, -- Applicable regulatory frameworks
recommended_action VARCHAR(200), -- Specific action recommendations (ENCRYPT, MASK, etc.)
-- ========================================================================
-- AUDIT AND TRACKING
-- ========================================================================
analysis_timestamp DATETIME DEFAULT GETDATE() -- When this analysis was performed
);
PRINT 'Starting column-level PII analysis...';
-- ============================================================================
-- SECTION 3: MAIN COLUMN-LEVEL PII ANALYSIS QUERY
-- ============================================================================
-- Insert comprehensive column-level analysis results
INSERT INTO #ColumnLevelPII (
schema_name, table_name, column_name, full_column_name,
data_type, max_length, is_nullable, is_identity, is_computed,
pii_category, pii_risk_level, data_sensitivity, compliance_scope, recommended_action
)
SELECT
-- ========================================================================
-- BASIC COLUMN IDENTIFICATION
-- ========================================================================
s.name AS schema_name, -- Schema containing the table
t.name AS table_name, -- Table containing the column
c.name AS column_name, -- Individual column name
s.name + '.' + t.name + '.' + c.name AS full_column_name, -- Complete column path for referencing
-- ========================================================================
-- DETAILED DATA TYPE INFORMATION
-- ========================================================================
-- Build complete data type string with appropriate length/precision details
ty.name +
CASE
-- For single-byte character types, include length ('MAX' when max_length = -1)
WHEN ty.name IN ('varchar', 'char')
THEN '(' + CASE WHEN c.max_length = -1 THEN 'MAX' ELSE CAST(c.max_length AS VARCHAR(10)) END + ')'
-- For Unicode types, sys.columns stores max_length in bytes, so halve it for characters
WHEN ty.name IN ('nvarchar', 'nchar')
THEN '(' + CASE WHEN c.max_length = -1 THEN 'MAX' ELSE CAST(c.max_length / 2 AS VARCHAR(10)) END + ')'
-- For numeric types, include precision and scale
WHEN ty.name IN ('decimal', 'numeric')
THEN '(' + CAST(c.precision AS VARCHAR(10)) + ',' + CAST(c.scale AS VARCHAR(10)) + ')'
-- For other types, no additional info needed
ELSE ''
END AS data_type,
c.max_length, -- Raw maximum length value
c.is_nullable, -- NULL constraint information
c.is_identity, -- Identity column flag
c.is_computed, -- Computed column flag
-- ========================================================================
-- COMPREHENSIVE PII CATEGORY CLASSIFICATION
-- ========================================================================
-- Categorize columns based on naming patterns and content type indicators
CASE
-- ====================================================================
-- GOVERNMENT AND IDENTITY DOCUMENTS (Highest Risk)
-- ====================================================================
WHEN c.name LIKE '%ssn%' OR c.name LIKE '%social%security%'
THEN 'GOVERNMENT_ID' -- Social Security Numbers
WHEN c.name LIKE '%passport%' OR c.name LIKE '%license%' OR c.name LIKE '%permit%'
THEN 'IDENTIFICATION_DOCUMENTS' -- Official ID documents
-- ====================================================================
-- PERSONAL NAMES (High Risk for Identity)
-- ====================================================================
WHEN c.name LIKE '%first%name%' OR c.name LIKE '%fname%' OR c.name LIKE '%given%name%'
THEN 'PERSONAL_NAME' -- First/given names
WHEN c.name LIKE '%last%name%' OR c.name LIKE '%surname%' OR c.name LIKE '%family%name%'
THEN 'PERSONAL_NAME' -- Last/family names
WHEN c.name LIKE '%middle%name%' OR c.name LIKE '%full%name%' OR c.name LIKE '%display%name%'
THEN 'PERSONAL_NAME' -- Middle names and full names
WHEN c.name LIKE '%maiden%' OR c.name LIKE '%suffix%' OR c.name LIKE '%prefix%' OR c.name LIKE '%title%'
THEN 'PERSONAL_NAME' -- Name components and titles
-- ====================================================================
-- CONTACT INFORMATION (Direct Communication Channels)
-- ====================================================================
WHEN c.name LIKE '%email%' OR c.name LIKE '%mail%'
THEN 'CONTACT_INFO' -- Email addresses
WHEN c.name LIKE '%phone%' OR c.name LIKE '%mobile%' OR c.name LIKE '%telephone%' OR c.name LIKE '%fax%'
THEN 'CONTACT_INFO' -- Phone numbers and communication
WHEN c.name LIKE '%address%' OR c.name LIKE '%street%' OR c.name LIKE '%city%' OR c.name LIKE '%zip%' OR c.name LIKE '%postal%'
THEN 'ADDRESS_INFO' -- Physical address components
-- ====================================================================
-- DEMOGRAPHIC INFORMATION (Personal Characteristics)
-- ====================================================================
WHEN c.name LIKE '%birth%' OR c.name LIKE '%dob%' OR c.name LIKE '%age%'
THEN 'DEMOGRAPHIC' -- Birth dates and age information
WHEN c.name LIKE '%gender%' OR c.name LIKE '%sex%'
THEN 'DEMOGRAPHIC' -- Gender/sex identification
WHEN c.name LIKE '%nationality%' OR c.name LIKE '%ethnicity%' OR c.name LIKE '%race%'
THEN 'DEMOGRAPHIC' -- Ethnic and national origin
WHEN c.name LIKE '%marital%' OR c.name LIKE '%religion%'
THEN 'DEMOGRAPHIC' -- Personal status and beliefs
-- ====================================================================
-- FINANCIAL INFORMATION (Monetary and Credit Data)
-- ====================================================================
WHEN c.name LIKE '%salary%' OR c.name LIKE '%wage%' OR c.name LIKE '%income%' OR c.name LIKE '%pay%'
THEN 'FINANCIAL' -- Compensation and income data
WHEN c.name LIKE '%credit%' OR c.name LIKE '%bank%' OR c.name LIKE '%account%'
THEN 'FINANCIAL' -- Banking and credit information
WHEN c.name LIKE '%tax%' OR c.name LIKE '%ssn%'
THEN 'FINANCIAL' -- Tax-related identifiers
-- ====================================================================
-- HEALTH INFORMATION (Medical and Wellness Data)
-- ====================================================================
WHEN c.name LIKE '%medical%' OR c.name LIKE '%health%' OR c.name LIKE '%diagnosis%'
THEN 'HEALTH' -- Medical records and health data
WHEN c.name LIKE '%disability%' OR c.name LIKE '%condition%' OR c.name LIKE '%treatment%'
THEN 'HEALTH' -- Health conditions and treatments
-- ====================================================================
-- SENSITIVE AUTHENTICATION AND BIOMETRIC DATA
-- ====================================================================
WHEN c.name LIKE '%password%' OR c.name LIKE '%secret%' OR c.name LIKE '%token%'
THEN 'AUTHENTICATION' -- Authentication credentials
WHEN c.name LIKE '%biometric%' OR c.name LIKE '%fingerprint%' OR c.name LIKE '%photo%'
THEN 'BIOMETRIC' -- Biometric identifiers
-- ====================================================================
-- NO SPECIFIC PII CATEGORY DETECTED
-- ====================================================================
ELSE NULL -- Column doesn't match known PII patterns
END AS pii_category,
-- ========================================================================
-- COLUMN-LEVEL PII RISK ASSESSMENT
-- ========================================================================
-- Assign risk levels based on data type and potential privacy impact
CASE
-- ====================================================================
-- HIGH RISK: Government IDs, Financial Data, Authentication
-- ====================================================================
-- These columns pose the highest privacy and security risks
WHEN c.name LIKE '%ssn%' OR c.name LIKE '%social%security%'
OR c.name LIKE '%passport%' OR c.name LIKE '%tax%id%'
OR c.name LIKE '%credit%card%' OR c.name LIKE '%bank%account%'
OR c.name LIKE '%password%' OR c.name LIKE '%secret%'
THEN 'HIGH_PII_RISK'
-- ====================================================================
-- MEDIUM RISK: Personal Identifiers, Contact Info, Demographics
-- ====================================================================
-- These columns contain personally identifiable information
WHEN c.name LIKE '%first%name%' OR c.name LIKE '%last%name%' OR c.name LIKE '%full%name%'
OR c.name LIKE '%email%' OR c.name LIKE '%phone%' OR c.name LIKE '%address%'
OR c.name LIKE '%birth%' OR c.name LIKE '%dob%'
OR c.name LIKE '%salary%' OR c.name LIKE '%wage%'
OR c.name LIKE '%medical%' OR c.name LIKE '%health%'
THEN 'MEDIUM_PII_RISK'
-- ====================================================================
-- LOW RISK: General Demographics, Preferences
-- ====================================================================
-- These columns contain less sensitive personal information
WHEN c.name LIKE '%gender%' OR c.name LIKE '%age%' OR c.name LIKE '%title%'
OR c.name LIKE '%suffix%' OR c.name LIKE '%prefix%'
OR c.name LIKE '%nationality%' OR c.name LIKE '%ethnicity%'
THEN 'LOW_PII_RISK'
-- ====================================================================
-- POTENTIAL PII: Schema-Based Detection
-- ====================================================================
-- For known sensitive schemas, flag string and date columns for manual review
WHEN s.name IN ('Person', 'HumanResources') AND
(ty.name IN ('varchar', 'nvarchar', 'char', 'nchar') OR ty.name LIKE '%date%')
THEN 'POTENTIAL_PII'
-- ====================================================================
-- NO PII DETECTED
-- ====================================================================
-- Column doesn't match any PII patterns
ELSE 'NO_PII_DETECTED'
END AS pii_risk_level,
-- ========================================================================
-- DATA SENSITIVITY CLASSIFICATION FOR BUSINESS CONTEXT
-- ========================================================================
-- Classify data sensitivity from business and regulatory perspective
CASE
-- Highly sensitive: Government IDs, financial data, authentication
WHEN c.name LIKE '%ssn%' OR c.name LIKE '%passport%' OR c.name LIKE '%tax%'
OR c.name LIKE '%credit%' OR c.name LIKE '%bank%' OR c.name LIKE '%password%'
THEN 'HIGHLY_SENSITIVE'
-- Sensitive: Personal identifiers and contact information
WHEN c.name LIKE '%name%' OR c.name LIKE '%email%' OR c.name LIKE '%phone%'
OR c.name LIKE '%address%' OR c.name LIKE '%birth%' OR c.name LIKE '%salary%'
THEN 'SENSITIVE'
-- Moderately sensitive: General demographics and preferences
WHEN c.name LIKE '%gender%' OR c.name LIKE '%age%' OR c.name LIKE '%title%'
THEN 'MODERATELY_SENSITIVE'
-- Not sensitive: System and reference data
ELSE 'NOT_SENSITIVE'
END AS data_sensitivity,
-- ========================================================================
-- REGULATORY COMPLIANCE FRAMEWORK MAPPING
-- ========================================================================
-- Map each column to relevant regulatory requirements for compliance planning
CASE
-- Government IDs and tax data: Multiple regulatory frameworks
WHEN c.name LIKE '%ssn%' OR c.name LIKE '%tax%' OR c.name LIKE '%passport%'
THEN 'GDPR, CCPA, PIPEDA, SOX, Government ID Protection'
-- Personal identifiers: Privacy regulations
WHEN c.name LIKE '%name%' OR c.name LIKE '%email%' OR c.name LIKE '%phone%' OR c.name LIKE '%address%'
THEN 'GDPR, CCPA, PIPEDA, PII Protection'
-- Demographic data: Privacy and anti-discrimination laws
WHEN c.name LIKE '%birth%' OR c.name LIKE '%age%' OR c.name LIKE '%gender%'
THEN 'GDPR, CCPA, Demographic Data Protection'
-- Financial data: Financial regulations and employment law
WHEN c.name LIKE '%salary%' OR c.name LIKE '%wage%' OR c.name LIKE '%credit%'
THEN 'SOX, Financial Data Protection, Employment Law'
-- Health data: Health information privacy laws
WHEN c.name LIKE '%medical%' OR c.name LIKE '%health%'
THEN 'HIPAA, Health Data Protection, GDPR'
-- HR schema: Employment and privacy law
WHEN s.name = 'HumanResources'
THEN 'Employment Law, GDPR, CCPA'
-- Default: Standard data protection requirements
ELSE 'Standard Data Protection'
END AS compliance_scope,
-- ========================================================================
-- ACTIONABLE RECOMMENDATIONS FOR DATA GOVERNANCE
-- ========================================================================
-- Provide specific, actionable recommendations for each column type
CASE
-- Highest security: Encryption and strict access control
WHEN c.name LIKE '%ssn%' OR c.name LIKE '%passport%' OR c.name LIKE '%credit%' OR c.name LIKE '%password%'
THEN 'ENCRYPT - Apply strong encryption and restrict access'
-- Moderate security: Data masking for non-production environments
WHEN c.name LIKE '%name%' OR c.name LIKE '%email%' OR c.name LIKE '%phone%'
THEN 'MASK - Consider data masking for non-production'
-- Access control: Limit access to authorized personnel
WHEN c.name LIKE '%address%' OR c.name LIKE '%birth%' OR c.name LIKE '%salary%'
THEN 'RESTRICT - Limit access to authorized personnel only'
-- Monitoring: Track access and usage patterns
WHEN c.name LIKE '%gender%' OR c.name LIKE '%age%' OR c.name LIKE '%title%'
THEN 'MONITOR - Track access and usage patterns'
-- Manual review: Potential PII requiring human validation
WHEN s.name IN ('Person', 'HumanResources') AND
(ty.name IN ('varchar', 'nvarchar', 'char', 'nchar') OR ty.name LIKE '%date%')
THEN 'REVIEW - Manual review recommended for PII classification'
-- Standard controls: Apply standard data governance
ELSE 'STANDARD - Apply standard data governance controls'
END AS recommended_action
-- ============================================================================
-- FROM CLAUSE: SYSTEM CATALOG JOINS FOR COLUMN METADATA
-- ============================================================================
FROM sys.schemas s
-- Core table information
JOIN sys.tables t ON s.schema_id = t.schema_id
-- Individual column details
JOIN sys.columns c ON t.object_id = c.object_id
-- Data type information with precision/scale details
JOIN sys.types ty ON c.user_type_id = ty.user_type_id
WHERE t.type = 'U' -- User tables only (exclude views, functions, etc.)
-- ============================================================================
-- ORDERING: RISK-BASED PRIORITIZATION
-- ============================================================================
-- Sort results to show highest-risk columns first for immediate attention
ORDER BY
CASE
-- Highest priority: Government IDs and financial data
WHEN c.name LIKE '%ssn%' OR c.name LIKE '%passport%' OR c.name LIKE '%credit%' THEN 1
-- High priority: Personal names and contact information
WHEN c.name LIKE '%name%' OR c.name LIKE '%email%' OR c.name LIKE '%phone%' THEN 2
-- Medium priority: Known sensitive schemas
WHEN s.name IN ('Person', 'HumanResources') THEN 3
-- Lower priority: Other columns
ELSE 4
END,
s.name, t.name, c.name; -- Alphabetical within priority groups
PRINT 'Column-level PII analysis completed!';
-- ============================================================================
-- SECTION 4: COMPREHENSIVE REPORTING AND ANALYSIS
-- ============================================================================
-- ========================================================================
-- REPORT 1: HIGH-RISK PII COLUMNS REQUIRING IMMEDIATE ATTENTION
-- ========================================================================
-- Shows columns that need immediate security and governance action
SELECT
'HIGH-RISK PII COLUMNS REQUIRING IMMEDIATE ATTENTION' AS report_section,
schema_name, -- Schema containing the column
table_name, -- Table containing the column
column_name, -- Specific column name
data_type, -- Data type for context
pii_category, -- Type of PII detected
pii_risk_level, -- Risk assessment level
recommended_action, -- Specific action to take
compliance_scope -- Regulatory requirements
FROM #ColumnLevelPII
WHERE pii_risk_level IN ('HIGH_PII_RISK', 'MEDIUM_PII_RISK') -- Focus on actionable risk levels
ORDER BY
CASE pii_risk_level WHEN 'HIGH_PII_RISK' THEN 1 ELSE 2 END, -- High risk first
schema_name, table_name, column_name; -- Alphabetical within risk level
-- ========================================================================
-- REPORT 2: PII DISTRIBUTION BY CATEGORY
-- ========================================================================
-- Provides overview of PII types found across the database
SELECT
'PII DISTRIBUTION BY CATEGORY' AS report_section,
COALESCE(pii_category, 'NO_PII') AS pii_category, -- Handle NULL categories
COUNT(*) AS column_count, -- Total columns in this category
-- Count columns by risk level within each category
COUNT(CASE WHEN pii_risk_level = 'HIGH_PII_RISK' THEN 1 END) AS high_risk_columns,
COUNT(CASE WHEN pii_risk_level = 'MEDIUM_PII_RISK' THEN 1 END) AS medium_risk_columns,
COUNT(CASE WHEN pii_risk_level = 'LOW_PII_RISK' THEN 1 END) AS low_risk_columns
FROM #ColumnLevelPII
GROUP BY pii_category
ORDER BY high_risk_columns DESC, medium_risk_columns DESC; -- Most critical categories first
-- ========================================================================
-- REPORT 3: SCHEMA-LEVEL PII RISK SUMMARY
-- ========================================================================
-- Shows PII distribution and risk levels by database schema
SELECT
'PII RISK BY SCHEMA' AS report_section,
schema_name, -- Database schema
COUNT(*) AS total_columns, -- Total columns in schema
COUNT(CASE WHEN pii_risk_level <> 'NO_PII_DETECTED' THEN 1 END) AS pii_columns, -- Columns with PII
COUNT(CASE WHEN pii_risk_level = 'HIGH_PII_RISK' THEN 1 END) AS high_risk_columns,
COUNT(CASE WHEN pii_risk_level = 'MEDIUM_PII_RISK' THEN 1 END) AS medium_risk_columns,
-- Calculate PII density percentage for each schema
CAST(COUNT(CASE WHEN pii_risk_level <> 'NO_PII_DETECTED' THEN 1 END) * 100.0 / COUNT(*) AS DECIMAL(5,1)) AS pii_percentage
FROM #ColumnLevelPII
GROUP BY schema_name
ORDER BY high_risk_columns DESC, medium_risk_columns DESC; -- Highest risk schemas first
-- ========================================================================
-- REPORT 4: COMPLETE PII INVENTORY FOR DATA CATALOG
-- ========================================================================
-- Comprehensive listing of all PII columns for data governance tools
SELECT
'DETAILED PII INVENTORY FOR DATA CATALOG' AS report_section,
full_column_name, -- Complete column identifier
data_type, -- Technical data type information
CASE WHEN is_nullable = 1 THEN 'YES' ELSE 'NO' END AS nullable, -- NULL constraint info
pii_category, -- Specific PII type
pii_risk_level, -- Risk assessment
data_sensitivity, -- Business sensitivity level
recommended_action -- Specific governance action
FROM #ColumnLevelPII
WHERE pii_risk_level <> 'NO_PII_DETECTED' -- Only show columns with detected PII
ORDER BY
-- Prioritize by risk level for action planning
CASE pii_risk_level
WHEN 'HIGH_PII_RISK' THEN 1
WHEN 'MEDIUM_PII_RISK' THEN 2
WHEN 'LOW_PII_RISK' THEN 3
ELSE 4
END,
schema_name, table_name, column_name;
-- ========================================================================
-- COMPLETION STATUS AND SUMMARY STATISTICS
-- ========================================================================
-- Provides final summary and guidance for next steps
SELECT
'COLUMN-LEVEL ANALYSIS COMPLETED' AS status,
'Results stored in: #ColumnLevelPII (session scope)' AS temp_table_location,
'Total columns analyzed: ' + CAST((SELECT COUNT(*) FROM #ColumnLevelPII) AS VARCHAR(10)) AS total_columns,
'PII columns detected: ' + CAST((SELECT COUNT(*) FROM #ColumnLevelPII WHERE pii_risk_level <> 'NO_PII_DETECTED') AS VARCHAR(10)) AS pii_columns_found;
-- ============================================================================
-- USAGE NOTES AND RECOMMENDATIONS
-- ============================================================================
-- 1. Review high-risk columns immediately for security controls
-- 2. Validate PII detection results manually - heuristic matching has limitations
-- 3. Use results to populate data catalogs and governance tools
-- 4. Implement recommended actions based on risk levels and compliance requirements
-- 5. Schedule regular re-analysis to catch schema changes and new PII
-- 6. Consider data lineage analysis for complete PII flow mapping
-- ============================================================================
Results
Again, there is a lot to go through and check!
The script returns five result sets (labelled Results 1 through Results 1 (5) in SSMS), all drawn from the #ColumnLevelPII temp table:
Results 1 - High Risk Columns requiring immediate attention
Results 1 (2) - Distribution by Category
Results 1 (3) - Risk by Schema
Results 1 (4) - Detailed PII Inventory
Results 1 (5) - Analysis completed details

Take a look at:
PII Summary - Schema
The schema-level analysis of AdventureWorks2022 reveals that 13.4% of all database columns (65 out of 486) contain personally identifiable information, with significant concentration in the Person schema (52.9% PII coverage) and HumanResources schema (40.0% PII coverage). Critical security concerns are identified in the Sales schema, which contains 5 of the 7 total high-risk columns despite having only 5.8% overall PII coverage.
Two schemas require immediate attention: Person schema due to its high PII density and Sales schema due to its concentration of high-risk financial data. The remaining schemas (Production and Purchasing) have minimal PII exposure, with less than 5% PII coverage each.
Results 1 (3) - Risk by Schema
KEY CONCERNS:
Massive Data Exposure Volume:
Person Schema: 37 PII columns affecting ~20,000 person records (52.9% of schema is PII)
Sales Schema: 5 HIGH_PII_RISK columns in financial data affecting ~19,000+ credit records
HumanResources Schema: 16 PII columns affecting 290+ employee records (40% of schema is PII)
Total Database Impact: 65 PII columns across 486 total columns (13.4% PII coverage)
Risk Concentration Issues:
71% of all HIGH_PII_RISK columns are in Sales schema (5 of 7 total)
Password table contains unencrypted or poorly protected authentication data
Credit card data spread across multiple tables (CreditCard, PersonCreditCard, SalesOrderHeader)
32 columns marked as POTENTIAL_PII requiring immediate review
Classification Gaps Identified:
Sales.SalesTaxRate.SalesTaxRateID incorrectly marked as FINANCIAL/HIGH_PII_RISK
29 columns have no PII category assigned despite being in PII-heavy schemas
ModifiedDate columns throughout Person schema marked as POTENTIAL_PII (false positives)
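Since the results live in the session-scoped #ColumnLevelPII table, the known misclassifications above can be corrected in place before the inventory is exported to a catalog. A minimal sketch, assuming the temp table from the analysis script is still in scope and uses the column names shown in the reports:

```sql
-- Sketch: correct known false positives in the session temp table
-- (assumes #ColumnLevelPII from the analysis script is still in scope)
UPDATE #ColumnLevelPII
SET pii_risk_level = 'NO_PII_DETECTED',
    pii_category   = NULL
WHERE column_name = 'ModifiedDate'  -- audit timestamps, not PII
   OR full_column_name = 'Sales.SalesTaxRate.SalesTaxRateID';  -- surrogate key, not financial data
```

Re-running Reports 1-4 after this update gives a cleaner baseline for the manual review of the remaining POTENTIAL_PII columns.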
SECURITY & COMPLIANCE RISKS:
Immediate Security Threats:
Password table with 20K+ records needs immediate encryption audit - AUTHENTICATION BREACH RISK
Credit card data in multiple tables - PCI-DSS VIOLATION RISK
Email addresses for 20K persons - PHISHING/SPAM TARGET RISK
Employee personal data - INSIDER THREAT/HR BREACH RISK
Data Concentration Vulnerabilities:
Person schema is 52.9% PII - SINGLE POINT OF FAILURE
Sales financial data not properly isolated - LATERAL MOVEMENT RISK
No apparent data masking in place - DEV/TEST ENVIRONMENT EXPOSURE
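One quick mitigation for the dev/test exposure noted above is SQL Server Dynamic Data Masking, which hides values from non-privileged readers without changing the stored data. An illustrative sketch against one of the flagged columns (permissions and unmasking rights must still be managed separately):

```sql
-- Sketch: mask email display for non-privileged users
-- (illustrative; masking is a display control, not encryption)
ALTER TABLE Person.EmailAddress
    ALTER COLUMN EmailAddress ADD MASKED WITH (FUNCTION = 'email()');
```

Note that masking does not satisfy encryption-at-rest requirements; it only reduces casual exposure in shared environments.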
Access Control Gaps:
No indication of column-level security implementation
Recommended actions show "ENCRYPT" for only 6 columns (should be more)
"RESTRICT" recommended for only 9 columns (inadequate for volume)
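Column-level security can be implemented directly with T-SQL column-scoped permissions. A minimal sketch (the role name `pii_restricted` is hypothetical; adapt to your own security model):

```sql
-- Sketch: column-level DENY for the highest-risk authentication columns
-- (role name is a hypothetical example)
CREATE ROLE pii_restricted;
DENY SELECT ON Person.Password (PasswordHash, PasswordSalt) TO pii_restricted;
```

Members of the role can still query the table's non-sensitive columns, but any SELECT touching the denied columns fails.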
DATA GOVERNANCE GAPS:
PII Detection Logic Failures:
Critical miss: Some customer-related tables may have undetected PII
Inconsistent classification between risk levels and sensitivity ratings
32 columns requiring manual review (50% of all PII columns) - indicates weak automated detection
False positives: ModifiedDate columns incorrectly flagged as POTENTIAL_PII
Remediation Tracking Issues:
Recommended actions appear incomplete or truncated
No versioning or audit trail for PII classification changes
No clear ownership assignment for remediation tasks
Monitoring Deficiencies:
Only 7 columns marked for "MONITOR" action (should be all HIGH/MEDIUM risk)
No real-time alerting mentioned for high-risk data access
No indication of existing DLP (Data Loss Prevention) controls
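As a stopgap before real monitoring tooling is in place, the recommended actions in the temp table can be broadened so every HIGH/MEDIUM risk column carries a MONITOR flag. A sketch, assuming #ColumnLevelPII and its `recommended_action` column as produced by the script:

```sql
-- Sketch: flag all HIGH/MEDIUM risk columns for monitoring
-- (appends rather than overwrites any existing recommendation)
UPDATE #ColumnLevelPII
SET recommended_action = recommended_action + '; MONITOR'
WHERE pii_risk_level IN ('HIGH_PII_RISK', 'MEDIUM_PII_RISK')
  AND recommended_action NOT LIKE '%MONITOR%';
```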
COMPLIANCE EXPOSURE:
GDPR/CCPA/PIPEDA - EXTREME RISK:
All Person schema data (37 columns, ~20K records) falls under privacy regulations
No evidence of:
Right to deletion implementation
Data portability capabilities
Consent management
Cross-border transfer controls
Potential fines: Up to 4% of global annual turnover or €20M, whichever is higher (GDPR)
PCI-DSS - CRITICAL NON-COMPLIANCE:
5 HIGH_PII_RISK financial columns identified
Credit card data in Sales.CreditCard, Sales.PersonCreditCard, Sales.SalesOrderHeader
Missing requirements:
Encryption at rest (only 6 columns marked for encryption)
Network segmentation
Access logging
Risk: Loss of payment processing capabilities
Employment Law - HIGH RISK:
HumanResources.Employee with 290 records
HumanResources.JobCandidate data exposed
Missing controls:
Role-based access
Retention policies
Cross-border restrictions
Risk: Employment litigation, regulatory penalties
SOX Compliance (if applicable):
Financial data in Sales schema
No audit trail mentioned
Missing segregation of duties
Risk: Executive personal liability
Detailed PII Inventory
The PII detection scan of the AdventureWorks2022 database has identified significant personally identifiable information across 65 columns in multiple schemas. The analysis reveals that approximately 55% of scanned columns contain confirmed or potential PII data, with 7 columns classified as high-risk requiring immediate attention for encryption and access controls.
Key findings indicate that the Person and HumanResources schemas contain the majority of sensitive data, with authentication credentials and financial information representing the highest risk categories. The organization should prioritize implementing encryption for high-risk columns and establishing comprehensive data governance controls for medium-risk data elements.
Take a look at:
Results 1 (4) - Detailed PII Inventory
Here's a summary of the analysis:
CRITICAL:
Password table with ~20,000 records containing authentication credentials (HIGH_PII_RISK)
Credit card data across multiple Sales tables affecting ~19,000+ records
Person schema is 52.9% PII - massive single point of failure
71% of all high-risk columns concentrated in Sales schema
High-Risk PII Columns Requiring Immediate Action
The following columns contain highly sensitive data requiring immediate implementation of encryption and access controls:
Authentication Data:
Person.Password.PasswordHash - User password hashes
Person.Password.PasswordSalt - Password salt values
Financial Data:
Sales.CreditCard.CreditCardID - Credit card identifiers
Sales.PersonCreditCard.CreditCardID - Person-credit card associations
Sales.SalesOrderHeader.CreditCardApprovalCode - Credit approval codes
Sales.SalesOrderHeader.CreditCardID - Order credit card references
Sales.SalesTaxRate.SalesTaxRateID - Tax rate identifiers (noted as a likely false positive in the classification gaps above)
MAJOR RISKS IDENTIFIED:
Authentication breach risk from exposed password hashes/salts
PCI-DSS violations from credit card data exposure
GDPR non-compliance affecting ~20,000 person records
Potential financial impact: $15M-$75M in breach costs and fines
IMMEDIATE ACTIONS (First 24 hours):
Freeze access to Person.Password table
Audit encryption status of all authentication data
Identify all systems accessing credit card tables
Enable comprehensive audit logging
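The audit-logging step can be started with SQL Server Audit. A hedged sketch covering the two highest-risk tables (audit names and the file path are illustrative; creating audits requires ALTER ANY SERVER AUDIT permission):

```sql
-- Sketch: server audit capturing access to the highest-risk tables
-- (names and FILEPATH are illustrative examples)
USE master;
CREATE SERVER AUDIT PII_Access_Audit
    TO FILE (FILEPATH = 'C:\Audit\');
ALTER SERVER AUDIT PII_Access_Audit WITH (STATE = ON);

USE AdventureWorks2022;
CREATE DATABASE AUDIT SPECIFICATION PII_Table_Access
    FOR SERVER AUDIT PII_Access_Audit
    ADD (SELECT, UPDATE ON OBJECT::Person.Password BY public),
    ADD (SELECT, UPDATE ON OBJECT::Sales.CreditCard BY public)
    WITH (STATE = ON);
```

Captured events can then be read back with `sys.fn_get_audit_file` for review or SIEM ingestion.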