This comprehensive workshop implements Trust Score and Sensitivity Classification calculations at three distinct levels: Schema, Table, and Column. Each level provides unique insights for different stakeholder needs and governance requirements.
Schema Level: Executive-level KPIs for departmental governance oversight
Table Level: Asset-specific trust and sensitivity scores
Column Level: Granular PII detection and field-level classification
By the end of this workshop, you will be able to:
Calculate Trust Scores (0-100) and Sensitivity Levels (HIGH/MEDIUM/LOW) at all three levels
Implement hierarchical data governance with drill-down capabilities
Generate executive dashboards with schema-level KPIs
Provide detailed column-level PII analysis for compliance
Create actionable insights for each organizational level
Why does this matter?
Data classification and sensitivity analysis are fundamental pillars of the Data Discovery phase because they establish the foundation for data governance, security, and compliance throughout your entire data ecosystem. Here's why they're so critical:
Security Risk Management: Without understanding what sensitive data exists in your AdventureWorks2022 database, you can't properly protect it. Personal information, financial data, and business-critical information each require different security controls.
Regulatory Compliance: Classification helps ensure you're meeting requirements like GDPR, CCPA, HIPAA, or industry-specific regulations. You need to know where personal data, health information, or payment card data resides to implement proper controls.
Data Governance: Classification creates a common language across your organization for discussing data assets. It enables consistent policies for data access, retention, and usage.
Access Control: You can't implement least-privilege access without knowing what data is sensitive. Classification drives your authorization models and data access policies.
Classification and Sensitivity Analysis
Classification & Sensitivity Analysis automatically evaluates and categorizes data elements based on their content, risk level, and business importance. This analysis provides the intelligence needed to implement appropriate governance controls and protect sensitive information.
Key Components
Data Classification uses pattern matching to identify data types (PII, financial, health) and assigns risk levels (Critical, High, Medium, Low). It maps data to applicable regulations like GDPR, CCPA, and HIPAA.
Business Context Analysis groups data by function (Sales, Customer, Operations) and assesses business criticality and lifecycle stage.
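The detection logic itself is straightforward name-pattern matching against the system catalog. Here is a minimal sketch of the idea (the patterns and labels are illustrative; the full scripts below use a much richer pattern set):

-- Illustrative sketch: classify column names into the risk levels described above
SELECT
s.name + '.' + t.name + '.' + c.name AS column_path,
CASE
WHEN c.name LIKE '%ssn%' OR c.name LIKE '%credit%' THEN 'Critical'
WHEN c.name LIKE '%email%' OR c.name LIKE '%phone%' THEN 'High'
WHEN c.name LIKE '%gender%' OR c.name LIKE '%age%' THEN 'Medium'
ELSE 'Low'
END AS risk_level
FROM sys.schemas s
JOIN sys.tables t ON s.schema_id = t.schema_id AND t.type = 'U'
JOIN sys.columns c ON t.object_id = c.object_id;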
Trust Score Calculation
The Trust Score (0-100) measures data reliability through a weighted formula:
Business Value (30%) - Inferred importance based on schema and naming patterns
Data Type Appropriateness (25%) - How well the data type matches the column's purpose
Constraints (25%) - Coverage of integrity controls (PK, FK, NOT NULL, defaults, checks)
Naming Quality (20%) - Adherence to naming conventions and descriptiveness
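Each analysis level adapts this idea with its own component weights: the schema-level script below weights usage patterns at 40%, while the table-level script weights data quality at 40%. As a minimal sketch of the roll-up arithmetic, using hypothetical component scores already normalized to 0-100:

-- Hypothetical component scores for a single column (illustrative values only)
DECLARE @business_value DECIMAL(5,2) = 80.0;
DECLARE @type_fit DECIMAL(5,2) = 90.0;
DECLARE @constraints DECIMAL(5,2) = 70.0;
DECLARE @naming DECIMAL(5,2) = 60.0;
SELECT CAST(
@business_value * 0.30 + -- Business Value (30%)
@type_fit * 0.25 + -- Data Type Appropriateness (25%)
@constraints * 0.25 + -- Constraints (25%)
@naming * 0.20 -- Naming Quality (20%)
AS DECIMAL(5,2)) AS column_trust_score; -- 76.00 for these inputs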
The Schema Sensitivity & Trust Scores can now be set.
| Schema | Sensitivity Level | Trust Score |
|----------------|-------------------|-------------|
| Sales | HIGH | 65 |
| Person | HIGH | 63 |
| Production | MEDIUM | 59 |
| HumanResources | HIGH | 48 |
| Purchasing | HIGH | 46 |
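These figures come from the schema-level script that follows; once it has run, the same summary can be pulled directly from the results table:

-- Reproduce the schema summary above from the analysis results
SELECT schema_name, schema_sensitivity_level, schema_trust_score
FROM #SchemaLevelAnalysis
ORDER BY schema_trust_score DESC;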
-- ============================================================================
-- MULTI-LEVEL DATA GOVERNANCE ANALYSIS FOR ADVENTUREWORKS2022
-- Schema Level | Table Level | Column Level
-- ============================================================================
USE AdventureWorks2022;
-- ============================================================================
-- SECTION 1: SCHEMA-LEVEL ANALYSIS
-- ============================================================================
-- PURPOSE: Calculate Trust Scores and Sensitivity Classifications at the schema level
-- SCOPE: Provides executive-level KPIs for departmental governance oversight
-- OUTPUT: Schema-level metrics for Pentaho Data Catalog domain management
-- ============================================================================
-- Clean up any existing temp tables to ensure fresh analysis
-- This prevents conflicts if the script is run multiple times in the same session
IF OBJECT_ID('tempdb..#SchemaLevelAnalysis') IS NOT NULL DROP TABLE #SchemaLevelAnalysis;
IF OBJECT_ID('tempdb..#TableLevelAnalysis') IS NOT NULL DROP TABLE #TableLevelAnalysis;
IF OBJECT_ID('tempdb..#ColumnLevelAnalysis') IS NOT NULL DROP TABLE #ColumnLevelAnalysis;
PRINT 'Starting Multi-Level Data Governance Analysis...';
PRINT 'Analysis timestamp: ' + CAST(GETDATE() AS VARCHAR(50));
-- ============================================================================
-- SCHEMA-LEVEL RESULTS TABLE DEFINITION
-- ============================================================================
-- This table stores comprehensive schema-level KPIs that provide executive oversight
-- and departmental governance metrics suitable for Pentaho Data Catalog integration
CREATE TABLE #SchemaLevelAnalysis (
-- ========================================================================
-- PRIMARY KEY AND IDENTIFICATION
-- ========================================================================
schema_id INT IDENTITY(1,1) PRIMARY KEY, -- Unique identifier for each schema record
schema_name NVARCHAR(128) NOT NULL, -- Database schema name (Person, Sales, etc.)
-- ========================================================================
-- FOUNDATIONAL SCHEMA METRICS
-- ========================================================================
-- These metrics provide the basic scale and scope of each schema
total_tables INT NOT NULL, -- Count of user tables in this schema
total_columns INT NOT NULL, -- Count of all columns across schema tables
total_estimated_rows BIGINT, -- Sum of estimated rows across all tables
total_size_mb DECIMAL(12,2), -- Total storage consumption in megabytes
-- ========================================================================
-- SCHEMA-LEVEL TRUST SCORE (0-100 Scale)
-- ========================================================================
-- Trust Score Components:
-- • Data Volume & Structure Health (30%): Scale and organization quality
-- • Usage Patterns (40%): How actively the schema is accessed
-- • Business Context & Organization (20%): Clarity of business purpose
-- • Data Governance Readiness (10%): Documentation and naming standards
schema_trust_score DECIMAL(5,2) NOT NULL, -- Calculated trust score (0-100)
schema_trust_level VARCHAR(20) NOT NULL, -- Categorical trust level (VERY_HIGH, HIGH, MEDIUM, LOW, VERY_LOW)
-- ========================================================================
-- SCHEMA-LEVEL SENSITIVITY CLASSIFICATION
-- ========================================================================
-- Sensitivity levels determined by:
-- • PII content density and risk level
-- • Business function criticality (HR, Sales, etc.)
-- • Regulatory compliance requirements
schema_sensitivity_level VARCHAR(10) NOT NULL, -- HIGH/MEDIUM/LOW classification
schema_sensitivity_score INT NOT NULL, -- Numerical sensitivity score (0-100)
-- ========================================================================
-- DATA QUALITY AND RISK METRICS
-- ========================================================================
-- These metrics help assess data reliability and governance needs
avg_completeness_pct DECIMAL(5,2), -- Average completeness across schema tables
pii_density_pct DECIMAL(5,2), -- Percentage of columns containing PII
high_risk_table_count INT, -- Count of tables with critical PII exposure
-- ========================================================================
-- BUSINESS CONTEXT FOR GOVERNANCE
-- ========================================================================
-- Maps technical schemas to business functions for stakeholder communication
business_domain VARCHAR(100), -- Business function description
data_owner_department VARCHAR(50), -- Responsible department/team
governance_priority VARCHAR(20), -- Priority level for governance attention
-- ========================================================================
-- COMPLIANCE AND REGULATORY REQUIREMENTS
-- ========================================================================
-- Helps legal and compliance teams understand regulatory exposure
regulatory_frameworks NVARCHAR(500), -- Applicable regulations (GDPR, CCPA, etc.)
encryption_required_tables INT, -- Count of tables requiring encryption
-- ========================================================================
-- AUDIT AND TRACKING
-- ========================================================================
analysis_timestamp DATETIME DEFAULT GETDATE() -- When this analysis was performed
);
-- ============================================================================
-- SCHEMA-LEVEL DATA COLLECTION AND CALCULATION
-- ============================================================================
-- This section uses Common Table Expressions (CTEs) to gather and calculate
-- the various components needed for schema-level trust scores and sensitivity
-- ============================================================================
-- CTE 1: FOUNDATIONAL SCHEMA METRICS
-- ============================================================================
-- Collects basic statistical information about each schema's structure and size
-- This forms the foundation for trust score calculations
WITH SchemaMetrics AS (
SELECT
s.schema_id, -- Schema system identifier
s.name AS schema_name, -- Human-readable schema name
-- STRUCTURAL METRICS: Count tables and columns for scope assessment
COUNT(DISTINCT t.object_id) AS table_count, -- Number of user tables (excludes views, system tables)
COUNT(c.column_id) AS column_count, -- Total columns across all tables in schema
-- DATA VOLUME METRICS: Assess scale and business importance
-- NOTE: Row and size totals are computed in correlated subqueries scoped to the schema;
-- summing p.rows across the join to sys.columns would multiply each table's row count
-- by its column count and inflate the totals
COALESCE((SELECT SUM(p.rows) -- Sum of estimated rows across all tables
FROM sys.tables t2
JOIN sys.partitions p ON t2.object_id = p.object_id
WHERE t2.schema_id = s.schema_id AND t2.type = 'U'
AND p.index_id IN (0,1)), 0) AS total_rows, -- Include heap (0) and clustered index (1) data
-- STORAGE METRICS: Calculate total space consumption by summing allocation pages
COALESCE((SELECT SUM(a.total_pages) * 8 / 1024.0 -- Convert 8KB pages to MB
FROM sys.tables t3
JOIN sys.partitions p2 ON t3.object_id = p2.object_id
JOIN sys.allocation_units a ON p2.partition_id = a.container_id
WHERE t3.schema_id = s.schema_id AND t3.type = 'U'
AND p2.index_id IN (0,1)), 0) AS total_size_mb
FROM sys.schemas s -- All database schemas
LEFT JOIN sys.tables t ON s.schema_id = t.schema_id AND t.type = 'U' -- Only user tables
LEFT JOIN sys.columns c ON t.object_id = c.object_id -- All columns in user tables
GROUP BY s.schema_id, s.name -- Group by schema for aggregation
),
-- CTE 2: PII DETECTION AND RISK ASSESSMENT
-- ============================================================================
-- Analyzes column names to detect potential Personally Identifiable Information (PII)
-- Uses pattern matching to classify PII risk levels for compliance assessment
SchemaPIIAnalysis AS (
SELECT
s.schema_id,
-- HIGH-RISK PII DETECTION
-- These patterns indicate the most sensitive data requiring strongest protection
COUNT(CASE
WHEN c.name LIKE '%ssn%' OR c.name LIKE '%social%' OR -- Social Security Numbers
c.name LIKE '%password%' OR c.name LIKE '%credit%' OR -- Authentication & Financial
c.name LIKE '%account%number%' OR c.name LIKE '%license%' OR -- Account Numbers & IDs
c.name LIKE '%passport%' -- Government IDs
THEN 1
END) AS high_pii_columns,
-- MEDIUM-RISK PII DETECTION
-- Personal identifiers that require significant protection but are less critical than high-risk
COUNT(CASE
WHEN c.name LIKE '%first%name%' OR c.name LIKE '%last%name%' OR -- Personal names
c.name LIKE '%email%' OR c.name LIKE '%phone%' OR -- Contact information
c.name LIKE '%address%' OR c.name LIKE '%birth%' OR -- Location & demographic
c.name LIKE '%salary%' OR c.name LIKE '%wage%' -- Financial compensation
THEN 1
END) AS medium_pii_columns,
-- LOW-RISK PII DETECTION
-- Demographic information that may have privacy implications but lower risk
COUNT(CASE
WHEN c.name LIKE '%gender%' OR c.name LIKE '%age%' OR -- Basic demographics
c.name LIKE '%title%' OR c.name LIKE '%nationality%' -- Professional/cultural info
THEN 1
END) AS low_pii_columns,
-- TOTAL COLUMN COUNT for calculating PII density percentages
COUNT(c.column_id) AS total_columns
FROM sys.schemas s
LEFT JOIN sys.tables t ON s.schema_id = t.schema_id AND t.type = 'U' -- Only user tables
LEFT JOIN sys.columns c ON t.object_id = c.object_id -- All columns
GROUP BY s.schema_id
),
-- CTE 3: USAGE PATTERN ANALYSIS
-- ============================================================================
-- Analyzes how actively each schema is being used based on SQL Server statistics
-- Usage patterns are a key indicator of data trust and business value
SchemaUsagePatterns AS (
SELECT
s.schema_id,
-- AVERAGE USAGE CALCULATION
-- Combines seeks (index lookups), scans (table scans), and lookups (key lookups)
AVG(COALESCE(us.user_seeks + us.user_scans + us.user_lookups, 0)) AS avg_usage,
-- RECENT ACCESS ANALYSIS
-- Counts tables that have been accessed within the last 30 days
-- This indicates active vs. dormant data
COUNT(CASE
WHEN DATEDIFF(day, COALESCE(us.last_user_seek, us.last_user_scan, us.last_user_lookup, '1900-01-01'), GETDATE()) <= 30
THEN 1
END) AS recently_accessed_tables,
-- TOTAL TABLE COUNT for calculating percentage of active tables
COUNT(t.object_id) AS total_tables
FROM sys.schemas s
LEFT JOIN sys.tables t ON s.schema_id = t.schema_id AND t.type = 'U' -- Only user tables
-- LEFT JOIN to usage stats allows for tables with no recorded usage (new or unused tables)
LEFT JOIN sys.dm_db_index_usage_stats us ON t.object_id = us.object_id AND us.index_id IN (0,1) AND us.database_id = DB_ID() -- Filter to current database (the DMV is server-wide)
GROUP BY s.schema_id
)
-- ============================================================================
-- MAIN INSERT STATEMENT: SCHEMA-LEVEL KPI CALCULATION
-- ============================================================================
-- This section combines all the CTE data to calculate final schema-level KPIs
-- The calculations use weighted algorithms to produce meaningful business metrics
INSERT INTO #SchemaLevelAnalysis (
schema_name, total_tables, total_columns, total_estimated_rows, total_size_mb,
schema_trust_score, schema_trust_level, schema_sensitivity_level, schema_sensitivity_score,
avg_completeness_pct, pii_density_pct, high_risk_table_count,
business_domain, data_owner_department, governance_priority,
regulatory_frameworks, encryption_required_tables
)
SELECT
sm.schema_name, -- Schema identifier for reporting
sm.table_count AS total_tables, -- Count of user tables in schema
sm.column_count AS total_columns, -- Total columns across all schema tables
sm.total_rows AS total_estimated_rows, -- Sum of estimated rows across tables
sm.total_size_mb, -- Total storage consumption in MB
-- ========================================================================
-- SCHEMA-LEVEL TRUST SCORE CALCULATION (0-100 Scale)
-- ========================================================================
-- Trust Score Algorithm combines four weighted components:
-- 1. Data Volume & Structure Health (30%): Measures schema scale and organization
-- 2. Usage Patterns (40%): Indicates business value through access frequency
-- 3. Business Context & Organization (20%): Assesses clarity of business purpose
-- 4. Data Governance Readiness (10%): Evaluates documentation and standards
CAST(
-- COMPONENT 1: DATA VOLUME & STRUCTURE HEALTH (30% of total score)
-- Larger, well-populated schemas typically indicate higher business value
-- and organizational investment, contributing to trustworthiness
(CASE
WHEN sm.table_count > 10 AND sm.total_rows > 50000 THEN 30.0 -- Large, well-populated schema
WHEN sm.table_count > 5 AND sm.total_rows > 10000 THEN 25.0 -- Medium-sized operational schema
WHEN sm.table_count > 2 AND sm.total_rows > 1000 THEN 20.0 -- Small but active schema
WHEN sm.table_count > 0 THEN 15.0 -- Minimal schema with some data
ELSE 5.0 -- Empty or nearly empty schema
END) +
-- COMPONENT 2: USAGE PATTERNS (40% of total score)
-- High usage frequency and recent access indicate active business reliance
-- This is the highest weighted component as usage reflects real business value
(CASE
-- Excellent usage: High frequency + all tables recently accessed
WHEN sup.avg_usage > 1000 AND sup.recently_accessed_tables = sup.total_tables THEN 40.0
-- Good usage: Regular access + most tables active
WHEN sup.avg_usage > 100 AND sup.recently_accessed_tables >= sup.total_tables * 0.7 THEN 30.0
-- Fair usage: Some access + half the tables active
WHEN sup.avg_usage > 10 AND sup.recently_accessed_tables >= sup.total_tables * 0.5 THEN 20.0
-- Low usage: Minimal access but some activity
WHEN sup.recently_accessed_tables > 0 THEN 15.0
-- No usage: No recorded access (potential dormant data)
ELSE 5.0
END) +
-- COMPONENT 3: BUSINESS CONTEXT & ORGANIZATION (20% of total score)
-- Well-defined business domains indicate mature data organization
-- and clear data ownership, improving trust and governance
(CASE sm.schema_name
WHEN 'Person' THEN 18.0 -- Customer data: well-organized, critical business function
WHEN 'Sales' THEN 20.0 -- Revenue operations: highest business priority
WHEN 'Production' THEN 16.0 -- Manufacturing: clear operational purpose
WHEN 'HumanResources' THEN 18.0 -- HR data: well-defined domain with clear ownership
WHEN 'Purchasing' THEN 15.0 -- Procurement: defined business function
ELSE 10.0 -- Other schemas: unclear business purpose
END) +
-- COMPONENT 4: DATA GOVERNANCE READINESS (10% of total score)
-- Schemas with clear business purpose and established patterns
-- are more ready for governance frameworks and policy application
(CASE
-- High readiness: Known business-critical schemas
WHEN sm.schema_name IN ('Person', 'Sales', 'HumanResources') THEN 10.0
-- Medium readiness: Multiple tables suggest established use
WHEN sm.table_count > 5 THEN 8.0
-- Basic readiness: Some structure exists
WHEN sm.table_count > 0 THEN 6.0
-- Low readiness: Minimal or unclear structure
ELSE 2.0
END)
AS DECIMAL(5,2)) AS schema_trust_score,
-- ========================================================================
-- SCHEMA TRUST LEVEL CLASSIFICATION
-- ========================================================================
-- Converts numerical trust scores into categorical levels for easier interpretation
-- These categories help executives and managers quickly understand schema reliability
CASE
WHEN (
-- Recalculate the same trust score for comparison against thresholds
-- This ensures consistent classification based on the calculated score
(CASE
WHEN sm.table_count > 10 AND sm.total_rows > 50000 THEN 30.0
WHEN sm.table_count > 5 AND sm.total_rows > 10000 THEN 25.0
WHEN sm.table_count > 2 AND sm.total_rows > 1000 THEN 20.0
WHEN sm.table_count > 0 THEN 15.0
ELSE 5.0
END) +
(CASE
WHEN sup.avg_usage > 1000 AND sup.recently_accessed_tables = sup.total_tables THEN 40.0
WHEN sup.avg_usage > 100 AND sup.recently_accessed_tables >= sup.total_tables * 0.7 THEN 30.0
WHEN sup.avg_usage > 10 AND sup.recently_accessed_tables >= sup.total_tables * 0.5 THEN 20.0
WHEN sup.recently_accessed_tables > 0 THEN 15.0
ELSE 5.0
END) +
(CASE sm.schema_name
WHEN 'Person' THEN 18.0
WHEN 'Sales' THEN 20.0
WHEN 'Production' THEN 16.0
WHEN 'HumanResources' THEN 18.0
WHEN 'Purchasing' THEN 15.0
ELSE 10.0
END) +
(CASE
WHEN sm.schema_name IN ('Person', 'Sales', 'HumanResources') THEN 10.0
WHEN sm.table_count > 5 THEN 8.0
WHEN sm.table_count > 0 THEN 6.0
ELSE 2.0
END)
) >= 80 THEN 'VERY_HIGH' -- 80-100: Excellent reliability, high business value
WHEN (
(CASE
WHEN sm.table_count > 10 AND sm.total_rows > 50000 THEN 30.0
WHEN sm.table_count > 5 AND sm.total_rows > 10000 THEN 25.0
WHEN sm.table_count > 2 AND sm.total_rows > 1000 THEN 20.0
WHEN sm.table_count > 0 THEN 15.0
ELSE 5.0
END) +
(CASE
WHEN sup.avg_usage > 1000 AND sup.recently_accessed_tables = sup.total_tables THEN 40.0
WHEN sup.avg_usage > 100 AND sup.recently_accessed_tables >= sup.total_tables * 0.7 THEN 30.0
WHEN sup.avg_usage > 10 AND sup.recently_accessed_tables >= sup.total_tables * 0.5 THEN 20.0
WHEN sup.recently_accessed_tables > 0 THEN 15.0
ELSE 5.0
END) +
(CASE sm.schema_name
WHEN 'Person' THEN 18.0
WHEN 'Sales' THEN 20.0
WHEN 'Production' THEN 16.0
WHEN 'HumanResources' THEN 18.0
WHEN 'Purchasing' THEN 15.0
ELSE 10.0
END) +
(CASE
WHEN sm.schema_name IN ('Person', 'Sales', 'HumanResources') THEN 10.0
WHEN sm.table_count > 5 THEN 8.0
WHEN sm.table_count > 0 THEN 6.0
ELSE 2.0
END)
) >= 65 THEN 'HIGH' -- 65-79: Good reliability, suitable for business use
WHEN (
(CASE
WHEN sm.table_count > 10 AND sm.total_rows > 50000 THEN 30.0
WHEN sm.table_count > 5 AND sm.total_rows > 10000 THEN 25.0
WHEN sm.table_count > 2 AND sm.total_rows > 1000 THEN 20.0
WHEN sm.table_count > 0 THEN 15.0
ELSE 5.0
END) +
(CASE
WHEN sup.avg_usage > 1000 AND sup.recently_accessed_tables = sup.total_tables THEN 40.0
WHEN sup.avg_usage > 100 AND sup.recently_accessed_tables >= sup.total_tables * 0.7 THEN 30.0
WHEN sup.avg_usage > 10 AND sup.recently_accessed_tables >= sup.total_tables * 0.5 THEN 20.0
WHEN sup.recently_accessed_tables > 0 THEN 15.0
ELSE 5.0
END) +
(CASE sm.schema_name
WHEN 'Person' THEN 18.0
WHEN 'Sales' THEN 20.0
WHEN 'Production' THEN 16.0
WHEN 'HumanResources' THEN 18.0
WHEN 'Purchasing' THEN 15.0
ELSE 10.0
END) +
(CASE
WHEN sm.schema_name IN ('Person', 'Sales', 'HumanResources') THEN 10.0
WHEN sm.table_count > 5 THEN 8.0
WHEN sm.table_count > 0 THEN 6.0
ELSE 2.0
END)
) >= 50 THEN 'MEDIUM' -- 50-64: Moderate reliability, may need improvement
WHEN (
(CASE
WHEN sm.table_count > 10 AND sm.total_rows > 50000 THEN 30.0
WHEN sm.table_count > 5 AND sm.total_rows > 10000 THEN 25.0
WHEN sm.table_count > 2 AND sm.total_rows > 1000 THEN 20.0
WHEN sm.table_count > 0 THEN 15.0
ELSE 5.0
END) +
(CASE
WHEN sup.avg_usage > 1000 AND sup.recently_accessed_tables = sup.total_tables THEN 40.0
WHEN sup.avg_usage > 100 AND sup.recently_accessed_tables >= sup.total_tables * 0.7 THEN 30.0
WHEN sup.avg_usage > 10 AND sup.recently_accessed_tables >= sup.total_tables * 0.5 THEN 20.0
WHEN sup.recently_accessed_tables > 0 THEN 15.0
ELSE 5.0
END) +
(CASE sm.schema_name
WHEN 'Person' THEN 18.0
WHEN 'Sales' THEN 20.0
WHEN 'Production' THEN 16.0
WHEN 'HumanResources' THEN 18.0
WHEN 'Purchasing' THEN 15.0
ELSE 10.0
END) +
(CASE
WHEN sm.schema_name IN ('Person', 'Sales', 'HumanResources') THEN 10.0
WHEN sm.table_count > 5 THEN 8.0
WHEN sm.table_count > 0 THEN 6.0
ELSE 2.0
END)
) >= 35 THEN 'LOW' -- 35-49: Low reliability, significant improvement needed
ELSE 'VERY_LOW' -- 0-34: Very poor reliability, major issues exist
END AS schema_trust_level,
-- ========================================================================
-- SCHEMA-LEVEL SENSITIVITY CLASSIFICATION (HIGH/MEDIUM/LOW)
-- ========================================================================
-- Determines data sensitivity based on PII content, business function, and regulatory risk
-- This classification drives security controls, access policies, and compliance requirements
CASE
-- HIGH SENSITIVITY: Contains critical PII or operates in highly regulated business areas
-- Triggers strongest security controls and compliance monitoring
WHEN spa.high_pii_columns > 0 OR -- Any government IDs, financial data, authentication
sm.schema_name IN ('Person', 'HumanResources') OR -- Known sensitive business domains
(spa.medium_pii_columns > 0 AND spa.medium_pii_columns * 100.0 / NULLIF(spa.total_columns, 0) > 20) -- High PII density (>20%)
THEN 'HIGH'
-- MEDIUM SENSITIVITY: Contains some PII or handles business-critical operations
-- Requires standard security controls and regular governance oversight
WHEN spa.medium_pii_columns > 0 OR -- Any personal identifiers detected
(sm.schema_name IN ('Sales', 'Production') AND sm.total_rows > 10000) OR -- Large business-critical datasets
sm.total_rows > 50000 -- Large datasets regardless of domain
THEN 'MEDIUM'
-- LOW SENSITIVITY: Reference data, system schemas, or minimal business impact
-- Standard governance controls sufficient
ELSE 'LOW'
END AS schema_sensitivity_level,
-- Schema Sensitivity Score (0-100)
CASE
-- HIGH sensitivity scoring (70-100)
WHEN spa.high_pii_columns > 0 THEN 85 + LEAST(spa.high_pii_columns * 3, 15)
WHEN sm.schema_name = 'Person' THEN 80 + (spa.medium_pii_columns * 2)
WHEN sm.schema_name = 'HumanResources' THEN 78 + (spa.medium_pii_columns * 2)
WHEN (spa.medium_pii_columns * 100.0 / NULLIF(spa.total_columns, 0)) > 20 THEN 72 + (spa.medium_pii_columns)
-- MEDIUM sensitivity scoring (40-69)
WHEN spa.medium_pii_columns > 0 THEN 45 + (spa.medium_pii_columns * 3) + (spa.low_pii_columns)
WHEN sm.schema_name = 'Sales' AND sm.total_rows > 10000 THEN 55
WHEN sm.schema_name = 'Sales' THEN 48
WHEN sm.schema_name IN ('Production', 'Purchasing') AND sm.total_rows > 10000 THEN 50
WHEN sm.schema_name IN ('Production', 'Purchasing') THEN 42
WHEN sm.total_rows > 50000 THEN 45
-- LOW sensitivity scoring (5-39)
WHEN spa.low_pii_columns > 0 THEN 25 + (spa.low_pii_columns * 2)
WHEN sm.total_rows > 1000 THEN 20
WHEN sm.table_count > 0 THEN 15
ELSE 5
END AS schema_sensitivity_score,
-- Quality Metrics
-- NOTE: Per-table completeness percentages are computed in a derived table and then
-- averaged; SQL Server cannot nest AVG() directly over COUNT()
CAST(
CASE
WHEN sm.column_count > 0
THEN (
SELECT AVG(per_table.pct)
FROM (
SELECT CAST(COUNT(CASE WHEN c.is_nullable = 0 THEN 1 END) * 100.0 / COUNT(*) AS DECIMAL(5,2)) AS pct
FROM sys.tables t2
JOIN sys.columns c ON t2.object_id = c.object_id
WHERE t2.schema_id = sm.schema_id AND t2.type = 'U'
GROUP BY t2.object_id
) AS per_table
)
ELSE 0
END
AS DECIMAL(5,2)) AS avg_completeness_pct,
-- PII Density Percentage
CAST(
CASE
WHEN spa.total_columns > 0
THEN (spa.high_pii_columns + spa.medium_pii_columns + spa.low_pii_columns) * 100.0 / spa.total_columns
ELSE 0
END
AS DECIMAL(5,2)) AS pii_density_pct,
-- High Risk Table Count (tables with high PII + large data volume)
(SELECT COUNT(*)
FROM sys.tables t3
WHERE t3.schema_id = sm.schema_id AND t3.type = 'U'
AND EXISTS (
SELECT 1 FROM sys.columns c3
WHERE c3.object_id = t3.object_id
AND (c3.name LIKE '%ssn%' OR c3.name LIKE '%password%' OR c3.name LIKE '%credit%')
)
) AS high_risk_table_count,
-- Business Context
CASE sm.schema_name
WHEN 'Person' THEN 'Customer & Identity Management'
WHEN 'Sales' THEN 'Revenue Operations & Customer Relations'
WHEN 'Production' THEN 'Manufacturing & Inventory Operations'
WHEN 'HumanResources' THEN 'Employee Lifecycle Management'
WHEN 'Purchasing' THEN 'Vendor Relations & Procurement'
ELSE 'System & Reference Data'
END AS business_domain,
CASE sm.schema_name
WHEN 'Person' THEN 'Customer Service'
WHEN 'Sales' THEN 'Sales Operations'
WHEN 'HumanResources' THEN 'Human Resources'
WHEN 'Production' THEN 'Operations'
WHEN 'Purchasing' THEN 'Procurement'
ELSE 'IT Administration'
END AS data_owner_department,
-- Governance Priority
CASE
WHEN spa.high_pii_columns > 0 OR sm.schema_name IN ('Person', 'HumanResources') THEN 'CRITICAL'
WHEN spa.medium_pii_columns > 0 OR sm.schema_name = 'Sales' THEN 'HIGH'
WHEN sm.total_rows > 10000 THEN 'MEDIUM'
ELSE 'LOW'
END AS governance_priority,
-- Regulatory Frameworks
CASE
WHEN spa.high_pii_columns > 0 OR sm.schema_name = 'Person'
THEN 'GDPR, CCPA, PIPEDA, PII Protection Laws, Data Breach Notification'
WHEN sm.schema_name = 'HumanResources'
THEN 'GDPR, CCPA, Employment Law, Wage & Hour Regulations'
WHEN sm.schema_name = 'Sales' AND spa.medium_pii_columns > 0
THEN 'GDPR, CCPA, SOX, Financial Privacy Rules'
WHEN sm.schema_name IN ('Production', 'Purchasing')
THEN 'SOX, Industry Safety Regulations, Supply Chain Compliance'
ELSE 'Standard Corporate Data Governance'
END AS regulatory_frameworks,
-- Encryption Required Tables Count
(SELECT COUNT(*)
FROM sys.tables t4
WHERE t4.schema_id = sm.schema_id AND t4.type = 'U'
AND (
sm.schema_name IN ('Person', 'HumanResources') OR
EXISTS (
SELECT 1 FROM sys.columns c4
WHERE c4.object_id = t4.object_id
AND (c4.name LIKE '%ssn%' OR c4.name LIKE '%password%' OR c4.name LIKE '%credit%')
)
)
) AS encryption_required_tables
FROM SchemaMetrics sm
LEFT JOIN SchemaPIIAnalysis spa ON sm.schema_id = spa.schema_id
LEFT JOIN SchemaUsagePatterns sup ON sm.schema_id = sup.schema_id
WHERE sm.table_count > 0 -- Only include schemas with user tables
ORDER BY schema_trust_score DESC, schema_sensitivity_score DESC;
PRINT 'Schema-level analysis completed successfully!';
-- Display Schema-Level Results
SELECT
'SCHEMA-LEVEL KPI DASHBOARD' AS report_section,
schema_name,
-- Key Metrics
total_tables,
total_columns,
FORMAT(total_estimated_rows, 'N0') AS total_estimated_rows,
FORMAT(total_size_mb, 'N2') + ' MB' AS total_size,
-- Trust Score KPIs
schema_trust_score,
schema_trust_level,
-- Sensitivity KPIs
schema_sensitivity_level,
schema_sensitivity_score,
-- Quality KPIs
FORMAT(avg_completeness_pct, 'N1') + '%' AS avg_completeness,
FORMAT(pii_density_pct, 'N1') + '%' AS pii_density,
-- Business Context
business_domain,
data_owner_department,
governance_priority,
-- Risk Indicators
high_risk_table_count,
encryption_required_tables,
-- Compliance
regulatory_frameworks
FROM #SchemaLevelAnalysis
ORDER BY schema_trust_score DESC, schema_sensitivity_score DESC;
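For a quick executive roll-up, the same results table can also be aggregated by governance priority; a minimal sketch:

-- Summarize schemas by governance priority for executive reporting
SELECT
governance_priority,
COUNT(*) AS schema_count,
CAST(AVG(schema_trust_score) AS DECIMAL(5,2)) AS avg_trust_score,
SUM(high_risk_table_count) AS high_risk_tables,
SUM(encryption_required_tables) AS tables_needing_encryption
FROM #SchemaLevelAnalysis
GROUP BY governance_priority
ORDER BY CASE governance_priority
WHEN 'CRITICAL' THEN 1 WHEN 'HIGH' THEN 2 WHEN 'MEDIUM' THEN 3 ELSE 4 END;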
Table Analysis
Table-level metrics provide asset-specific KPIs that let data stewards prioritize governance efforts by table risk.
The Table Sensitivity & Trust Scores can now be set.
| Schema | Table | Sensitivity | Trust Score |
|--------|-------|-------------|-------------|
| Sales | CountryRegionCurrency | LOW | 50 |
| Sales | CreditCard | HIGH | 90 |
| Sales | Currency | LOW | 50 |
| Sales | CurrencyRate | MEDIUM | 55 |
| Sales | Customer | HIGH | 90 |
| Sales | PersonCreditCard | HIGH | 90 |
| Sales | SalesOrderDetail | MEDIUM | 55 |
| Sales | SalesOrderHeader | HIGH | 100 |
| Sales | SalesOrderHeaderSalesReason | MEDIUM | 55 |
| Sales | SalesPerson | LOW | 50 |
| Sales | SalesPersonQuotaHistory | LOW | 50 |
| Sales | SalesReason | LOW | 50 |
| Sales | SalesTaxRate | HIGH | 90 |
| Sales | SalesTerritory | LOW | 50 |
| Sales | SalesTerritoryHistory | LOW | 50 |
| Sales | ShoppingCartItem | LOW | 50 |
| Sales | SpecialOffer | LOW | 50 |
| Sales | SpecialOfferProduct | LOW | 50 |
| Sales | Store | LOW | 50 |

| Schema | Table | Sensitivity | Trust Score |
|--------|-------|-------------|-------------|
| Person | Address | HIGH | 51 |
| Person | AddressType | MEDIUM | 47 |
| Person | BusinessEntity | LOW | 56 |
| Person | BusinessEntityAddress | HIGH | 60 |
| Person | BusinessEntityContact | LOW | 60 |
| Person | ContactType | LOW | 52 |
| Person | CountryRegion | LOW | 51 |
| Person | EmailAddress | HIGH | 53 |
| Person | Password | HIGH | 54 |
| Person | Person | HIGH | 60 |
| Person | PersonPhone | HIGH | 58 |
| Person | PhoneNumberType | MEDIUM | 48 |
| Person | StateProvince | LOW | 56 |
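As with the schema results, these listings can be regenerated from the results table once the table-level script below has completed:

-- Reproduce the Sales and Person listings above from the analysis results
SELECT schema_name, table_name, table_sensitivity_level, table_trust_score
FROM #TableLevelAnalysis
WHERE schema_name IN ('Sales', 'Person')
ORDER BY schema_name, table_name;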
-- ============================================================================
-- TABLE-LEVEL ANALYSIS
-- ============================================================================
-- PURPOSE: Calculate Trust Scores and Sensitivity Classifications at table level
-- SCOPE: Provides asset-specific KPIs for operational data management and governance
-- OUTPUT: Table-level metrics for Pentaho Data Catalog asset management
-- BUSINESS VALUE: Enables data stewards to prioritize governance efforts by table risk
-- DATABASE: AdventureWorks2022 (adaptable to any SQL Server database)
-- ============================================================================
USE AdventureWorks2022;
-- Clean up any existing temp table for fresh execution
IF OBJECT_ID('tempdb..#TableLevelAnalysis') IS NOT NULL DROP TABLE #TableLevelAnalysis;
PRINT '============================================================================';
PRINT 'STARTING TABLE-LEVEL DATA GOVERNANCE ANALYSIS';
PRINT '============================================================================';
PRINT 'Analysis timestamp: ' + CAST(GETDATE() AS VARCHAR(50));
PRINT 'Target database: ' + DB_NAME();
PRINT 'Integration target: Pentaho Data Catalog';
PRINT '';
-- ============================================================================
-- TABLE-LEVEL RESULTS TABLE DEFINITION
-- ============================================================================
-- This table stores comprehensive table-level KPIs that provide operational oversight
-- and asset-specific governance metrics suitable for Pentaho Data Catalog integration
CREATE TABLE #TableLevelAnalysis (
-- ========================================================================
-- PRIMARY KEY AND IDENTIFICATION
-- ========================================================================
table_id INT IDENTITY(1,1) PRIMARY KEY, -- Unique identifier for each table record
schema_name NVARCHAR(128) NOT NULL, -- Database schema name (Person, Sales, etc.)
table_name NVARCHAR(128) NOT NULL, -- Individual table name within schema
full_table_name NVARCHAR(257) NOT NULL, -- Fully qualified table name (schema.table)
-- ========================================================================
-- FOUNDATIONAL TABLE METRICS
-- ========================================================================
-- These metrics provide the basic scale and structure information for each table
column_count INT NOT NULL, -- Number of columns in this table
estimated_rows BIGINT, -- Approximate number of rows (from statistics)
table_size_mb DECIMAL(10,2), -- Storage space consumed by this table in MB
-- ========================================================================
-- TABLE-LEVEL TRUST SCORE (0-100 Scale)
-- ========================================================================
-- Trust Score Components for Tables:
-- • Data Quality (40%): Completeness, consistency, freshness of table data
-- • Usage Patterns (30%): How frequently and recently the table is accessed
-- • Technical Health (20%): Storage efficiency and structural appropriateness
-- • Business Context (10%): Schema-based business importance assessment
table_trust_score DECIMAL(5,2) NOT NULL, -- Calculated trust score (0-100)
table_trust_level VARCHAR(20) NOT NULL, -- Categorical trust level (VERY_HIGH, HIGH, MEDIUM, LOW, VERY_LOW)
-- ========================================================================
-- TABLE-LEVEL SENSITIVITY CLASSIFICATION
-- ========================================================================
-- Sensitivity levels determined by:
-- • PII column content analysis (high/medium/low risk columns)
-- • Business schema context (HR, Sales, Customer data)
-- • Data volume considerations (large datasets = higher impact)
table_sensitivity_level VARCHAR(10) NOT NULL, -- HIGH/MEDIUM/LOW classification
table_sensitivity_score INT NOT NULL, -- Numerical sensitivity score (0-100)
-- ========================================================================
-- DATA QUALITY COMPONENT METRICS
-- ========================================================================
-- Detailed breakdown of the quality factors that contribute to trust score
completeness_percentage DECIMAL(5,2), -- % of columns that are NOT NULL (data completeness)
consistency_score DECIMAL(5,2), -- % of columns with constraints/relationships (data integrity)
freshness_score DECIMAL(5,2), -- Recency score based on last modification date
-- ========================================================================
-- USAGE PATTERN ANALYTICS
-- ========================================================================
-- These metrics indicate business value through actual system usage
total_reads BIGINT, -- Total SELECT operations since SQL restart
total_writes BIGINT, -- Total INSERT/UPDATE/DELETE operations
days_since_last_access INT, -- Days since any read operation occurred
usage_frequency VARCHAR(20), -- Categorical usage level (VERY_HIGH to NO_USAGE)
-- ========================================================================
-- PII (PERSONALLY IDENTIFIABLE INFORMATION) ANALYSIS
-- ========================================================================
-- Granular breakdown of privacy-sensitive column counts for compliance
high_risk_pii_columns INT, -- Gov IDs, financial, authentication columns
medium_risk_pii_columns INT, -- Names, contact info, demographic columns
low_risk_pii_columns INT, -- General demographic, preference columns
pii_risk_assessment VARCHAR(20), -- Overall PII risk (CRITICAL to MINIMAL)
-- ========================================================================
-- BUSINESS CONTEXT AND CLASSIFICATION
-- ========================================================================
-- Maps technical tables to business functions for stakeholder communication
business_criticality VARCHAR(30), -- Business importance (MISSION_CRITICAL to MINIMAL)
data_classification VARCHAR(60), -- Data sensitivity classification for access control
-- ========================================================================
-- GOVERNANCE REQUIREMENTS AND CONTROLS
-- ========================================================================
-- Specific governance actions required based on sensitivity and risk analysis
encryption_required BIT DEFAULT 0, -- Whether table data should be encrypted
audit_logging_required BIT DEFAULT 0, -- Whether access should be logged
access_control_level VARCHAR(20), -- Required access restriction level
governance_recommendation NVARCHAR(500), -- Specific actionable recommendations
-- ========================================================================
-- AUDIT AND TRACKING
-- ========================================================================
analysis_timestamp DATETIME DEFAULT GETDATE() -- When this analysis was performed
);
PRINT 'Table structure created successfully for comprehensive KPI storage.';
-- ============================================================================
-- TABLE-LEVEL DATA COLLECTION AND CALCULATION
-- ============================================================================
-- This section uses Common Table Expressions (CTEs) to gather and calculate
-- the various components needed for table-level trust scores and sensitivity
-- Each CTE focuses on a specific aspect of table analysis for maintainability
-- ============================================================================
PRINT 'Starting data collection and calculation phase...';
-- CTE 1: FOUNDATIONAL TABLE METRICS
-- ============================================================================
-- Collects basic statistical and structural information about each table
-- This forms the foundation for trust score calculations and business context
WITH TableMetrics AS (
SELECT
t.object_id, -- SQL Server internal table identifier
s.name AS schema_name, -- Human-readable schema name
t.name AS table_name, -- Human-readable table name
-- STRUCTURAL METRICS: Understanding table composition
COUNT(c.column_id) AS column_count, -- Total columns (indicates complexity)
-- DATA VOLUME METRICS: Scale assessment for business importance
COALESCE(p.rows, 0) AS estimated_rows, -- Estimated row count from statistics
-- STORAGE METRICS: Calculate actual space consumption
-- This complex subquery calculates storage by summing 8KB allocation pages
COALESCE(
(SELECT SUM(a.total_pages) * 8 / 1024.0 -- Convert 8KB pages to megabytes
FROM sys.partitions p2 -- Table partitions (usually just one)
JOIN sys.allocation_units a ON p2.partition_id = a.container_id
WHERE p2.object_id = t.object_id -- Match this specific table
AND p2.index_id IN (0,1)) -- Include heap (0) and clustered index (1) data only
, 0) AS table_size_mb,
-- TEMPORAL METRICS: Track table lifecycle for freshness scoring
t.create_date, -- When table was originally created
t.modify_date -- Last time table structure was modified
FROM sys.schemas s -- All database schemas
JOIN sys.tables t ON s.schema_id = t.schema_id -- User tables within schemas
JOIN sys.columns c ON t.object_id = c.object_id -- All columns in each table (for counting)
LEFT JOIN sys.partitions p ON t.object_id = p.object_id AND p.index_id IN (0,1) -- Row estimates
WHERE t.type = 'U' -- Only user tables (excludes views, system tables)
GROUP BY t.object_id, s.name, t.name, p.rows, t.create_date, t.modify_date
),
-- CTE 2: PII (PERSONALLY IDENTIFIABLE INFORMATION) DETECTION
-- ============================================================================
-- Analyzes column names using pattern matching to detect potential PII
-- This is critical for GDPR, CCPA, and other privacy regulation compliance
TablePIIAnalysis AS (
SELECT
t.object_id,
-- HIGH-RISK PII DETECTION (Most Sensitive Data)
-- These patterns indicate data requiring the strongest protection and access controls
COUNT(CASE
WHEN c.name LIKE '%ssn%' OR c.name LIKE '%social%security%' OR -- Social Security Numbers
c.name LIKE '%password%' OR c.name LIKE '%secret%' OR -- Authentication credentials
c.name LIKE '%credit%card%' OR c.name LIKE '%account%number%' OR -- Financial account data
c.name LIKE '%license%' OR c.name LIKE '%passport%' OR -- Government-issued IDs
c.name LIKE '%tax%id%' -- Tax identification numbers
THEN 1
END) AS high_pii_columns,
-- MEDIUM-RISK PII DETECTION (Personal Identifiers)
-- Personal information that requires significant protection but less critical than high-risk
COUNT(CASE
WHEN c.name LIKE '%first%name%' OR c.name LIKE '%last%name%' OR -- Personal names
c.name LIKE '%full%name%' OR c.name LIKE '%display%name%' OR -- Combined name fields
c.name LIKE '%email%' OR c.name LIKE '%phone%' OR -- Direct contact information
c.name LIKE '%address%' OR c.name LIKE '%street%' OR -- Physical location data
c.name LIKE '%city%' OR c.name LIKE '%zip%' OR c.name LIKE '%postal%' OR -- Location components
c.name LIKE '%birth%' OR c.name LIKE '%dob%' OR -- Birth date information
c.name LIKE '%salary%' OR c.name LIKE '%wage%' OR c.name LIKE '%income%' -- Compensation data
THEN 1
END) AS medium_pii_columns,
-- LOW-RISK PII DETECTION (General Demographics)
-- Demographic information with privacy implications but lower individual risk
COUNT(CASE
WHEN c.name LIKE '%gender%' OR c.name LIKE '%sex%' OR -- Gender identification
c.name LIKE '%age%' OR c.name LIKE '%title%' OR -- Age and professional titles
c.name LIKE '%nationality%' OR c.name LIKE '%ethnicity%' OR -- National/ethnic origin
c.name LIKE '%marital%' OR c.name LIKE '%suffix%' OR c.name LIKE '%prefix%' -- Personal status
THEN 1
END) AS low_pii_columns,
-- TOTAL COLUMN COUNT for calculating PII density percentages
COUNT(c.column_id) AS total_columns
FROM sys.tables t
JOIN sys.columns c ON t.object_id = c.object_id
WHERE t.type = 'U' -- Only user tables
GROUP BY t.object_id
),
-- CTE 3: TABLE USAGE PATTERN ANALYSIS
-- ============================================================================
-- Analyzes actual table access patterns from SQL Server usage statistics
-- Usage frequency is a strong indicator of business value and data trust
TableUsageStats AS (
SELECT
t.object_id,
-- READ OPERATION METRICS: Measure query activity
COALESCE(us.user_seeks + us.user_scans + us.user_lookups, 0) AS total_reads,
-- user_seeks: Index seek operations (efficient targeted queries)
-- user_scans: Table/index scan operations (broader data access)
-- user_lookups: Key lookup operations (retrieving additional columns)
-- WRITE OPERATION METRICS: Measure data modification activity
COALESCE(us.user_updates, 0) AS total_writes,
-- user_updates: INSERT, UPDATE, DELETE operations
-- ACCESS TIMESTAMPS: Track most recent activity for freshness assessment
us.last_user_seek, -- Last index seek timestamp
us.last_user_scan, -- Last table scan timestamp
us.last_user_lookup, -- Last key lookup timestamp
-- DAYS SINCE ACCESS CALCULATION: Convert timestamps to business-friendly metrics
CASE
WHEN us.last_user_seek IS NOT NULL OR us.last_user_scan IS NOT NULL OR us.last_user_lookup IS NOT NULL
THEN DATEDIFF(day, COALESCE(us.last_user_seek, us.last_user_scan, us.last_user_lookup), GETDATE())
ELSE NULL -- NULL indicates no recorded access since SQL restart
END AS days_since_access
FROM sys.tables t
-- LEFT JOIN allows for tables with no usage statistics (new or unused tables)
LEFT JOIN sys.dm_db_index_usage_stats us ON t.object_id = us.object_id AND us.index_id IN (0,1) AND us.database_id = DB_ID() -- Filter to current database (the DMV is server-wide)
WHERE t.type = 'U' -- Only user tables
),
-- CTE 4: TABLE DATA QUALITY ASSESSMENT
-- ============================================================================
-- Evaluates data quality indicators that contribute to trustworthiness
-- Quality assessment combines structural integrity, constraints, and maintenance
TableQualityMetrics AS (
SELECT
t.object_id,
-- COMPLETENESS METRIC: Measures data density and required field coverage
-- Higher percentage of NOT NULL columns indicates better data completeness
-- DISTINCT column counts guard against row multiplication from the constraint joins below
CAST(COUNT(DISTINCT CASE WHEN c.is_nullable = 0 THEN c.column_id END) * 100.0 / COUNT(DISTINCT c.column_id) AS DECIMAL(5,2)) AS completeness_pct,
-- CONSISTENCY METRIC: Measures data integrity controls and relationships
-- Combines identity columns, foreign keys, and check constraints as quality indicators
CAST(
(COUNT(DISTINCT CASE WHEN c.is_identity = 1 THEN c.column_id END) + -- Auto-increment columns (system-managed)
COUNT(DISTINCT CASE WHEN fkc.parent_column_id IS NOT NULL THEN c.column_id END) + -- Columns in foreign key relationships
COUNT(DISTINCT cc.object_id) -- Check constraint validations
) * 100.0 / COUNT(DISTINCT c.column_id)
AS DECIMAL(5,2)) AS consistency_score,
-- FRESHNESS METRIC: Measures how recently the table structure was maintained
-- Recent modifications indicate active maintenance and data currency
CASE
WHEN DATEDIFF(day, t.modify_date, GETDATE()) <= 7 THEN 100.0 -- Modified within last week
WHEN DATEDIFF(day, t.modify_date, GETDATE()) <= 30 THEN 80.0 -- Modified within last month
WHEN DATEDIFF(day, t.modify_date, GETDATE()) <= 90 THEN 60.0 -- Modified within last quarter
WHEN DATEDIFF(day, t.modify_date, GETDATE()) <= 365 THEN 40.0 -- Modified within last year
WHEN DATEDIFF(day, t.modify_date, GETDATE()) <= 730 THEN 20.0 -- Modified within last 2 years
ELSE 10.0 -- Very stale (2+ years old)
END AS freshness_score
FROM sys.tables t
JOIN sys.columns c ON t.object_id = c.object_id
-- LEFT JOINs handle tables without foreign keys or check constraints gracefully
LEFT JOIN sys.foreign_key_columns fkc ON c.object_id = fkc.parent_object_id AND c.column_id = fkc.parent_column_id
LEFT JOIN sys.check_constraints cc ON t.object_id = cc.parent_object_id
GROUP BY t.object_id, t.modify_date
)
-- ============================================================================
-- MAIN INSERT STATEMENT: TABLE-LEVEL KPI CALCULATION
-- ============================================================================
-- This section combines all CTE data to calculate comprehensive table-level KPIs
-- The calculations use weighted algorithms to produce actionable business metrics
-- NOTE: The CTE chain above must be followed immediately by the INSERT statement,
-- so no PRINT can be placed between the final CTE and the INSERT below
INSERT INTO #TableLevelAnalysis (
schema_name, table_name, full_table_name, column_count, estimated_rows, table_size_mb,
table_trust_score, table_trust_level, table_sensitivity_level, table_sensitivity_score,
completeness_percentage, consistency_score, freshness_score,
total_reads, total_writes, days_since_last_access, usage_frequency,
high_risk_pii_columns, medium_risk_pii_columns, low_risk_pii_columns, pii_risk_assessment,
business_criticality, data_classification, encryption_required, audit_logging_required,
access_control_level, governance_recommendation
)
SELECT
-- ========================================================================
-- BASIC TABLE IDENTIFICATION
-- ========================================================================
tm.schema_name, -- Schema identifier for business context
tm.table_name, -- Table name for asset identification
tm.schema_name + '.' + tm.table_name AS full_table_name, -- Fully qualified name for referencing
tm.column_count, -- Structural complexity indicator
tm.estimated_rows, -- Data volume indicator
tm.table_size_mb, -- Storage consumption indicator
-- ========================================================================
-- TABLE-LEVEL TRUST SCORE CALCULATION (0-100 Scale)
-- ========================================================================
-- Trust Score Algorithm combines four weighted components optimized for table-level analysis:
-- 1. Data Quality Component (40%): Completeness + Consistency + Freshness
-- 2. Usage Patterns Component (30%): Access frequency and recency indicate business value
-- 3. Technical Health Component (20%): Storage efficiency and structural appropriateness
-- 4. Business Context Component (10%): Schema-based importance weighting
CAST(
-- COMPONENT 1: DATA QUALITY (40% of total score)
-- Combines three quality dimensions with internal weighting
(
(tqm.completeness_pct * 0.4) + -- Completeness: 40% of quality (16% of total)
(tqm.consistency_score * 0.35) + -- Consistency: 35% of quality (14% of total)
(tqm.freshness_score * 0.25) -- Freshness: 25% of quality (10% of total)
) * 0.4 +
-- COMPONENT 2: USAGE PATTERNS (30% of total score)
-- Recent and frequent access indicates active business reliance and data value
(CASE
-- Excellent usage: High volume + very recent access (within 1 week)
WHEN tus.total_reads > 10000 AND ISNULL(tus.days_since_access, 999) <= 7 THEN 30.0
-- Good usage: Regular access + recent activity (within 1 month)
WHEN tus.total_reads > 1000 AND ISNULL(tus.days_since_access, 999) <= 30 THEN 25.0
-- Fair usage: Moderate access + quarterly activity (within 3 months)
WHEN tus.total_reads > 100 AND ISNULL(tus.days_since_access, 999) <= 90 THEN 20.0
-- Low usage: Some access + semi-annual activity (within 6 months)
WHEN tus.total_reads > 10 AND ISNULL(tus.days_since_access, 999) <= 180 THEN 15.0
-- Minimal usage: Very low access but some activity recorded
WHEN tus.total_reads > 0 THEN 10.0
-- No usage: No recorded access since SQL Server restart
ELSE 5.0
END) +
-- COMPONENT 3: TECHNICAL HEALTH (20% of total score)
-- Assesses storage efficiency and size appropriateness for the data volume
(CASE
-- Excellent health: Appropriate size + efficient storage
WHEN tm.estimated_rows BETWEEN 100 AND 10000000 AND tm.table_size_mb < 5000 THEN 20.0
-- Good health: Reasonable size + acceptable storage
WHEN tm.estimated_rows BETWEEN 10 AND 100000000 AND tm.table_size_mb < 10000 THEN 16.0
-- Fair health: Has data but may have storage issues
WHEN tm.estimated_rows > 0 THEN 12.0
-- Poor health: Empty table or storage problems
ELSE 8.0
END) +
-- COMPONENT 4: BUSINESS CONTEXT (10% of total score)
-- Schema-based importance weighting reflects known business priorities
(CASE tm.schema_name
WHEN 'Sales' THEN 10.0 -- Revenue operations: highest business priority
WHEN 'Person' THEN 9.0 -- Customer data: critical for customer service
WHEN 'HumanResources' THEN 9.0 -- HR data: important for employee management
WHEN 'Production' THEN 8.0 -- Manufacturing: operational importance
WHEN 'Purchasing' THEN 7.0 -- Procurement: supporting business function
ELSE 6.0 -- Other schemas: standard business importance
END)
AS DECIMAL(5,2)) AS table_trust_score,
-- ========================================================================
-- TABLE TRUST LEVEL CLASSIFICATION
-- ========================================================================
-- Converts numerical trust scores into categorical levels for executive reporting
-- These categories help business users quickly understand table reliability without numbers
CASE
WHEN (
-- Recalculate trust score for threshold comparison (ensures consistency)
(
(tqm.completeness_pct * 0.4) +
(tqm.consistency_score * 0.35) +
(tqm.freshness_score * 0.25)
) * 0.4 +
(CASE
WHEN tus.total_reads > 10000 AND ISNULL(tus.days_since_access, 999) <= 7 THEN 30.0
WHEN tus.total_reads > 1000 AND ISNULL(tus.days_since_access, 999) <= 30 THEN 25.0
WHEN tus.total_reads > 100 AND ISNULL(tus.days_since_access, 999) <= 90 THEN 20.0
WHEN tus.total_reads > 10 AND ISNULL(tus.days_since_access, 999) <= 180 THEN 15.0
WHEN tus.total_reads > 0 THEN 10.0
ELSE 5.0
END) +
(CASE
WHEN tm.estimated_rows BETWEEN 100 AND 10000000 AND tm.table_size_mb < 5000 THEN 20.0
WHEN tm.estimated_rows BETWEEN 10 AND 100000000 AND tm.table_size_mb < 10000 THEN 16.0
WHEN tm.estimated_rows > 0 THEN 12.0
ELSE 8.0
END) +
(CASE tm.schema_name
WHEN 'Sales' THEN 10.0
WHEN 'Person' THEN 9.0
WHEN 'HumanResources' THEN 9.0
WHEN 'Production' THEN 8.0
WHEN 'Purchasing' THEN 7.0
ELSE 6.0
END)
) >= 80 THEN 'VERY_HIGH' -- 80-100: Exceptional reliability, mission-critical quality
WHEN (
(
(tqm.completeness_pct * 0.4) +
(tqm.consistency_score * 0.35) +
(tqm.freshness_score * 0.25)
) * 0.4 +
(CASE
WHEN tus.total_reads > 10000 AND ISNULL(tus.days_since_access, 999) <= 7 THEN 30.0
WHEN tus.total_reads > 1000 AND ISNULL(tus.days_since_access, 999) <= 30 THEN 25.0
WHEN tus.total_reads > 100 AND ISNULL(tus.days_since_access, 999) <= 90 THEN 20.0
WHEN tus.total_reads > 10 AND ISNULL(tus.days_since_access, 999) <= 180 THEN 15.0
WHEN tus.total_reads > 0 THEN 10.0
ELSE 5.0
END) +
(CASE
WHEN tm.estimated_rows BETWEEN 100 AND 10000000 AND tm.table_size_mb < 5000 THEN 20.0
WHEN tm.estimated_rows BETWEEN 10 AND 100000000 AND tm.table_size_mb < 10000 THEN 16.0
WHEN tm.estimated_rows > 0 THEN 12.0
ELSE 8.0
END) +
(CASE tm.schema_name
WHEN 'Sales' THEN 10.0
WHEN 'Person' THEN 9.0
WHEN 'HumanResources' THEN 9.0
WHEN 'Production' THEN 8.0
WHEN 'Purchasing' THEN 7.0
ELSE 6.0
END)
) >= 65 THEN 'HIGH' -- 65-79: Good reliability, suitable for business decisions
WHEN (
(
(tqm.completeness_pct * 0.4) +
(tqm.consistency_score * 0.35) +
(tqm.freshness_score * 0.25)
) * 0.4 +
(CASE
WHEN tus.total_reads > 10000 AND ISNULL(tus.days_since_access, 999) <= 7 THEN 30.0
WHEN tus.total_reads > 1000 AND ISNULL(tus.days_since_access, 999) <= 30 THEN 25.0
WHEN tus.total_reads > 100 AND ISNULL(tus.days_since_access, 999) <= 90 THEN 20.0
WHEN tus.total_reads > 10 AND ISNULL(tus.days_since_access, 999) <= 180 THEN 15.0
WHEN tus.total_reads > 0 THEN 10.0
ELSE 5.0
END) +
(CASE
WHEN tm.estimated_rows BETWEEN 100 AND 10000000 AND tm.table_size_mb < 5000 THEN 20.0
WHEN tm.estimated_rows BETWEEN 10 AND 100000000 AND tm.table_size_mb < 10000 THEN 16.0
WHEN tm.estimated_rows > 0 THEN 12.0
ELSE 8.0
END) +
(CASE tm.schema_name
WHEN 'Sales' THEN 10.0
WHEN 'Person' THEN 9.0
WHEN 'HumanResources' THEN 9.0
WHEN 'Production' THEN 8.0
WHEN 'Purchasing' THEN 7.0
ELSE 6.0
END)
) >= 50 THEN 'MEDIUM' -- 50-64: Moderate reliability, acceptable for most uses
WHEN (
(
(tqm.completeness_pct * 0.4) +
(tqm.consistency_score * 0.35) +
(tqm.freshness_score * 0.25)
) * 0.4 +
(CASE
WHEN tus.total_reads > 10000 AND ISNULL(tus.days_since_access, 999) <= 7 THEN 30.0
WHEN tus.total_reads > 1000 AND ISNULL(tus.days_since_access, 999) <= 30 THEN 25.0
WHEN tus.total_reads > 100 AND ISNULL(tus.days_since_access, 999) <= 90 THEN 20.0
WHEN tus.total_reads > 10 AND ISNULL(tus.days_since_access, 999) <= 180 THEN 15.0
WHEN tus.total_reads > 0 THEN 10.0
ELSE 5.0
END) +
(CASE
WHEN tm.estimated_rows BETWEEN 100 AND 10000000 AND tm.table_size_mb < 5000 THEN 20.0
WHEN tm.estimated_rows BETWEEN 10 AND 100000000 AND tm.table_size_mb < 10000 THEN 16.0
WHEN tm.estimated_rows > 0 THEN 12.0
ELSE 8.0
END) +
(CASE tm.schema_name
WHEN 'Sales' THEN 10.0
WHEN 'Person' THEN 9.0
WHEN 'HumanResources' THEN 9.0
WHEN 'Production' THEN 8.0
WHEN 'Purchasing' THEN 7.0
ELSE 6.0
END)
) >= 35 THEN 'LOW' -- 35-49: Low reliability, improvement needed before critical use
ELSE 'VERY_LOW' -- 0-34: Very poor reliability, significant issues present
END AS table_trust_level,
-- ========================================================================
-- TABLE-LEVEL SENSITIVITY CLASSIFICATION (HIGH/MEDIUM/LOW)
-- ========================================================================
-- Determines data sensitivity based on PII content analysis, business function, and impact scale
-- This classification drives security controls, access policies, and compliance requirements
CASE
-- HIGH SENSITIVITY: Contains critical PII or sensitive schemas with significant exposure
-- Triggers strongest security controls and regulatory compliance monitoring
WHEN tpia.high_pii_columns > 0 OR -- Any government IDs, financial, auth data
(tm.schema_name IN ('Person', 'HumanResources') AND tpia.medium_pii_columns > 1) OR -- HR/Customer with multiple PII columns
(tpia.medium_pii_columns > 3 AND tm.estimated_rows > 1000) -- High PII density with significant volume
THEN 'HIGH'
-- MEDIUM SENSITIVITY: Contains some PII or handles business-critical operations with scale
-- Requires standard security controls and regular governance oversight
WHEN tpia.medium_pii_columns > 0 OR -- Any personal identifiers present
(tm.schema_name = 'Sales' AND tm.estimated_rows > 5000) OR -- Large sales datasets (business critical)
(tm.schema_name IN ('Production', 'Purchasing') AND tm.estimated_rows > 10000) OR -- Large operational datasets
tpia.low_pii_columns > 2 -- Multiple demographic columns
THEN 'MEDIUM'
-- LOW SENSITIVITY: Minimal PII content, reference data, or small operational tables
-- Standard governance controls sufficient
ELSE 'LOW'
END AS table_sensitivity_level,
-- ========================================================================
-- TABLE SENSITIVITY SCORE (0-100)
-- ========================================================================
-- Numerical sensitivity score combining PII risk levels with business context and data volume
-- This score enables algorithmic processing and comparative analysis across tables
CASE
-- CRITICAL sensitivity scoring (85-100): Government IDs, financial data, authentication
-- These tables pose maximum privacy risk and require strongest controls
WHEN tpia.high_pii_columns > 0 THEN 85 + LEAST(tpia.high_pii_columns * 5, 15) -- Base 85 + up to 15 bonus
-- HIGH sensitivity scoring (70-84): Personal data in sensitive business contexts
-- Tables with significant personal data exposure in HR or customer domains
        WHEN tm.schema_name = 'Person' AND tpia.medium_pii_columns > 1 THEN 75 + LEAST(tpia.medium_pii_columns * 3, 9) -- Capped at 84 so the score stays within the HIGH band
        WHEN tm.schema_name = 'HumanResources' AND tpia.medium_pii_columns > 1 THEN 73 + LEAST(tpia.medium_pii_columns * 3, 11) -- Capped at 84
        WHEN tpia.medium_pii_columns > 3 AND tm.estimated_rows > 1000 THEN 70 + LEAST(tpia.medium_pii_columns * 2, 14) -- Capped at 84
-- MEDIUM sensitivity scoring (40-69): General personal data or business-critical information
-- Standard business data with some privacy implications or significant business value
        WHEN tpia.medium_pii_columns > 0 THEN 45 + LEAST((tpia.medium_pii_columns * 4) + tpia.low_pii_columns, 24) -- Capped at 69 so the score stays within the MEDIUM band
WHEN tm.schema_name = 'Sales' AND tm.estimated_rows > 5000 THEN 55
WHEN tm.schema_name = 'Sales' THEN 50
WHEN tm.schema_name IN ('Production', 'Purchasing') AND tm.estimated_rows > 10000 THEN 52
WHEN tm.schema_name IN ('Production', 'Purchasing') THEN 45
        WHEN tpia.low_pii_columns > 2 THEN 40 + LEAST(tpia.low_pii_columns * 2, 29) -- Capped at 69
-- LOW sensitivity scoring (5-39): Reference data, operational tables, minimal risk
-- Standard governance controls adequate for these tables
WHEN tpia.low_pii_columns > 0 THEN 25 + (tpia.low_pii_columns * 3)
WHEN tm.estimated_rows > 10000 THEN 30 -- Large reference tables
WHEN tm.estimated_rows > 1000 THEN 20 -- Medium reference tables
WHEN tm.estimated_rows > 0 THEN 15 -- Small operational tables
ELSE 5 -- Empty or minimal tables
END AS table_sensitivity_score,
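        -- WORKED EXAMPLE (hypothetical): a table with 2 high-risk PII columns
        -- scores 85 + LEAST(2*5, 15) = 95 (CRITICAL band); a Person table with
        -- 3 medium-risk PII columns scores 75 + LEAST(3*3, 9) = 84, the top of
        -- the HIGH band.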
-- ========================================================================
-- DETAILED QUALITY METRICS (From CTE Calculations)
-- ========================================================================
-- Provide transparency into the components that make up the trust score
tqm.completeness_pct AS completeness_percentage, -- % of NOT NULL columns
tqm.consistency_score, -- % of columns with integrity controls
tqm.freshness_score, -- Recency-based score (0-100)
-- ========================================================================
-- USAGE PATTERN METRICS (From CTE Calculations)
-- ========================================================================
-- Actual system usage statistics that indicate business value and active reliance
tus.total_reads, -- Total SELECT operations since restart
tus.total_writes, -- Total DML operations since restart
tus.days_since_access AS days_since_last_access, -- Business-friendly access recency
-- USAGE FREQUENCY CLASSIFICATION
-- Converts numerical usage into business-friendly categories for reporting
CASE
WHEN tus.total_reads > 10000 AND ISNULL(tus.days_since_access, 999) <= 7 THEN 'VERY_HIGH' -- Mission-critical access
WHEN tus.total_reads > 1000 AND ISNULL(tus.days_since_access, 999) <= 30 THEN 'HIGH' -- Regular business use
WHEN tus.total_reads > 100 AND ISNULL(tus.days_since_access, 999) <= 90 THEN 'MEDIUM' -- Periodic access
WHEN tus.total_reads > 10 AND ISNULL(tus.days_since_access, 999) <= 180 THEN 'LOW' -- Occasional use
WHEN tus.total_reads > 0 THEN 'VERY_LOW' -- Minimal access
ELSE 'NO_USAGE' -- Dormant table
END AS usage_frequency,
-- ========================================================================
-- PII ANALYSIS BREAKDOWN (From CTE Calculations)
-- ========================================================================
-- Detailed breakdown of privacy-sensitive column counts for compliance reporting
tpia.high_pii_columns AS high_risk_pii_columns, -- Government IDs, financial, auth (highest risk)
tpia.medium_pii_columns AS medium_risk_pii_columns, -- Names, contact, demographic (significant risk)
tpia.low_pii_columns AS low_risk_pii_columns, -- General demographic (lower risk)
-- PII RISK ASSESSMENT CLASSIFICATION
-- Overall privacy risk assessment combining all PII column types
CASE
WHEN tpia.high_pii_columns > 0 THEN 'CRITICAL' -- Any high-risk PII = critical concern
WHEN tpia.medium_pii_columns > 2 THEN 'HIGH' -- Multiple personal identifiers
WHEN tpia.medium_pii_columns > 0 THEN 'MEDIUM' -- Some personal identifiers
WHEN tpia.low_pii_columns > 0 THEN 'LOW' -- Only demographic data
ELSE 'MINIMAL' -- No detected PII
END AS pii_risk_assessment,
-- ========================================================================
-- BUSINESS CONTEXT AND IMPORTANCE ASSESSMENT
-- ========================================================================
-- Maps technical characteristics to business language for stakeholder communication
-- BUSINESS CRITICALITY based on schema context and data volume
CASE
WHEN tm.schema_name = 'Sales' AND tm.estimated_rows > 10000 THEN 'MISSION_CRITICAL' -- Large revenue data
WHEN tm.schema_name IN ('Person', 'Sales') AND tm.estimated_rows > 1000 THEN 'BUSINESS_CRITICAL' -- Core business data
WHEN tm.schema_name IN ('HumanResources', 'Production') AND tm.estimated_rows > 1000 THEN 'OPERATIONAL' -- Key operations
WHEN tm.estimated_rows > 1000 THEN 'SUPPORTING' -- Supporting functions
WHEN tm.estimated_rows > 0 THEN 'REFERENCE' -- Reference/lookup data
ELSE 'MINIMAL' -- Empty or test tables
END AS business_criticality,
-- DATA CLASSIFICATION for access control and handling procedures
CASE
WHEN tpia.high_pii_columns > 0 OR tm.schema_name = 'Person'
THEN 'HIGHLY_CONFIDENTIAL - Personal/Sensitive Data' -- Strongest protection required
WHEN tpia.medium_pii_columns > 0 OR tm.schema_name IN ('HumanResources', 'Sales')
THEN 'CONFIDENTIAL - Business Sensitive' -- Strong protection required
WHEN tm.schema_name IN ('Production', 'Purchasing')
THEN 'INTERNAL - Operational Data' -- Internal use only
ELSE 'PUBLIC - Reference Data' -- Standard access controls
END AS data_classification,
-- ========================================================================
-- GOVERNANCE REQUIREMENTS AND CONTROLS
-- ========================================================================
-- Specific technical controls required based on sensitivity and compliance needs
-- ENCRYPTION REQUIREMENT FLAG
-- Determines if table data should be encrypted at rest or in transit
CASE
WHEN tpia.high_pii_columns > 0 OR tm.schema_name IN ('Person', 'HumanResources') THEN 1
ELSE 0
END AS encryption_required,
-- AUDIT LOGGING REQUIREMENT FLAG
-- Determines if access to this table should be logged for compliance
CASE
WHEN tpia.high_pii_columns > 0 OR -- Any critical PII
tpia.medium_pii_columns > 1 OR -- Multiple personal identifiers
tm.schema_name IN ('Person', 'HumanResources') -- Sensitive business domains
THEN 1
ELSE 0
END AS audit_logging_required,
-- ACCESS CONTROL LEVEL
-- Determines the appropriate level of access restrictions
CASE
WHEN tpia.high_pii_columns > 0 THEN 'RESTRICTED' -- Highest security, minimal access
WHEN tpia.medium_pii_columns > 0 OR tm.schema_name IN ('Person', 'HumanResources', 'Sales')
THEN 'CONTROLLED' -- Role-based access control
WHEN tm.estimated_rows > 10000 OR tm.schema_name IN ('Production', 'Purchasing')
THEN 'MANAGED' -- Standard business controls
ELSE 'STANDARD' -- Default access policies
END AS access_control_level,
-- ========================================================================
-- GOVERNANCE RECOMMENDATIONS
-- ========================================================================
-- Simplified, actionable recommendation based on overall risk assessment
CASE
WHEN tpia.high_pii_columns > 0 THEN 'URGENT: Implement field-level encryption and restrict access immediately.'
WHEN tm.schema_name IN ('Person', 'HumanResources') AND tpia.medium_pii_columns > 1
THEN 'PRIORITY: Apply comprehensive data protection controls and access monitoring.'
WHEN tpia.medium_pii_columns > 0 THEN 'RECOMMENDED: Consider data masking and appropriate access controls.'
WHEN tm.estimated_rows = 0 THEN 'REVIEW: Empty table - consider removal if no longer needed.'
WHEN ISNULL(tus.days_since_access, 999) > 90 THEN 'REVIEW: Low usage - consider archival or lifecycle management.'
ELSE 'STANDARD: Apply standard governance practices and periodic monitoring.'
END AS governance_recommendation
-- ========================================================================
-- DATA SOURCE INTEGRATION
-- ========================================================================
-- Combine all CTE results to create comprehensive table-level analysis
FROM TableMetrics tm
JOIN TablePIIAnalysis tpia ON tm.object_id = tpia.object_id
JOIN TableUsageStats tus ON tm.object_id = tus.object_id
JOIN TableQualityMetrics tqm ON tm.object_id = tqm.object_id;
PRINT 'Table-level KPI calculations completed successfully!';
DECLARE @analyzed_tables INT = (SELECT COUNT(*) FROM #TableLevelAnalysis); -- PRINT cannot embed a subquery, so capture the count first
PRINT 'Analyzed ' + CAST(@analyzed_tables AS VARCHAR(10)) + ' tables across all schemas.';
PRINT '';
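-- Optional sanity check (illustrative, not part of the core workshop flow):
-- confirm every calculated score landed inside the documented 0-100 range.
SELECT COUNT(*) AS out_of_range_rows
FROM #TableLevelAnalysis
WHERE table_trust_score NOT BETWEEN 0 AND 100
   OR table_sensitivity_score NOT BETWEEN 0 AND 100;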
-- ============================================================================
-- TABLE-LEVEL RESULTS DISPLAY AND REPORTING
-- ============================================================================
-- Comprehensive dashboard showing table-level KPIs optimized for governance teams
-- This section formats the calculated metrics for executive and operational reporting
PRINT 'Generating comprehensive TABLE-LEVEL KPI DASHBOARD...';
PRINT '';
SELECT
'TABLE-LEVEL KPI DASHBOARD' AS report_section,
schema_name,
table_name,
-- ========================================================================
-- FOUNDATIONAL TABLE CHARACTERISTICS
-- ========================================================================
-- Basic table information that provides context for all other metrics
column_count, -- Structural complexity indicator
FORMAT(estimated_rows, 'N0') AS estimated_rows, -- Data volume (formatted with commas)
FORMAT(table_size_mb, 'N2') + ' MB' AS table_size, -- Storage consumption (formatted to 2 decimals)
-- ========================================================================
-- TRUST SCORE KPIs (Primary Reliability Indicators)
-- ========================================================================
-- These metrics help data consumers understand data reliability at a glance
table_trust_score, -- Numerical score (0-100) for algorithms
table_trust_level, -- Categorical level for business users
-- ========================================================================
-- SENSITIVITY KPIs (Primary Risk Indicators)
-- ========================================================================
-- These metrics drive security policies and access control decisions
table_sensitivity_level, -- HIGH/MEDIUM/LOW classification
table_sensitivity_score, -- Numerical score (0-100) for algorithms
pii_risk_assessment, -- CRITICAL/HIGH/MEDIUM/LOW/MINIMAL privacy risk
-- ========================================================================
-- DATA QUALITY KPIs (Trust Score Components)
-- ========================================================================
-- Detailed breakdown showing why the trust score is high or low
FORMAT(completeness_percentage, 'N1') + '%' AS completeness, -- % NOT NULL columns (data density)
FORMAT(consistency_score, 'N1') + '%' AS consistency, -- % columns with integrity controls
FORMAT(freshness_score, 'N1') + '%' AS freshness, -- Recency score (recent updates = higher score)
-- ========================================================================
-- USAGE PATTERN KPIs (Business Value Indicators)
-- ========================================================================
-- These metrics show actual business reliance and system integration
FORMAT(total_reads, 'N0') AS total_reads, -- Total queries since restart (formatted)
FORMAT(total_writes, 'N0') AS total_writes, -- Total modifications since restart (formatted)
    ISNULL(CAST(days_since_last_access AS VARCHAR(10)) + ' days', 'Never') AS last_access, -- Human-readable access recency ('Never' when no access recorded)
usage_frequency, -- Categorical usage level for quick assessment
-- ========================================================================
-- PII ANALYSIS BREAKDOWN (Compliance Detail)
-- ========================================================================
-- Granular privacy risk breakdown for compliance teams and legal review
high_risk_pii_columns, -- Count of government ID, financial, auth columns
medium_risk_pii_columns, -- Count of name, contact, demographic columns
low_risk_pii_columns, -- Count of general demographic columns
-- ========================================================================
-- BUSINESS CONTEXT AND CLASSIFICATION
-- ========================================================================
-- Maps technical data to business language for stakeholder communication
business_criticality, -- Business importance level
-- Format boolean flags as human-readable YES/NO for executive reports
CASE WHEN encryption_required = 1 THEN 'YES' ELSE 'NO' END AS encryption_required,
CASE WHEN audit_logging_required = 1 THEN 'YES' ELSE 'NO' END AS audit_logging_required,
access_control_level, -- Required access restriction level
-- ========================================================================
-- ACTIONABLE GOVERNANCE RECOMMENDATIONS
-- ========================================================================
governance_recommendation -- Specific actions for data stewards
FROM #TableLevelAnalysis
-- ========================================================================
-- RESULT ORDERING FOR GOVERNANCE PRIORITIZATION
-- ========================================================================
-- Sort results to show highest-risk and most important tables first
-- This ordering helps governance teams focus on the most critical items
ORDER BY
-- Primary sort: Sensitivity level (highest risk gets immediate attention)
CASE table_sensitivity_level
WHEN 'HIGH' THEN 1 -- High sensitivity tables reviewed first
WHEN 'MEDIUM' THEN 2 -- Medium sensitivity tables reviewed second
ELSE 3 -- Low sensitivity tables reviewed last
END,
-- Secondary sort: Trust score (lowest trust scores need most attention within same sensitivity)
table_trust_score ASC, -- Lower scores = less reliable = higher priority for improvement
-- Tertiary sort: Table size (larger tables have more business impact within same risk/trust)
table_size_mb DESC; -- Larger tables affect more data and more users
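-- Optional drill-down (illustrative): re-run the dashboard for a single schema
-- when a governance team owns one business domain, e.g. Person:
-- SELECT schema_name, table_name, table_trust_score, table_sensitivity_level,
--        pii_risk_assessment, governance_recommendation
-- FROM #TableLevelAnalysis
-- WHERE schema_name = 'Person'
-- ORDER BY table_sensitivity_score DESC, table_trust_score ASC;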
-- ============================================================================
-- SUMMARY STATISTICS AND ANALYSIS INSIGHTS
-- ============================================================================
PRINT '';
PRINT '============================================================================';
PRINT 'TABLE-LEVEL ANALYSIS SUMMARY STATISTICS';
PRINT '============================================================================';
SELECT
'ANALYSIS SUMMARY' AS summary_section,
COUNT(*) AS total_tables_analyzed,
-- Trust Score Distribution
COUNT(CASE WHEN table_trust_level = 'VERY_HIGH' THEN 1 END) AS very_high_trust_tables,
COUNT(CASE WHEN table_trust_level = 'HIGH' THEN 1 END) AS high_trust_tables,
COUNT(CASE WHEN table_trust_level = 'MEDIUM' THEN 1 END) AS medium_trust_tables,
COUNT(CASE WHEN table_trust_level IN ('LOW', 'VERY_LOW') THEN 1 END) AS low_trust_tables,
-- Sensitivity Distribution
COUNT(CASE WHEN table_sensitivity_level = 'HIGH' THEN 1 END) AS high_sensitivity_tables,
COUNT(CASE WHEN table_sensitivity_level = 'MEDIUM' THEN 1 END) AS medium_sensitivity_tables,
COUNT(CASE WHEN table_sensitivity_level = 'LOW' THEN 1 END) AS low_sensitivity_tables,
-- Risk Indicators
COUNT(CASE WHEN pii_risk_assessment = 'CRITICAL' THEN 1 END) AS critical_pii_risk_tables,
COUNT(CASE WHEN encryption_required = 1 THEN 1 END) AS tables_requiring_encryption,
COUNT(CASE WHEN audit_logging_required = 1 THEN 1 END) AS tables_requiring_audit_logging,
-- Average Scores
FORMAT(AVG(table_trust_score), 'N1') AS avg_trust_score,
FORMAT(AVG(table_sensitivity_score), 'N1') AS avg_sensitivity_score
FROM #TableLevelAnalysis;
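-- Schema-level rollup (illustrative): aggregates the table KPIs by schema so
-- the results can be cross-checked against the schema-level scores computed
-- earlier in the workshop.
SELECT
    'SCHEMA ROLLUP' AS rollup_section,
    schema_name,
    COUNT(*) AS tables_analyzed,
    FORMAT(AVG(table_trust_score), 'N1') AS avg_trust_score,
    COUNT(CASE WHEN table_sensitivity_level = 'HIGH' THEN 1 END) AS high_sensitivity_tables,
    COUNT(CASE WHEN encryption_required = 1 THEN 1 END) AS tables_requiring_encryption
FROM #TableLevelAnalysis
GROUP BY schema_name
ORDER BY AVG(table_trust_score) DESC;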
-- ============================================================================
-- HIGH PRIORITY ITEMS REQUIRING IMMEDIATE ATTENTION
-- ============================================================================
SELECT
'HIGH PRIORITY ITEMS FOR IMMEDIATE REVIEW' AS priority_section,
schema_name + '.' + table_name AS table_name,
table_sensitivity_level,
table_trust_level,
pii_risk_assessment,
business_criticality,
CASE WHEN encryption_required = 1 THEN 'YES' ELSE 'NO' END AS encryption_needed,
governance_recommendation
FROM #TableLevelAnalysis
WHERE table_sensitivity_level = 'HIGH'
OR pii_risk_assessment = 'CRITICAL'
OR (table_sensitivity_level = 'MEDIUM' AND table_trust_level IN ('LOW', 'VERY_LOW'))
ORDER BY
CASE table_sensitivity_level WHEN 'HIGH' THEN 1 ELSE 2 END,
CASE pii_risk_assessment WHEN 'CRITICAL' THEN 1 WHEN 'HIGH' THEN 2 ELSE 3 END,
table_trust_score ASC;
-- ============================================================================
-- ANALYSIS COMPLETION AND INTEGRATION GUIDANCE
-- ============================================================================
PRINT '';
PRINT '============================================================================';
PRINT 'TABLE-LEVEL ANALYSIS COMPLETED SUCCESSFULLY!';
PRINT '============================================================================';
PRINT 'Analysis completion timestamp: ' + CAST(GETDATE() AS VARCHAR(50));
PRINT '';
PRINT 'RESULTS SUMMARY:';
PRINT '• Trust scores calculated using 4-component weighted algorithm';
PRINT '• Sensitivity levels: HIGH/MEDIUM/LOW with detailed PII risk assessment';
PRINT '• Quality metrics: Completeness, consistency, freshness percentages';
PRINT '• Usage patterns: Access frequency and business value indicators';
PRINT '• Governance requirements: Encryption, audit logging, access control levels';
PRINT '';
PRINT 'NEXT STEPS FOR DATA GOVERNANCE:';
PRINT '1. Review HIGH PRIORITY ITEMS table above for immediate actions';
PRINT '2. Implement encryption for tables marked as encryption_required = YES';
PRINT '3. Establish audit logging for tables marked as audit_logging_required = YES';
PRINT '4. Address low trust scores through data quality improvement initiatives';
PRINT '5. Apply appropriate access controls based on access_control_level classifications';
PRINT '';
PRINT 'PENTAHO DATA CATALOG INTEGRATION:';
PRINT '• Use table_trust_score directly as Trust Score (0-100 scale)';
PRINT '• Map table_sensitivity_level: HIGH=3, MEDIUM=2, LOW=1 for Sensitivity metric';
PRINT '• Apply pii_risk_assessment as PII Risk tagging';
PRINT '• Use business_criticality for Business Impact classification';
PRINT '• Implement governance_recommendation as automated policy suggestions';
PRINT '';
PRINT 'Data available in #TableLevelAnalysis temp table for further analysis.';
PRINT '============================================================================';
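-- Optional persistence (illustrative): #TableLevelAnalysis disappears when the
-- session ends. If the scores feed an external catalog such as Pentaho Data
-- Catalog, snapshot them to a permanent table first (table name is an example):
-- SELECT GETDATE() AS snapshot_date, * INTO dbo.TableLevelAnalysis_Snapshot
-- FROM #TableLevelAnalysis;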
Column Analysis
Now we're getting into the weeds!
Column-level metrics provide the most granular view: per-column PII detection, quality scoring, and classification detail that supports compliance review and day-to-day data stewardship.
-- ============================================================================
-- COMPLETE DATABASE COLUMN ANALYSIS - ALL COLUMNS
-- ============================================================================
-- PURPOSE: This script performs comprehensive analysis of EVERY column in EVERY table
-- across ALL schemas in the database. Unlike the PII-focused script,
-- this analyzes all columns regardless of whether they contain PII.
--
-- METHODOLOGY:
-- - Uses multiple CTEs to break down complex analysis into logical components
-- - Applies universal scoring algorithms to all columns (no filtering)
-- - Combines technical, business, and governance perspectives
-- - Provides actionable recommendations for every column
--
-- SCOPE:
-- - All user tables in all schemas (excludes system tables and views)
-- - Every column regardless of data type or naming convention
-- - Complete technical metadata analysis
-- - Universal quality and trust scoring (0-100 scale)
-- - Comprehensive governance recommendations
--
-- SCORING METHODOLOGY:
-- - Overall Score: Weighted combination of naming (20%) + data type (25%) + constraints (25%) + business value (30%)
-- - Technical Score: Data type appropriateness + implementation quality
-- - Structural Score: Constraints, keys, and relationships
-- - Naming Score: Convention adherence and descriptiveness
-- - Business Score: Inferred importance and domain relevance
--
-- OUTPUTS:
-- 1. #CompleteColumnAnalysis - Every column with full analysis (main results table)
-- 2. Executive Summary - High-level statistics and KPIs
-- 3. Schema-level Breakdown - Analysis grouped by schema
-- 4. Top Priority Report - Urgent items requiring immediate attention
-- 5. Data Type Distribution - Technical metadata summary
-- ============================================================================
-- ============================================================================
-- CREATE COMPREHENSIVE RESULTS TABLE
-- ============================================================================
-- This table will store the complete analysis results for every column in the database
-- Each column gets a comprehensive evaluation across multiple dimensions
CREATE TABLE #CompleteColumnAnalysis (
-- Primary Key and Identification
analysis_id INT IDENTITY(1,1) PRIMARY KEY, -- Unique identifier for each analysis record
-- ========================================================================
-- COLUMN IDENTIFICATION SECTION
-- ========================================================================
-- Basic information to uniquely identify each column in the database
database_name NVARCHAR(128) NOT NULL, -- Database name (for multi-DB scenarios)
schema_name NVARCHAR(128) NOT NULL, -- Schema containing the table
table_name NVARCHAR(128) NOT NULL, -- Table containing the column
column_name NVARCHAR(128) NOT NULL, -- Column name within the table
ordinal_position INT NOT NULL, -- Column position in table (1, 2, 3, etc.)
    full_column_name NVARCHAR(520) NOT NULL, -- Fully qualified name (DB.Schema.Table.Column; sized for four 128-character identifiers plus separators)
-- ========================================================================
-- COMPLETE TECHNICAL METADATA SECTION
-- ========================================================================
-- Comprehensive technical details about each column's data type and properties
data_type_category VARCHAR(50) NOT NULL, -- Categorized data type (TEXT, INTEGER, NUMERIC, etc.)
base_data_type NVARCHAR(128) NOT NULL, -- Raw SQL Server data type (varchar, int, datetime, etc.)
formatted_data_type NVARCHAR(200) NOT NULL, -- Type with length/precision (varchar(50), decimal(10,2))
    max_length INT, -- Maximum length in bytes (n-types use 2 bytes per character; -1 = MAX)
precision_value TINYINT, -- Numeric precision (total digits)
scale_value TINYINT, -- Numeric scale (decimal places)
character_set_name NVARCHAR(128), -- Character set for text types (ASCII/UNICODE)
collation_name NVARCHAR(128), -- Collation for text types (sorting rules)
-- ========================================================================
-- COLUMN PROPERTIES SECTION
-- ========================================================================
-- Binary flags indicating various column characteristics and constraints
-- Every column gets analyzed for these properties regardless of type or purpose
is_nullable BIT NOT NULL, -- 1 = allows NULL values, 0 = NOT NULL constraint
is_identity BIT NOT NULL, -- 1 = identity/auto-increment column
is_computed BIT NOT NULL, -- 1 = computed column (formula-based)
is_primary_key BIT NOT NULL, -- 1 = part of primary key constraint
is_foreign_key BIT NOT NULL, -- 1 = part of foreign key constraint
is_unique_key BIT NOT NULL, -- 1 = part of unique constraint (non-PK)
is_indexed BIT NOT NULL, -- 1 = has index (any type) for performance
has_default_constraint BIT NOT NULL, -- 1 = has default value constraint
has_check_constraint BIT NOT NULL, -- 1 = has check constraint (business rules)
-- ========================================================================
-- UNIVERSAL QUALITY SCORING SECTION (0-100 scale for ALL columns)
-- ========================================================================
-- Multi-dimensional quality assessment applied to every column
overall_column_score DECIMAL(5,2) NOT NULL, -- Composite quality score (weighted average of all factors)
technical_quality_score DECIMAL(5,2) NOT NULL, -- Technical implementation quality (data types, structure)
structural_integrity_score DECIMAL(5,2) NOT NULL, -- Constraints and relationships quality
naming_quality_score DECIMAL(5,2) NOT NULL, -- Naming convention adherence and clarity
business_value_score DECIMAL(5,2) NOT NULL, -- Inferred business importance and relevance
-- ========================================================================
-- QUALITY LEVEL CLASSIFICATIONS
-- ========================================================================
-- Categorical interpretations of the numeric scores for easier understanding
overall_quality_level VARCHAR(20) NOT NULL, -- EXCELLENT(90+)/GOOD(75-89)/FAIR(60-74)/POOR(<60)
technical_quality_level VARCHAR(20) NOT NULL, -- Technical implementation assessment
structural_quality_level VARCHAR(20) NOT NULL, -- Constraint and relationship assessment
naming_quality_level VARCHAR(20) NOT NULL, -- Naming convention assessment
-- ========================================================================
-- UNIVERSAL SENSITIVITY ANALYSIS SECTION
-- ========================================================================
-- Security and privacy assessment applied to ALL columns (not just obvious PII)
data_sensitivity_category VARCHAR(30) NOT NULL, -- HIGHLY_SENSITIVE/SENSITIVE/MODERATELY_SENSITIVE/LOW_SENSITIVITY/PUBLIC
potential_pii_indicator VARCHAR(50), -- Detected PII category (if any pattern matches)
pii_risk_assessment VARCHAR(30) NOT NULL, -- CRITICAL_RISK/HIGH_RISK/MEDIUM_RISK/LOW_RISK/MINIMAL_RISK
business_criticality VARCHAR(20) NOT NULL, -- CRITICAL/HIGH/MEDIUM/STANDARD (business importance)
-- ========================================================================
-- COMPLETE GOVERNANCE ANALYSIS
-- ========================================================================
-- Data governance recommendations and requirements for every column
data_classification VARCHAR(50) NOT NULL, -- RESTRICTED/CONFIDENTIAL/SENSITIVE/BUSINESS_CRITICAL/INTERNAL
access_control_recommendation VARCHAR(30) NOT NULL, -- HIGHLY_RESTRICTED/RESTRICTED/CONTROLLED/MANAGED/STANDARD
encryption_recommendation VARCHAR(30) NOT NULL, -- MANDATORY/REQUIRED/RECOMMENDED/NOT_REQUIRED
masking_recommendation VARCHAR(30) NOT NULL, -- REQUIRED/RECOMMENDED/NOT_REQUIRED (for non-prod environments)
monitoring_priority VARCHAR(20) NOT NULL, -- CONTINUOUS/HIGH/MEDIUM/STANDARD (access monitoring level)
-- ========================================================================
-- BUSINESS CONTEXT ANALYSIS
-- ========================================================================
-- Inferred business purpose and domain classification for each column
inferred_purpose VARCHAR(200) NOT NULL, -- Likely business purpose based on naming and context
domain_category VARCHAR(50) NOT NULL, -- Business domain (SALES_REVENUE, CUSTOMER_DATA, etc.)
data_lifecycle_stage VARCHAR(30) NOT NULL, -- ACTIVE_TRANSACTIONAL/ACTIVE_REFERENCE/HISTORICAL_ARCHIVE/TEMPORARY
-- ========================================================================
-- COMPLIANCE AND REGULATORY SECTION
-- ========================================================================
-- Legal and regulatory considerations for each column
regulatory_scope NVARCHAR(500) NOT NULL, -- Applicable regulations (GDPR, CCPA, HIPAA, etc.)
retention_category VARCHAR(50) NOT NULL, -- Data retention classification and requirements
compliance_risk_level VARCHAR(20) NOT NULL, -- Regulatory compliance risk assessment
-- ========================================================================
-- ACTIONABLE RECOMMENDATIONS SECTION
-- ========================================================================
-- Specific, actionable guidance for data stewards and administrators
immediate_actions VARCHAR(300), -- Urgent actions needed (if any) - NULL if none required
recommended_improvements VARCHAR(300), -- Suggested improvements for better governance/quality
governance_priority VARCHAR(20) NOT NULL, -- URGENT/HIGH/MEDIUM/LOW (priority for governance attention)
steward_assignment VARCHAR(50) NOT NULL, -- Suggested data steward type/role for ownership
-- ========================================================================
-- ANALYSIS METADATA
-- ========================================================================
-- Information about when and how the analysis was performed
analysis_timestamp DATETIME DEFAULT GETDATE(), -- When this analysis was run
analysis_version VARCHAR(10) DEFAULT '1.0' -- Version of the analysis methodology
);
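-- Optional (illustrative): on databases with tens of thousands of columns, an
-- index on the reporting keys speeds up the summary queries that follow.
-- CREATE NONCLUSTERED INDEX IX_CCA_Schema_Table
--     ON #CompleteColumnAnalysis (schema_name, table_name);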
-- ============================================================================
-- COMPREHENSIVE COLUMN ANALYSIS - MAIN QUERY WITH MULTIPLE CTEs
-- ============================================================================
-- Using Common Table Expressions (CTEs) to break down the complex analysis
-- into manageable, logical components. Each CTE focuses on a specific aspect
-- of the analysis, making the code more readable and maintainable.
WITH
-- ============================================================================
-- CTE 1: COMPLETE COLUMN INVENTORY
-- ============================================================================
-- PURPOSE: Catalog EVERY column in EVERY user table across ALL schemas
-- SCOPE: Universal - no filtering, every column included regardless of purpose
-- METHOD: Joins system catalog views to extract complete technical metadata
-- OUTPUT: Base dataset with all columns and their technical characteristics
AllColumnsInventory AS (
SELECT
-- Core identifiers for joining with other CTEs
c.object_id, -- Table object ID (internal SQL Server identifier)
c.column_id, -- Column ID within the table (internal identifier)
-- Human-readable identifiers
DB_NAME() as database_name, -- Current database name
s.name AS schema_name, -- Schema name (dbo, Sales, Person, etc.)
t.name AS table_name, -- Table name
c.name AS column_name, -- Column name
c.column_id AS ordinal_position, -- Position of column in table definition
-- Fully qualified name for unique identification across database
DB_NAME() + '.' + s.name + '.' + t.name + '.' + c.name AS full_column_name,
-- ====================================================================
-- COMPLETE TECHNICAL DETAILS FOR EVERY COLUMN
-- ====================================================================
ty.name AS base_data_type, -- Base SQL Server data type name
c.max_length, -- Maximum length (relevant for strings/binary)
c.precision AS precision_value, -- Numeric precision (total significant digits)
c.scale AS scale_value, -- Numeric scale (digits after decimal point)
c.is_nullable, -- NULL constraint indicator
c.is_identity, -- Identity column indicator
c.is_computed, -- Computed column indicator
c.collation_name, -- Text collation (sorting/comparison rules)
-- ====================================================================
-- ENHANCED DATA TYPE CATEGORIZATION FOR ALL TYPES
-- ====================================================================
-- Groups SQL Server's many data types into logical categories
-- This helps with standardized analysis across different specific types
CASE
-- Text/String data types - for names, descriptions, codes
WHEN ty.name IN ('char', 'varchar', 'text', 'nchar', 'nvarchar', 'ntext') THEN 'TEXT'
-- Integer data types - for counts, IDs, flags
WHEN ty.name IN ('int', 'bigint', 'smallint', 'tinyint') THEN 'INTEGER'
-- Decimal/Floating point data types - for money, measurements, calculations
WHEN ty.name IN ('decimal', 'numeric', 'money', 'smallmoney', 'float', 'real') THEN 'NUMERIC'
-- Date/Time data types - for timestamps, dates, durations
WHEN ty.name IN ('date', 'datetime', 'datetime2', 'smalldatetime', 'time', 'datetimeoffset') THEN 'TEMPORAL'
-- Boolean data types - for true/false, yes/no flags
WHEN ty.name = 'bit' THEN 'BOOLEAN'
-- Unique identifier data types - for GUIDs, unique keys
WHEN ty.name = 'uniqueidentifier' THEN 'IDENTIFIER'
-- Binary data types - for images, documents, encrypted data
WHEN ty.name IN ('binary', 'varbinary', 'image') THEN 'BINARY'
-- Structured data types - for XML documents
WHEN ty.name = 'xml' THEN 'STRUCTURED'
-- Spatial data types - for geographic/geometric data
WHEN ty.name IN ('geography', 'geometry') THEN 'SPATIAL'
-- Hierarchical data types - for organizational structures
WHEN ty.name = 'hierarchyid' THEN 'HIERARCHICAL'
-- System data types - for internal SQL Server use
WHEN ty.name IN ('sql_variant', 'timestamp', 'rowversion') THEN 'SYSTEM'
-- Catch-all for any other data types
ELSE 'OTHER'
END AS data_type_category,
-- ====================================================================
-- FORMATTED DATA TYPE WITH FULL SPECIFICATIONS
-- ====================================================================
-- Creates human-readable data type strings with length/precision info
-- Examples: varchar(50), decimal(10,2), datetime2(7)
ty.name +
CASE
            -- String types: show length or MAX (sys.columns stores nchar/nvarchar lengths in bytes, so halve them)
            WHEN ty.name IN ('varchar', 'char') THEN
                '(' + CASE WHEN c.max_length = -1 THEN 'MAX' ELSE CAST(c.max_length AS VARCHAR(10)) END + ')'
            WHEN ty.name IN ('nvarchar', 'nchar') THEN
                '(' + CASE WHEN c.max_length = -1 THEN 'MAX' ELSE CAST(c.max_length / 2 AS VARCHAR(10)) END + ')'
-- Numeric types: show precision and scale
WHEN ty.name IN ('decimal', 'numeric') THEN
'(' + CAST(c.precision AS VARCHAR(10)) + ',' + CAST(c.scale AS VARCHAR(10)) + ')'
-- Float: show precision if not default
WHEN ty.name IN ('float') THEN
CASE WHEN c.precision != 53 THEN '(' + CAST(c.precision AS VARCHAR(10)) + ')' ELSE '' END
-- DateTime types: show scale (fractional seconds precision)
WHEN ty.name IN ('datetime2', 'time', 'datetimeoffset') THEN
'(' + CAST(c.scale AS VARCHAR(10)) + ')'
-- Other types: no additional formatting needed
ELSE ''
END AS formatted_data_type,
-- ====================================================================
-- CHARACTER SET INFORMATION FOR TEXT COLUMNS
-- ====================================================================
-- Identifies whether text columns use ASCII or Unicode encoding
CASE
WHEN ty.name IN ('char', 'varchar', 'text') THEN 'ASCII' -- Single-byte character sets
WHEN ty.name IN ('nchar', 'nvarchar', 'ntext') THEN 'UNICODE' -- Multi-byte Unicode support
ELSE NULL -- Not applicable to non-text types
END AS character_set_name
FROM sys.schemas s
JOIN sys.tables t ON s.schema_id = t.schema_id -- Join schemas to tables
JOIN sys.columns c ON t.object_id = c.object_id -- Join tables to columns
    JOIN sys.types ty ON c.system_type_id = ty.system_type_id
        AND ty.user_type_id = ty.system_type_id -- Resolve alias types (e.g., AdventureWorks' Name, Flag) to their base system types
WHERE t.type = 'U' -- Only user tables (excludes system tables, views, etc.)
),
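-- ILLUSTRATIVE OUTPUT (assuming the base-type join and byte-length handling
-- above): Person.Person.FirstName, declared via the alias type Name, surfaces
-- as base_data_type = 'nvarchar', formatted_data_type = 'nvarchar(50)',
-- data_type_category = 'TEXT', character_set_name = 'UNICODE'.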
-- ============================================================================
-- CTE 2: COMPREHENSIVE CONSTRAINT ANALYSIS
-- ============================================================================
-- PURPOSE: Analyze ALL types of constraints for EVERY column
-- SCOPE: Primary keys, foreign keys, unique constraints, indexes, defaults, checks
-- METHOD: Multiple LEFT JOINs to constraint system views to detect all constraint types
-- OUTPUT: Binary flags and scoring for constraint coverage per column
CompleteConstraintAnalysis AS (
SELECT
c.object_id, -- Table identifier for joining
c.column_id, -- Column identifier for joining
-- ====================================================================
-- CONSTRAINT DETECTION FLAGS
-- ====================================================================
-- Each flag indicates whether the column participates in that constraint type
-- Primary Key Detection
-- Checks if column is part of the table's primary key constraint
CASE WHEN pk.column_id IS NOT NULL THEN 1 ELSE 0 END AS is_primary_key,
-- Foreign Key Detection
-- Checks if column references another table (referential integrity)
CASE WHEN fk.parent_column_id IS NOT NULL THEN 1 ELSE 0 END AS is_foreign_key,
-- Unique Constraint Detection (excluding primary key)
-- Checks for unique constraints that aren't the primary key
CASE WHEN uk.column_id IS NOT NULL THEN 1 ELSE 0 END AS is_unique_key,
-- Index Detection (any type of index)
-- Checks if column has any index for performance optimization
CASE WHEN idx.column_id IS NOT NULL THEN 1 ELSE 0 END AS is_indexed,
-- Default Constraint Detection
-- Checks if column has a default value specified
CASE WHEN dc.parent_column_id IS NOT NULL THEN 1 ELSE 0 END AS has_default_constraint,
-- Check Constraint Detection (table-level, conservative approach)
-- Checks if table has check constraints (business rule validation)
CASE WHEN cc.parent_object_id IS NOT NULL THEN 1 ELSE 0 END AS has_check_constraint,
-- ====================================================================
-- CONSTRAINT SCORE CALCULATION (0-100+ points possible)
-- ====================================================================
-- Weighted scoring system that rewards different types of constraints
-- Higher scores indicate better data integrity and reliability
(
-- Primary Key: 25 points (highest value - uniqueness + not null guaranteed)
CASE WHEN pk.column_id IS NOT NULL THEN 25 ELSE 0 END +
-- Foreign Key: 20 points (referential integrity, relationships)
CASE WHEN fk.parent_column_id IS NOT NULL THEN 20 ELSE 0 END +
-- Unique Constraint: 15 points (data uniqueness without being PK)
CASE WHEN uk.column_id IS NOT NULL THEN 15 ELSE 0 END +
-- Default Constraint: 10 points (prevents null insertion issues)
CASE WHEN dc.parent_column_id IS NOT NULL THEN 10 ELSE 0 END +
-- Check Constraint: 10 points (business rule enforcement)
CASE WHEN cc.parent_object_id IS NOT NULL THEN 10 ELSE 0 END +
-- Index: 10 points (performance optimization, indicates importance)
CASE WHEN idx.column_id IS NOT NULL THEN 10 ELSE 0 END +
-- Identity: 15 points (system-managed, guaranteed uniqueness)
CASE WHEN c.is_identity = 1 THEN 15 ELSE 0 END +
-- NOT NULL: 5 points (basic data quality requirement)
CASE WHEN c.is_nullable = 0 THEN 5 ELSE 0 END
) AS total_constraint_score
FROM sys.columns c
-- ====================================================================
-- PRIMARY KEY DETECTION SUBQUERY
-- ====================================================================
-- Identifies columns that are part of primary key constraints
LEFT JOIN (
SELECT ic.object_id, ic.column_id
FROM sys.indexes i
JOIN sys.index_columns ic ON i.object_id = ic.object_id AND i.index_id = ic.index_id
WHERE i.is_primary_key = 1 -- Only primary key indexes
) pk ON c.object_id = pk.object_id AND c.column_id = pk.column_id
-- ====================================================================
-- FOREIGN KEY DETECTION
-- ====================================================================
-- Identifies columns that reference other tables (parent side of FK relationship)
LEFT JOIN sys.foreign_key_columns fk ON c.object_id = fk.parent_object_id
AND c.column_id = fk.parent_column_id
-- ====================================================================
-- UNIQUE CONSTRAINT DETECTION SUBQUERY
-- ====================================================================
-- Identifies columns with unique constraints (excluding primary keys)
LEFT JOIN (
SELECT ic.object_id, ic.column_id
FROM sys.indexes i
JOIN sys.index_columns ic ON i.object_id = ic.object_id AND i.index_id = ic.index_id
WHERE i.is_unique = 1 AND i.is_primary_key = 0 -- Unique but not primary key
) uk ON c.object_id = uk.object_id AND c.column_id = uk.column_id
-- ====================================================================
-- INDEX DETECTION SUBQUERY
-- ====================================================================
-- Identifies columns that have any type of index (for performance)
LEFT JOIN (
SELECT DISTINCT ic.object_id, ic.column_id
FROM sys.indexes i
JOIN sys.index_columns ic ON i.object_id = ic.object_id AND i.index_id = ic.index_id
-- Includes all index types: clustered, nonclustered, unique, etc.
) idx ON c.object_id = idx.object_id AND c.column_id = idx.column_id
-- ====================================================================
-- DEFAULT CONSTRAINT DETECTION
-- ====================================================================
-- Identifies columns that have default value constraints
LEFT JOIN sys.default_constraints dc ON c.object_id = dc.parent_object_id
AND c.column_id = dc.parent_column_id
-- ====================================================================
-- CHECK CONSTRAINT DETECTION (TABLE LEVEL)
-- ====================================================================
-- Conservative approach: if table has check constraints, gives credit to all columns
-- More sophisticated column-specific detection would require parsing constraint definitions
LEFT JOIN (
SELECT DISTINCT parent_object_id
FROM sys.check_constraints
) cc ON c.object_id = cc.parent_object_id
),
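-- WORKED EXAMPLE (hypothetical): an identity primary-key column scores
-- 25 (PK) + 10 (the PK's index also satisfies the index check) + 15 (identity)
-- + 5 (NOT NULL) = 55 points; an unconstrained nullable column scores 0.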
-- ============================================================================
-- CTE 3: UNIVERSAL NAMING AND PATTERN ANALYSIS
-- ============================================================================
-- PURPOSE: Analyze naming patterns and conventions for ALL columns
-- SCOPE: Every column gets naming quality assessment and pattern detection
-- METHOD: Pattern matching, length analysis, convention checking, PII detection
-- OUTPUT: Naming quality scores, PII risk assessment, data type appropriateness
UniversalNamingAnalysis AS (
SELECT
aci.object_id,
aci.column_id,
aci.column_name,
aci.schema_name,
aci.table_name,
aci.data_type_category,
-- ====================================================================
-- COMPREHENSIVE NAMING QUALITY ANALYSIS
-- ====================================================================
-- Evaluates naming conventions, length, clarity, and best practices
-- Column Name Length Assessment (useful for quality evaluation)
LEN(aci.column_name) AS name_length,
-- ====================================================================
-- NAMING CONVENTION SCORE (0-100 SCALE)
-- ====================================================================
-- Multi-factor assessment of naming quality based on industry best practices
CASE
-- EXCELLENT NAMING (95 points): Meets all best practice criteria
-- - Appropriate length (not too short, not too long)
-- - Avoids temporary/test naming patterns
-- - Descriptive and meaningful
-- - Follows standard conventions
WHEN LEN(aci.column_name) BETWEEN 3 AND 50 AND
aci.column_name NOT LIKE '%temp%' AND -- Avoids temporary naming
aci.column_name NOT LIKE '%old%' AND -- Avoids legacy naming
aci.column_name NOT LIKE '%bak%' AND -- Avoids backup naming
aci.column_name NOT LIKE '%test%' AND -- Avoids test naming
aci.column_name NOT LIKE '%1' AND -- Avoids numbered suffixes
aci.column_name NOT LIKE '%2' AND -- Avoids numbered suffixes
aci.column_name NOT LIKE 'col%' AND -- Avoids generic "column" names
aci.column_name NOT LIKE 'field%' AND -- Avoids generic "field" names
aci.column_name <> 'id' AND -- Avoids non-descriptive "id"
                 aci.column_name NOT LIKE '%[_]%[_]%[_]%' -- Not overly complex (underscore is a LIKE wildcard; [_] matches a literal underscore)
THEN 95
-- GOOD NAMING (80 points): Meets most standards, minor issues
-- - Reasonable length and avoids obvious bad patterns
WHEN LEN(aci.column_name) BETWEEN 2 AND 75 AND
aci.column_name NOT LIKE '%temp%' AND
aci.column_name NOT LIKE '%old%' AND
aci.column_name NOT LIKE '%test%'
THEN 80
-- FAIR NAMING (65 points): Acceptable but could improve
-- - Basic length requirements met, some issues present
WHEN LEN(aci.column_name) BETWEEN 2 AND 100 AND
aci.column_name NOT LIKE '%temp%'
THEN 65
-- POOR NAMING (40 points): Below standards, needs improvement
-- - Too short/long, or contains problematic patterns
WHEN LEN(aci.column_name) < 2 OR LEN(aci.column_name) > 100 OR
aci.column_name LIKE '%temp%' OR aci.column_name LIKE '%old%'
THEN 40
-- VERY POOR NAMING (20 points): Major issues requiring attention
ELSE 20
END AS naming_quality_score,
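        -- WORKED EXAMPLES (hypothetical names): 'ModifiedDate' (12 characters,
        -- no flagged patterns) scores 95; 'temp_col1' contains 'temp' and
        -- scores 40.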
-- ====================================================================
-- UNIVERSAL PII PATTERN DETECTION
-- ====================================================================
-- Apply PII detection to ALL columns, not just suspected ones
-- Uses broader pattern matching to catch potential PII across all naming styles
-- ====================================================================
-- POTENTIAL PII CATEGORY (COMPREHENSIVE DETECTION)
-- ====================================================================
-- Analyzes column names for patterns indicating different types of personal data
CASE
-- GOVERNMENT AND LEGAL IDENTIFIERS
-- Highest risk category - legal penalties for misuse
WHEN aci.column_name LIKE '%ssn%' OR aci.column_name LIKE '%social%security%' OR
aci.column_name LIKE '%tax%id%' OR aci.column_name LIKE '%ein%' OR
aci.column_name LIKE '%passport%' OR aci.column_name LIKE '%license%'
THEN 'GOVERNMENT_LEGAL_ID'
-- FINANCIAL INFORMATION
-- Protected by financial regulations (PCI-DSS, banking laws)
WHEN aci.column_name LIKE '%credit%card%' OR aci.column_name LIKE '%account%number%' OR
aci.column_name LIKE '%salary%' OR aci.column_name LIKE '%wage%' OR
aci.column_name LIKE '%income%' OR aci.column_name LIKE '%pay%'
THEN 'FINANCIAL_DATA'
-- PERSONAL NAMES
-- Can enable identity theft when combined with other data
WHEN aci.column_name LIKE '%name%' OR aci.column_name LIKE '%first%' OR
aci.column_name LIKE '%last%' OR aci.column_name LIKE '%middle%' OR
aci.column_name LIKE '%display%'
THEN 'PERSONAL_NAME'
-- CONTACT INFORMATION
-- Used for communication but also phishing/social engineering
WHEN aci.column_name LIKE '%email%' OR aci.column_name LIKE '%mail%' OR
aci.column_name LIKE '%phone%' OR aci.column_name LIKE '%mobile%' OR
aci.column_name LIKE '%telephone%'
THEN 'CONTACT_INFO'
-- ADDRESS INFORMATION
-- Location data, delivery information, can reveal patterns
WHEN aci.column_name LIKE '%address%' OR aci.column_name LIKE '%street%' OR
aci.column_name LIKE '%city%' OR aci.column_name LIKE '%state%' OR
aci.column_name LIKE '%zip%' OR aci.column_name LIKE '%postal%' OR
aci.column_name LIKE '%country%'
THEN 'ADDRESS_INFO'
-- DEMOGRAPHIC DATA
-- Protected classes, potential discrimination risks
WHEN aci.column_name LIKE '%birth%' OR aci.column_name LIKE '%dob%' OR
aci.column_name LIKE '%age%' OR aci.column_name LIKE '%gender%' OR
aci.column_name LIKE '%sex%' OR aci.column_name LIKE '%race%' OR
aci.column_name LIKE '%ethnicity%'
THEN 'DEMOGRAPHIC_DATA'
-- AUTHENTICATION/SECURITY
-- System access credentials and security tokens
WHEN aci.column_name LIKE '%password%' OR aci.column_name LIKE '%secret%' OR
aci.column_name LIKE '%token%' OR aci.column_name LIKE '%hash%' OR
aci.column_name LIKE '%key%'
THEN 'AUTHENTICATION_SECURITY'
-- HEALTH INFORMATION
-- Protected by health privacy laws (HIPAA, etc.)
WHEN aci.column_name LIKE '%medical%' OR aci.column_name LIKE '%health%' OR
aci.column_name LIKE '%diagnosis%' OR aci.column_name LIKE '%treatment%'
THEN 'HEALTH_DATA'
-- SYSTEM/TECHNICAL FIELDS
-- Usually safe but important for system integrity
WHEN aci.column_name LIKE '%id' OR aci.column_name LIKE '%key' OR
aci.column_name LIKE '%guid%' OR aci.column_name LIKE '%uuid%'
THEN 'SYSTEM_IDENTIFIER'
-- TEMPORAL FIELDS
-- Time-based data for auditing and analysis
WHEN aci.column_name LIKE '%date%' OR aci.column_name LIKE '%time%' OR
aci.column_name LIKE '%created%' OR aci.column_name LIKE '%modified%' OR
aci.column_name LIKE '%updated%'
THEN 'TEMPORAL_DATA'
-- FINANCIAL/BUSINESS METRICS
-- Business-sensitive numerical data
WHEN aci.column_name LIKE '%amount%' OR aci.column_name LIKE '%price%' OR
aci.column_name LIKE '%cost%' OR aci.column_name LIKE '%value%' OR
aci.column_name LIKE '%total%'
THEN 'BUSINESS_METRIC'
-- STATUS/CONTROL FIELDS
-- Workflow and process control indicators
WHEN aci.column_name LIKE '%status%' OR aci.column_name LIKE '%flag%' OR
aci.column_name LIKE '%active%' OR aci.column_name LIKE '%enabled%'
THEN 'STATUS_CONTROL'
-- DEFAULT CATEGORY
-- Columns that don't match specific patterns
ELSE 'GENERAL_DATA'
END AS potential_pii_category,
-- ====================================================================
-- PII RISK ASSESSMENT FOR ALL COLUMNS
-- ====================================================================
-- Risk level assessment based on potential impact of data exposure
CASE
-- CRITICAL RISK: Immediate identity theft or system compromise possible
-- Legal violations, financial fraud, system breaches
WHEN aci.column_name LIKE '%ssn%' OR aci.column_name LIKE '%social%security%' OR
aci.column_name LIKE '%passport%' OR aci.column_name LIKE '%credit%card%' OR
aci.column_name LIKE '%password%'
THEN 'CRITICAL_RISK'
-- HIGH RISK: Significant personal or financial harm possible
-- Identity reconstruction, financial loss, privacy violations
WHEN aci.column_name LIKE '%name%' OR aci.column_name LIKE '%email%' OR
aci.column_name LIKE '%phone%' OR aci.column_name LIKE '%address%' OR
aci.column_name LIKE '%salary%' OR aci.column_name LIKE '%birth%'
THEN 'HIGH_RISK'
-- MEDIUM RISK: Moderate privacy concerns, partial identity exposure
-- Can be problematic in combination with other data
WHEN aci.column_name LIKE '%age%' OR aci.column_name LIKE '%gender%' OR
aci.column_name LIKE '%city%' OR aci.column_name LIKE '%state%'
THEN 'MEDIUM_RISK'
-- LOW RISK: Limited privacy impact, usually safe when isolated
-- System fields, non-personal identifiers
WHEN aci.column_name LIKE '%id%' OR aci.column_name LIKE '%date%' OR
aci.column_name LIKE '%flag%' OR aci.column_name LIKE '%status%'
THEN 'LOW_RISK'
-- MINIMAL RISK: Generally safe, public or non-sensitive data
ELSE 'MINIMAL_RISK'
END AS pii_risk_assessment,
-- ====================================================================
-- DATA TYPE APPROPRIATENESS FOR ALL COLUMNS
-- ====================================================================
-- Evaluates whether data type choice is appropriate for column name/purpose
-- Higher scores indicate better alignment between name and data type
CASE
-- PERFECT MATCHES (95-100 points): Ideal data type for the purpose
WHEN aci.column_name LIKE '%date%' AND aci.data_type_category = 'TEMPORAL' THEN 100
WHEN aci.column_name LIKE '%time%' AND aci.data_type_category = 'TEMPORAL' THEN 100
WHEN aci.column_name LIKE '%amount%' AND aci.data_type_category = 'NUMERIC' THEN 100
WHEN aci.column_name LIKE '%price%' AND aci.data_type_category = 'NUMERIC' THEN 100
WHEN aci.column_name LIKE '%count%' AND aci.data_type_category = 'INTEGER' THEN 100
WHEN aci.column_name LIKE '%flag%' AND aci.data_type_category = 'BOOLEAN' THEN 100
WHEN aci.column_name LIKE '%id' AND aci.data_type_category IN ('INTEGER', 'IDENTIFIER') THEN 95
WHEN aci.column_name LIKE '%name%' AND aci.data_type_category = 'TEXT' THEN 90
WHEN aci.column_name LIKE '%description%' AND aci.data_type_category = 'TEXT' THEN 90
-- GOOD MATCHES (75-89 points): Appropriate but not perfect
WHEN aci.data_type_category = 'TEXT' AND aci.column_name LIKE '%code%' THEN 85
WHEN aci.data_type_category = 'TEXT' AND aci.column_name LIKE '%address%' THEN 85
WHEN aci.data_type_category = 'NUMERIC' AND aci.column_name LIKE '%rate%' THEN 80
WHEN aci.data_type_category = 'INTEGER' AND aci.column_name LIKE '%number%' THEN 80
-- REASONABLE MATCHES (60-74 points): Generally acceptable
WHEN aci.data_type_category = 'TEXT' THEN 70
WHEN aci.data_type_category = 'INTEGER' THEN 65
WHEN aci.data_type_category = 'NUMERIC' THEN 65
WHEN aci.data_type_category = 'TEMPORAL' THEN 70
WHEN aci.data_type_category = 'BOOLEAN' THEN 65
-- BELOW AVERAGE MATCHES (40-59 points): Questionable choices
WHEN aci.data_type_category = 'BINARY' THEN 50
WHEN aci.data_type_category = 'SYSTEM' THEN 45
-- POOR MATCHES (0-39 points): Likely inappropriate data type choices
ELSE 40
END AS data_type_appropriateness_score
FROM AllColumnsInventory aci
),
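-- WORKED EXAMPLE (hypothetical): a TEXT column named 'EmailAddress' is tagged
-- CONTACT_INFO, assessed HIGH_RISK (the '%email%' pattern), and receives a
-- data type appropriateness score of 85 via the TEXT + '%address%' match.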
-- ============================================================================
-- CTE 4: BUSINESS CONTEXT AND VALUE ASSESSMENT
-- ============================================================================
-- PURPOSE: Assess business value and context for ALL columns
-- METHOD: Schema-based and naming-based business value inference
-- SCOPE: Every column gets business importance scoring and domain classification
-- OUTPUT: Business value scores, criticality assessment, domain categorization
BusinessContextAnalysis AS (
SELECT
una.object_id,
una.column_id,
una.schema_name,
una.table_name,
una.column_name,
-- ====================================================================
-- BUSINESS VALUE SCORING (0-100 SCALE)
-- ====================================================================
-- Determines business importance based on schema, naming, and patterns
-- Higher scores indicate greater business importance and value
-- Three-component scoring system with weighted contributions:
-- COMPONENT 1: SCHEMA-BASED BUSINESS VALUE (40% OF TOTAL SCORE)
-- Different schemas typically have different business importance levels
CASE una.schema_name
WHEN 'Sales' THEN 40 -- Revenue generating activities - highest priority
WHEN 'Person' THEN 35 -- Customer data - critical for business
WHEN 'HumanResources' THEN 35 -- Employee data - legal and operational importance
WHEN 'Production' THEN 30 -- Operations data - important for manufacturing
WHEN 'Purchasing' THEN 25 -- Procurement data - cost management
WHEN 'dbo' THEN 20 -- Default schema - variable importance
ELSE 15 -- Other schemas - standard importance
END +
-- COMPONENT 2: COLUMN NAME BUSINESS VALUE INDICATORS (35% OF TOTAL SCORE)
-- Column names that suggest high business value or critical operations
CASE
-- Revenue/Financial keywords - highest business impact
WHEN una.column_name LIKE '%revenue%' OR una.column_name LIKE '%sales%' OR
una.column_name LIKE '%profit%' OR una.column_name LIKE '%income%' THEN 35
-- Customer/Client keywords - critical for customer relationships
WHEN una.column_name LIKE '%customer%' OR una.column_name LIKE '%client%' OR
una.column_name LIKE '%name%' THEN 30
-- Product/Order keywords - core business operations
WHEN una.column_name LIKE '%product%' OR una.column_name LIKE '%order%' OR
una.column_name LIKE '%price%' OR una.column_name LIKE '%amount%' THEN 25
-- Temporal/Status keywords - tracking and workflow
WHEN una.column_name LIKE '%date%' OR una.column_name LIKE '%time%' OR
una.column_name LIKE '%status%' THEN 20
-- Identifier keywords - system integrity
WHEN una.column_name LIKE '%id%' OR una.column_name LIKE '%key%' THEN 15
-- Descriptive keywords - supporting information
WHEN una.column_name LIKE '%description%' OR una.column_name LIKE '%note%' THEN 10
-- Other columns - basic value
ELSE 5
END +
-- COMPONENT 3: DATA TYPE BUSINESS VALUE (25% OF TOTAL SCORE)
-- Different data types typically have different business analysis value
CASE una.data_type_category
WHEN 'NUMERIC' THEN 25 -- Financial calculations, measurements, KPIs
WHEN 'TEXT' THEN 20 -- Descriptive information, names, codes
WHEN 'TEMPORAL' THEN 20 -- Time-based analysis, trends, auditing
WHEN 'IDENTIFIER' THEN 15 -- Relationships, unique identification
WHEN 'INTEGER' THEN 15 -- Counts, references, flags
WHEN 'BOOLEAN' THEN 10 -- Status indicators, flags
ELSE 5 -- Other types - limited business analysis value
END AS business_value_score,
-- ====================================================================
-- BUSINESS CRITICALITY ASSESSMENT
-- ====================================================================
-- Categorical assessment of how critical this column is to business operations
CASE
-- CRITICAL: Revenue-generating or financial data in key schemas
WHEN (una.schema_name IN ('Sales', 'Person') AND una.column_name LIKE '%amount%') OR
(una.column_name LIKE '%revenue%' OR una.column_name LIKE '%profit%') THEN 'CRITICAL'
-- HIGH: Data in business-critical schemas
WHEN una.schema_name IN ('Sales', 'Person', 'HumanResources') THEN 'HIGH'
-- MEDIUM: Operational data supporting business processes
WHEN una.schema_name IN ('Production', 'Purchasing') THEN 'MEDIUM'
-- STANDARD: General business data with normal importance
ELSE 'STANDARD'
END AS business_criticality,
-- ====================================================================
-- DOMAIN CLASSIFICATION
-- ====================================================================
-- Assigns each column to a business domain for governance and stewardship
CASE
WHEN una.schema_name = 'Sales' THEN 'SALES_REVENUE'
WHEN una.schema_name = 'Person' THEN 'CUSTOMER_DATA'
WHEN una.schema_name = 'HumanResources' THEN 'EMPLOYEE_DATA'
WHEN una.schema_name = 'Production' THEN 'OPERATIONS'
WHEN una.schema_name = 'Purchasing' THEN 'PROCUREMENT'
WHEN una.column_name LIKE '%financial%' OR una.column_name LIKE '%money%' THEN 'FINANCIAL'
WHEN una.column_name LIKE '%audit%' OR una.column_name LIKE '%log%' THEN 'AUDIT_COMPLIANCE'
WHEN una.column_name LIKE '%config%' OR una.column_name LIKE '%setting%' THEN 'CONFIGURATION'
ELSE 'GENERAL_BUSINESS'
END AS domain_category,
-- ====================================================================
-- DATA LIFECYCLE ASSESSMENT
-- ====================================================================
-- Determines the lifecycle stage and usage pattern of the data
CASE
-- Active transactional data - frequently updated, current operations
WHEN una.column_name LIKE '%created%' OR una.column_name LIKE '%modified%' OR
una.column_name LIKE '%updated%' OR una.column_name LIKE '%date%' THEN 'ACTIVE_TRANSACTIONAL'
-- Historical/archived data - preserved for compliance or analysis
WHEN una.column_name LIKE '%archived%' OR una.column_name LIKE '%deleted%' OR
una.column_name LIKE '%historical%' THEN 'HISTORICAL_ARCHIVE'
-- Temporary/staging data - processing intermediate results
WHEN una.column_name LIKE '%temp%' OR una.column_name LIKE '%staging%' THEN 'TEMPORARY_PROCESSING'
-- Active reference data - stable, frequently read
ELSE 'ACTIVE_REFERENCE'
END AS data_lifecycle_stage
FROM UniversalNamingAnalysis una
)
-- ============================================================================
-- MAIN INSERT STATEMENT - COMPLETE ANALYSIS OF ALL COLUMNS
-- ============================================================================
-- Combines all the CTE results and calculates final scores and classifications
-- This INSERT processes every column and generates comprehensive governance analysis
INSERT INTO #CompleteColumnAnalysis (
-- Column identification fields
database_name, schema_name, table_name, column_name, ordinal_position, full_column_name,
-- Technical metadata fields
data_type_category, base_data_type, formatted_data_type, max_length, precision_value, scale_value,
character_set_name, collation_name,
-- Column properties (constraint and structure flags)
is_nullable, is_identity, is_computed, is_primary_key, is_foreign_key, is_unique_key,
is_indexed, has_default_constraint, has_check_constraint,
-- Quality scoring fields
overall_column_score, technical_quality_score, structural_integrity_score,
naming_quality_score, business_value_score,
-- Quality level classifications
overall_quality_level, technical_quality_level, structural_quality_level, naming_quality_level,
-- Sensitivity and risk fields
data_sensitivity_category, potential_pii_indicator, pii_risk_assessment, business_criticality,
-- Governance recommendation fields
data_classification, access_control_recommendation, encryption_recommendation,
masking_recommendation, monitoring_priority,
-- Business context fields
inferred_purpose, domain_category, data_lifecycle_stage,
-- Compliance and regulatory fields
regulatory_scope, retention_category, compliance_risk_level,
-- Actionable recommendation fields
immediate_actions, recommended_improvements, governance_priority, steward_assignment
)
SELECT
-- ====================================================================
-- BASIC COLUMN IDENTIFICATION
-- ====================================================================
aci.database_name,
aci.schema_name,
aci.table_name,
aci.column_name,
aci.ordinal_position,
aci.full_column_name,
-- ====================================================================
-- TECHNICAL METADATA SECTION
-- ====================================================================
aci.data_type_category,
aci.base_data_type,
aci.formatted_data_type,
aci.max_length,
aci.precision_value,
aci.scale_value,
aci.character_set_name,
aci.collation_name,
-- ====================================================================
-- COLUMN PROPERTIES (CONSTRAINTS AND STRUCTURE)
-- ====================================================================
aci.is_nullable,
aci.is_identity,
aci.is_computed,
cca.is_primary_key,
cca.is_foreign_key,
cca.is_unique_key,
cca.is_indexed,
cca.has_default_constraint,
cca.has_check_constraint,
-- ====================================================================
-- COMPREHENSIVE SCORING FOR ALL COLUMNS
-- ====================================================================
-- OVERALL COLUMN SCORE (0-100) - WEIGHTED COMPOSITE OF ALL FACTORS
-- This is the primary quality indicator combining multiple dimensions
-- Weighting rationale:
-- - Business Value (30%): Most important - drives governance priorities
-- - Data Type Appropriateness (25%): Technical quality foundation
-- - Constraints (25%): Data integrity and reliability
-- - Naming Quality (20%): Maintainability and understanding
CAST(
(una.naming_quality_score * 0.20 + -- 20% naming quality
una.data_type_appropriateness_score * 0.25 + -- 25% data type appropriateness
LEAST(cca.total_constraint_score, 100) * 0.25 + -- 25% constraints (capped at 100)
bca.business_value_score * 0.30 -- 30% business value
) AS DECIMAL(5,2)
) AS overall_column_score,
-- TECHNICAL QUALITY SCORE (data type appropriateness + technical implementation)
-- Focuses on technical implementation quality and appropriateness
CAST((una.data_type_appropriateness_score * 0.6 + -- 60% data type fit
CASE WHEN aci.is_computed = 1 THEN 90 -- Computed columns: sophisticated
WHEN aci.is_identity = 1 THEN 85 -- Identity columns: system-managed
ELSE 70 END * 0.4 -- Regular columns: standard baseline
) AS DECIMAL(5,2)) AS technical_quality_score,
-- STRUCTURAL INTEGRITY SCORE (constraints and relationships)
-- Measures data integrity controls and relationship management
CAST(LEAST(cca.total_constraint_score, 100) AS DECIMAL(5,2)) AS structural_integrity_score,
-- NAMING QUALITY SCORE (from naming analysis)
CAST(una.naming_quality_score AS DECIMAL(5,2)) AS naming_quality_score,
-- BUSINESS VALUE SCORE (from business context analysis)
CAST(bca.business_value_score AS DECIMAL(5,2)) AS business_value_score,
-- ====================================================================
-- QUALITY LEVEL CLASSIFICATIONS
-- ====================================================================
-- Convert numeric scores to categorical levels for easier interpretation
-- OVERALL QUALITY LEVEL
-- Based on the weighted composite score calculated above
CASE
WHEN (una.naming_quality_score * 0.20 + una.data_type_appropriateness_score * 0.25 +
LEAST(cca.total_constraint_score, 100) * 0.25 + bca.business_value_score * 0.30) >= 90 THEN 'EXCELLENT'
WHEN (una.naming_quality_score * 0.20 + una.data_type_appropriateness_score * 0.25 +
LEAST(cca.total_constraint_score, 100) * 0.25 + bca.business_value_score * 0.30) >= 75 THEN 'GOOD'
WHEN (una.naming_quality_score * 0.20 + una.data_type_appropriateness_score * 0.25 +
LEAST(cca.total_constraint_score, 100) * 0.25 + bca.business_value_score * 0.30) >= 60 THEN 'FAIR'
ELSE 'POOR'
END AS overall_quality_level,
-- TECHNICAL QUALITY LEVEL
-- Based on data type appropriateness score
CASE
WHEN una.data_type_appropriateness_score >= 90 THEN 'EXCELLENT'
WHEN una.data_type_appropriateness_score >= 75 THEN 'GOOD'
WHEN una.data_type_appropriateness_score >= 60 THEN 'FAIR'
ELSE 'POOR'
END AS technical_quality_level,
-- STRUCTURAL QUALITY LEVEL
-- Based on constraint coverage score
CASE
WHEN cca.total_constraint_score >= 75 THEN 'EXCELLENT'
WHEN cca.total_constraint_score >= 50 THEN 'GOOD'
WHEN cca.total_constraint_score >= 25 THEN 'FAIR'
ELSE 'POOR'
END AS structural_quality_level,
-- NAMING QUALITY LEVEL
-- Based on naming convention score
CASE
WHEN una.naming_quality_score >= 90 THEN 'EXCELLENT'
WHEN una.naming_quality_score >= 75 THEN 'GOOD'
WHEN una.naming_quality_score >= 60 THEN 'FAIR'
ELSE 'POOR'
END AS naming_quality_level,
-- ====================================================================
-- SENSITIVITY AND RISK ANALYSIS
-- ====================================================================
-- DATA SENSITIVITY CATEGORY
-- Maps PII risk assessment to sensitivity categories for governance
CASE
WHEN una.pii_risk_assessment = 'CRITICAL_RISK' THEN 'HIGHLY_SENSITIVE'
WHEN una.pii_risk_assessment = 'HIGH_RISK' THEN 'SENSITIVE'
WHEN una.pii_risk_assessment = 'MEDIUM_RISK' THEN 'MODERATELY_SENSITIVE'
WHEN una.pii_risk_assessment = 'LOW_RISK' THEN 'LOW_SENSITIVITY'
ELSE 'PUBLIC'
END AS data_sensitivity_category,
-- POTENTIAL PII INDICATOR (from naming analysis)
una.potential_pii_category AS potential_pii_indicator,
-- PII RISK ASSESSMENT (from naming analysis)
una.pii_risk_assessment,
-- BUSINESS CRITICALITY (from business context analysis)
bca.business_criticality,
-- ====================================================================
-- GOVERNANCE RECOMMENDATIONS
-- ====================================================================
-- Specific governance controls and requirements based on risk and sensitivity
-- DATA CLASSIFICATION
-- Formal data classification level for access control and handling procedures
CASE
WHEN una.pii_risk_assessment = 'CRITICAL_RISK' THEN 'RESTRICTED'
WHEN una.pii_risk_assessment = 'HIGH_RISK' THEN 'CONFIDENTIAL'
WHEN una.pii_risk_assessment = 'MEDIUM_RISK' THEN 'SENSITIVE'
WHEN bca.business_criticality = 'CRITICAL' THEN 'BUSINESS_CRITICAL'
WHEN bca.business_criticality = 'HIGH' THEN 'BUSINESS_SENSITIVE'
ELSE 'INTERNAL'
END AS data_classification,
-- ACCESS CONTROL RECOMMENDATION
-- Recommended access control level based on sensitivity and risk
CASE
WHEN una.pii_risk_assessment = 'CRITICAL_RISK' THEN 'HIGHLY_RESTRICTED'
WHEN una.pii_risk_assessment = 'HIGH_RISK' THEN 'RESTRICTED'
WHEN una.pii_risk_assessment = 'MEDIUM_RISK' THEN 'CONTROLLED'
WHEN bca.business_criticality IN ('CRITICAL', 'HIGH') THEN 'MANAGED'
ELSE 'STANDARD'
END AS access_control_recommendation,
-- ENCRYPTION RECOMMENDATION
-- Encryption requirements based on sensitivity and regulatory needs
CASE
WHEN una.pii_risk_assessment = 'CRITICAL_RISK' THEN 'MANDATORY'
WHEN una.pii_risk_assessment = 'HIGH_RISK' THEN 'REQUIRED'
WHEN una.pii_risk_assessment = 'MEDIUM_RISK' THEN 'RECOMMENDED'
WHEN bca.business_criticality = 'CRITICAL' THEN 'RECOMMENDED'
ELSE 'NOT_REQUIRED'
END AS encryption_recommendation,
-- MASKING RECOMMENDATION
-- Data masking requirements for non-production environments
CASE
WHEN una.pii_risk_assessment IN ('CRITICAL_RISK', 'HIGH_RISK', 'MEDIUM_RISK') THEN 'REQUIRED'
WHEN bca.business_criticality = 'CRITICAL' THEN 'RECOMMENDED'
ELSE 'NOT_REQUIRED'
END AS masking_recommendation,
-- MONITORING PRIORITY
-- Access monitoring and audit logging requirements
CASE
WHEN una.pii_risk_assessment = 'CRITICAL_RISK' THEN 'CONTINUOUS'
WHEN una.pii_risk_assessment = 'HIGH_RISK' THEN 'HIGH'
WHEN una.pii_risk_assessment = 'MEDIUM_RISK' OR bca.business_criticality = 'CRITICAL' THEN 'MEDIUM'
ELSE 'STANDARD'
END AS monitoring_priority,
-- ====================================================================
-- BUSINESS CONTEXT
-- ====================================================================
-- INFERRED PURPOSE
-- Likely business purpose based on naming patterns and context
CASE
WHEN una.potential_pii_category = 'SYSTEM_IDENTIFIER' THEN 'System identification and relationships'
WHEN una.potential_pii_category = 'PERSONAL_NAME' THEN 'Individual identification and communication'
WHEN una.potential_pii_category = 'CONTACT_INFO' THEN 'Communication and customer service'
WHEN una.potential_pii_category = 'ADDRESS_INFO' THEN 'Location services and delivery'
WHEN una.potential_pii_category = 'FINANCIAL_DATA' THEN 'Financial transactions and analysis'
WHEN una.potential_pii_category = 'TEMPORAL_DATA' THEN 'Time tracking and audit trails'
WHEN una.potential_pii_category = 'BUSINESS_METRIC' THEN 'Business analytics and reporting'
WHEN una.potential_pii_category = 'STATUS_CONTROL' THEN 'Process control and workflow management'
ELSE 'General business operations and data storage'
END AS inferred_purpose,
-- DOMAIN CATEGORY (from business context analysis)
bca.domain_category,
-- DATA LIFECYCLE STAGE (from business context analysis)
bca.data_lifecycle_stage,
-- ====================================================================
-- COMPLIANCE AND REGULATORY
-- ====================================================================
-- REGULATORY SCOPE
-- Applicable legal and regulatory frameworks based on data type and schema
CASE
WHEN una.potential_pii_category = 'GOVERNMENT_LEGAL_ID' THEN 'GDPR Art.9, CCPA Sensitive PI, Government ID Laws'
WHEN una.potential_pii_category = 'FINANCIAL_DATA' THEN 'PCI-DSS, SOX, GDPR, CCPA, Financial Regulations'
WHEN una.potential_pii_category IN ('PERSONAL_NAME', 'CONTACT_INFO', 'ADDRESS_INFO') THEN 'GDPR, CCPA, Privacy Laws'
WHEN una.potential_pii_category = 'HEALTH_DATA' THEN 'HIPAA, GDPR Art.9, Health Privacy Laws'
WHEN aci.schema_name = 'HumanResources' THEN 'Employment Laws, GDPR, CCPA'
WHEN aci.schema_name = 'Sales' THEN 'Consumer Protection, GDPR, CCPA'
ELSE 'Standard Data Protection Laws'
END AS regulatory_scope,
-- RETENTION CATEGORY
-- Data retention classification based on type and business context
CASE
WHEN una.potential_pii_category = 'GOVERNMENT_LEGAL_ID' THEN 'LEGAL_RETENTION'
WHEN una.potential_pii_category = 'FINANCIAL_DATA' THEN 'FINANCIAL_RETENTION'
WHEN una.potential_pii_category IN ('PERSONAL_NAME', 'CONTACT_INFO') THEN 'CUSTOMER_RETENTION'
WHEN aci.schema_name = 'HumanResources' THEN 'EMPLOYEE_RETENTION'
WHEN aci.schema_name = 'Sales' THEN 'BUSINESS_RETENTION'
ELSE 'STANDARD_RETENTION'
END AS retention_category,
-- COMPLIANCE RISK LEVEL
-- Overall regulatory compliance risk assessment
CASE
WHEN una.pii_risk_assessment = 'CRITICAL_RISK' THEN 'VERY_HIGH'
WHEN una.pii_risk_assessment = 'HIGH_RISK' THEN 'HIGH'
WHEN una.pii_risk_assessment = 'MEDIUM_RISK' THEN 'MEDIUM'
ELSE 'LOW'
END AS compliance_risk_level,
-- ====================================================================
-- ACTIONABLE RECOMMENDATIONS
-- ====================================================================
-- IMMEDIATE ACTIONS
-- Urgent actions required based on risk level and quality issues
CASE
WHEN una.pii_risk_assessment = 'CRITICAL_RISK'
THEN 'URGENT: Implement encryption, restrict access, enable audit logging'
WHEN una.pii_risk_assessment = 'HIGH_RISK'
THEN 'HIGH PRIORITY: Apply data masking, implement access controls'
WHEN una.naming_quality_score < 40
THEN 'IMPROVE: Review and improve column naming standards'
WHEN cca.total_constraint_score < 20
THEN 'ENHANCE: Add appropriate constraints for data integrity'
ELSE NULL -- No immediate actions required
END AS immediate_actions,
-- RECOMMENDED IMPROVEMENTS
-- Suggested improvements for better governance, quality, or performance
CASE
WHEN una.data_type_appropriateness_score < 60
THEN 'Consider reviewing data type choice for better alignment with data content'
WHEN cca.total_constraint_score < 40
THEN 'Add constraints (NOT NULL, CHECK, FK) to improve data quality'
WHEN una.naming_quality_score < 70
THEN 'Improve naming convention adherence for better maintainability'
WHEN cca.is_indexed = 0 AND (cca.is_foreign_key = 1 OR una.column_name LIKE '%id%')
THEN 'Consider adding index for performance optimization'
ELSE 'No immediate improvements required'
END AS recommended_improvements,
-- GOVERNANCE PRIORITY
-- Priority level for data governance attention and resource allocation
CASE
WHEN una.pii_risk_assessment IN ('CRITICAL_RISK', 'HIGH_RISK') THEN 'URGENT'
WHEN una.pii_risk_assessment = 'MEDIUM_RISK' OR bca.business_criticality = 'CRITICAL' THEN 'HIGH'
WHEN bca.business_criticality = 'HIGH' OR
(una.naming_quality_score < 50 AND cca.total_constraint_score < 30) THEN 'MEDIUM'
ELSE 'LOW'
END AS governance_priority,
-- STEWARD ASSIGNMENT
-- Suggested data steward type/role based on domain and sensitivity
CASE
WHEN aci.schema_name = 'Sales' THEN 'SALES_DATA_STEWARD'
WHEN aci.schema_name = 'Person' THEN 'CUSTOMER_DATA_STEWARD'
WHEN aci.schema_name = 'HumanResources' THEN 'HR_DATA_STEWARD'
WHEN aci.schema_name = 'Production' THEN 'OPERATIONS_DATA_STEWARD'
WHEN una.pii_risk_assessment IN ('CRITICAL_RISK', 'HIGH_RISK') THEN 'PRIVACY_OFFICER'
ELSE 'GENERAL_DATA_STEWARD'
END AS steward_assignment
-- ====================================================================
-- JOIN ALL CTEs TO GET COMPLETE ANALYSIS
-- ====================================================================
FROM AllColumnsInventory aci
JOIN CompleteConstraintAnalysis cca ON aci.object_id = cca.object_id AND aci.column_id = cca.column_id
JOIN UniversalNamingAnalysis una ON aci.object_id = una.object_id AND aci.column_id = una.column_id
JOIN BusinessContextAnalysis bca ON aci.object_id = bca.object_id AND aci.column_id = bca.column_id
-- ====================================================================
-- ORDERING RESULTS BY GOVERNANCE PRIORITY
-- ====================================================================
-- Order by governance priority (most critical first), then by schema/table/column
-- NOTE: an ORDER BY on an INSERT...SELECT does not guarantee the order rows come
-- back out of the temp table; each report below applies its own ORDER BY
ORDER BY
-- Primary sort: Governance priority (most urgent first)
CASE
WHEN una.pii_risk_assessment IN ('CRITICAL_RISK', 'HIGH_RISK') THEN 1
WHEN una.pii_risk_assessment = 'MEDIUM_RISK' OR bca.business_criticality = 'CRITICAL' THEN 2
WHEN bca.business_criticality = 'HIGH' THEN 3
ELSE 4
END,
-- Secondary sort: Schema name (alphabetical)
aci.schema_name,
-- Tertiary sort: Table name (alphabetical)
aci.table_name,
-- Final sort: Column position in table (logical order)
aci.ordinal_position;
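-- ============================================================================
-- OPTIONAL SANITY CHECKS (ILLUSTRATIVE)
-- ============================================================================
-- A minimal sketch, assuming the INSERT above completed: first confirm every
-- composite score landed in the 0-100 range (an empty result means all good),
-- then back out the implied data type appropriateness component from the
-- persisted scores using the documented weights (20% naming + 25% data type +
-- 25% constraints + 30% business value). The raw appropriateness score itself
-- is not persisted, so this reconstruction is the only way to inspect it here.
SELECT full_column_name, overall_column_score
FROM #CompleteColumnAnalysis
WHERE overall_column_score NOT BETWEEN 0 AND 100;

SELECT TOP (10)
    full_column_name,
    overall_column_score,
    -- Implied data type component; should fall between 0 and 100 if the
    -- weighting was applied as documented above
    CAST((overall_column_score
          - naming_quality_score * 0.20
          - structural_integrity_score * 0.25
          - business_value_score * 0.30) / 0.25 AS DECIMAL(5,2)) AS implied_data_type_score
FROM #CompleteColumnAnalysis
ORDER BY overall_column_score DESC;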
-- ============================================================================
-- COMPLETION MESSAGES AND SUMMARY STATISTICS
-- ============================================================================
PRINT '============================================================================';
PRINT 'COMPLETE DATABASE COLUMN ANALYSIS - FINISHED';
PRINT '============================================================================';
-- Calculate and display summary statistics
-- Provides immediate feedback on the analysis without declaring intermediate variables
SELECT
'ANALYSIS SUMMARY' AS summary_type,
COUNT(*) AS total_columns_analyzed,
SUM(CASE WHEN pii_risk_assessment IN ('CRITICAL_RISK', 'HIGH_RISK') THEN 1 ELSE 0 END) AS high_risk_columns,
SUM(CASE WHEN overall_quality_level = 'POOR' THEN 1 ELSE 0 END) AS poor_quality_columns,
SUM(CASE WHEN governance_priority IN ('URGENT', 'HIGH') THEN 1 ELSE 0 END) AS priority_attention_needed
FROM #CompleteColumnAnalysis;
PRINT '============================================================================';
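-- ============================================================================
-- OPTIONAL: PERSIST THE SUMMARY FOR TREND TRACKING (ILLUSTRATIVE)
-- ============================================================================
-- A sketch, not part of the core workshop: #CompleteColumnAnalysis is a temp
-- table and disappears with the session. To track governance metrics across
-- runs, you could capture each run's summary into a permanent table. The name
-- 'dbo.GovernanceSummaryHistory' is hypothetical; adjust to your conventions.
-- Uncomment to use; SELECT...INTO creates the table on first run, so switch to
-- INSERT INTO ... SELECT for subsequent runs.
/*
SELECT
    SYSDATETIME() AS captured_at,
    COUNT(*) AS total_columns_analyzed,
    SUM(CASE WHEN pii_risk_assessment IN ('CRITICAL_RISK', 'HIGH_RISK') THEN 1 ELSE 0 END) AS high_risk_columns,
    SUM(CASE WHEN overall_quality_level = 'POOR' THEN 1 ELSE 0 END) AS poor_quality_columns,
    SUM(CASE WHEN governance_priority IN ('URGENT', 'HIGH') THEN 1 ELSE 0 END) AS priority_attention_needed
INTO dbo.GovernanceSummaryHistory
FROM #CompleteColumnAnalysis;
*/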
-- ============================================================================
-- EXECUTIVE SUMMARY DASHBOARD
-- ============================================================================
-- Comprehensive high-level view of all columns analyzed
-- Provides key metrics and distributions for executive reporting
SELECT
'=== EXECUTIVE SUMMARY - ALL COLUMNS ANALYSIS ===' AS summary_section,
-- ====================================================================
-- COVERAGE METRICS
-- ====================================================================
COUNT(*) AS total_columns_analyzed, -- Total number of columns processed
COUNT(DISTINCT schema_name) AS schemas_analyzed, -- Number of different schemas covered
COUNT(DISTINCT schema_name + '.' + table_name) AS tables_analyzed, -- Number of tables covered
-- ====================================================================
-- QUALITY DISTRIBUTION ANALYSIS
-- ====================================================================
-- Shows how columns are distributed across quality levels
SUM(CASE WHEN overall_quality_level = 'EXCELLENT' THEN 1 ELSE 0 END) AS excellent_quality_columns,
SUM(CASE WHEN overall_quality_level = 'GOOD' THEN 1 ELSE 0 END) AS good_quality_columns,
SUM(CASE WHEN overall_quality_level = 'FAIR' THEN 1 ELSE 0 END) AS fair_quality_columns,
SUM(CASE WHEN overall_quality_level = 'POOR' THEN 1 ELSE 0 END) AS poor_quality_columns,
-- Calculate quality percentages for better understanding
CAST(SUM(CASE WHEN overall_quality_level = 'EXCELLENT' THEN 1 ELSE 0 END) * 100.0 / COUNT(*) AS DECIMAL(5,2)) AS excellent_quality_percentage,
CAST(SUM(CASE WHEN overall_quality_level = 'POOR' THEN 1 ELSE 0 END) * 100.0 / COUNT(*) AS DECIMAL(5,2)) AS poor_quality_percentage,
-- ====================================================================
-- RISK DISTRIBUTION ANALYSIS
-- ====================================================================
-- Shows distribution of columns across risk levels for security planning
SUM(CASE WHEN pii_risk_assessment = 'CRITICAL_RISK' THEN 1 ELSE 0 END) AS critical_risk_columns,
SUM(CASE WHEN pii_risk_assessment = 'HIGH_RISK' THEN 1 ELSE 0 END) AS high_risk_columns,
SUM(CASE WHEN pii_risk_assessment = 'MEDIUM_RISK' THEN 1 ELSE 0 END) AS medium_risk_columns,
SUM(CASE WHEN pii_risk_assessment = 'LOW_RISK' THEN 1 ELSE 0 END) AS low_risk_columns,
-- Calculate risk percentages
CAST(SUM(CASE WHEN pii_risk_assessment IN ('CRITICAL_RISK', 'HIGH_RISK') THEN 1 ELSE 0 END) * 100.0 / COUNT(*) AS DECIMAL(5,2)) AS high_risk_percentage,
-- ====================================================================
-- GOVERNANCE PRIORITY DISTRIBUTION
-- ====================================================================
-- Shows workload distribution for governance teams
SUM(CASE WHEN governance_priority = 'URGENT' THEN 1 ELSE 0 END) AS urgent_priority_columns,
SUM(CASE WHEN governance_priority = 'HIGH' THEN 1 ELSE 0 END) AS high_priority_columns,
SUM(CASE WHEN governance_priority = 'MEDIUM' THEN 1 ELSE 0 END) AS medium_priority_columns,
SUM(CASE WHEN governance_priority = 'LOW' THEN 1 ELSE 0 END) AS low_priority_columns,
-- ====================================================================
-- AVERAGE QUALITY SCORES
-- ====================================================================
-- Provides baseline metrics for quality assessment
CAST(AVG(overall_column_score) AS DECIMAL(5,2)) AS avg_overall_score,
CAST(AVG(technical_quality_score) AS DECIMAL(5,2)) AS avg_technical_score,
CAST(AVG(structural_integrity_score) AS DECIMAL(5,2)) AS avg_structural_score,
CAST(AVG(naming_quality_score) AS DECIMAL(5,2)) AS avg_naming_score,
CAST(AVG(business_value_score) AS DECIMAL(5,2)) AS avg_business_value_score,
-- ====================================================================
-- CONSTRAINT COVERAGE ANALYSIS
-- ====================================================================
-- Shows how well columns are protected by constraints
SUM(CASE WHEN is_primary_key = 1 THEN 1 ELSE 0 END) AS primary_key_columns,
SUM(CASE WHEN is_foreign_key = 1 THEN 1 ELSE 0 END) AS foreign_key_columns,
SUM(CASE WHEN is_nullable = 0 THEN 1 ELSE 0 END) AS not_null_columns,
SUM(CASE WHEN is_indexed = 1 THEN 1 ELSE 0 END) AS indexed_columns,
-- Calculate constraint coverage percentages
CAST(SUM(CASE WHEN is_nullable = 0 THEN 1 ELSE 0 END) * 100.0 / COUNT(*) AS DECIMAL(5,2)) AS not_null_coverage_percentage
FROM #CompleteColumnAnalysis;
-- ============================================================================
-- SCHEMA-LEVEL BREAKDOWN ANALYSIS
-- ============================================================================
-- Detailed analysis grouped by schema for targeted governance efforts
SELECT
'=== SCHEMA-LEVEL ANALYSIS BREAKDOWN ===' AS breakdown_section,
schema_name,
-- ====================================================================
-- BASIC COVERAGE METRICS
-- ====================================================================
COUNT(*) AS total_columns, -- Total columns in this schema
COUNT(DISTINCT table_name) AS total_tables, -- Total tables in this schema
-- ====================================================================
-- QUALITY METRICS BY SCHEMA
-- ====================================================================
CAST(AVG(overall_column_score) AS DECIMAL(5,2)) AS avg_quality_score,
CAST(AVG(technical_quality_score) AS DECIMAL(5,2)) AS avg_technical_score,
CAST(AVG(structural_integrity_score) AS DECIMAL(5,2)) AS avg_structural_score,
CAST(AVG(naming_quality_score) AS DECIMAL(5,2)) AS avg_naming_score,
-- Quality level distribution by schema
SUM(CASE WHEN overall_quality_level = 'EXCELLENT' THEN 1 ELSE 0 END) AS excellent_quality_count,
SUM(CASE WHEN overall_quality_level = 'GOOD' THEN 1 ELSE 0 END) AS good_quality_count,
SUM(CASE WHEN overall_quality_level = 'FAIR' THEN 1 ELSE 0 END) AS fair_quality_count,
SUM(CASE WHEN overall_quality_level = 'POOR' THEN 1 ELSE 0 END) AS poor_quality_count,
-- ====================================================================
-- RISK METRICS BY SCHEMA
-- ====================================================================
SUM(CASE WHEN pii_risk_assessment IN ('CRITICAL_RISK', 'HIGH_RISK') THEN 1 ELSE 0 END) AS high_risk_count,
SUM(CASE WHEN pii_risk_assessment = 'CRITICAL_RISK' THEN 1 ELSE 0 END) AS critical_risk_count,
SUM(CASE WHEN governance_priority IN ('URGENT', 'HIGH') THEN 1 ELSE 0 END) AS priority_attention_needed,
-- Risk percentage within schema
CAST(SUM(CASE WHEN pii_risk_assessment IN ('CRITICAL_RISK', 'HIGH_RISK') THEN 1 ELSE 0 END) * 100.0 / COUNT(*) AS DECIMAL(5,2)) AS high_risk_percentage,
-- ====================================================================
-- DATA TYPE DISTRIBUTION BY SCHEMA
-- ====================================================================
-- Shows what types of data each schema contains
SUM(CASE WHEN data_type_category = 'TEXT' THEN 1 ELSE 0 END) AS text_columns,
SUM(CASE WHEN data_type_category = 'INTEGER' THEN 1 ELSE 0 END) AS integer_columns,
SUM(CASE WHEN data_type_category = 'NUMERIC' THEN 1 ELSE 0 END) AS numeric_columns,
SUM(CASE WHEN data_type_category = 'TEMPORAL' THEN 1 ELSE 0 END) AS temporal_columns,
SUM(CASE WHEN data_type_category = 'BOOLEAN' THEN 1 ELSE 0 END) AS boolean_columns,
SUM(CASE WHEN data_type_category = 'IDENTIFIER' THEN 1 ELSE 0 END) AS identifier_columns,
-- ====================================================================
-- CONSTRAINT ANALYSIS BY SCHEMA
-- ====================================================================
-- Shows data integrity controls per schema
SUM(CASE WHEN is_primary_key = 1 THEN 1 ELSE 0 END) AS primary_key_columns,
SUM(CASE WHEN is_foreign_key = 1 THEN 1 ELSE 0 END) AS foreign_key_columns,
SUM(CASE WHEN is_nullable = 0 THEN 1 ELSE 0 END) AS not_null_columns,
SUM(CASE WHEN is_indexed = 1 THEN 1 ELSE 0 END) AS indexed_columns,
SUM(CASE WHEN has_default_constraint = 1 THEN 1 ELSE 0 END) AS default_constraint_columns,
-- Constraint coverage percentage for this schema
CAST(SUM(CASE WHEN is_nullable = 0 THEN 1 ELSE 0 END) * 100.0 / COUNT(*) AS DECIMAL(5,2)) AS not_null_coverage_pct,
-- ====================================================================
-- BUSINESS CRITICALITY BY SCHEMA
-- ====================================================================
SUM(CASE WHEN business_criticality = 'CRITICAL' THEN 1 ELSE 0 END) AS critical_business_columns,
SUM(CASE WHEN business_criticality = 'HIGH' THEN 1 ELSE 0 END) AS high_business_columns,
CAST(AVG(business_value_score) AS DECIMAL(5,2)) AS avg_business_value_score
FROM #CompleteColumnAnalysis
GROUP BY schema_name
-- Order by schemas needing most attention first
ORDER BY
high_risk_count DESC, -- Schemas with most high-risk columns first
avg_quality_score ASC, -- Then by lowest quality scores
schema_name; -- Finally alphabetical
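-- ============================================================================
-- OPTIONAL: SCHEMA TRUST SCORE AND SENSITIVITY LEVEL ROLLUP (ILLUSTRATIVE)
-- ============================================================================
-- A sketch of how the schema-level Trust Score / Sensitivity Level pairing can
-- be derived from the column analysis. The mapping is an assumption, not the
-- only valid one: trust = rounded average overall column score; sensitivity =
-- HIGH when any critical-risk column exists or more than 20% of columns are
-- high risk, MEDIUM when any high/medium risk exists, LOW otherwise. Tune the
-- thresholds to your own governance policy.
SELECT
    schema_name,
    CAST(ROUND(AVG(overall_column_score), 0) AS INT) AS trust_score,
    CASE
        WHEN SUM(CASE WHEN pii_risk_assessment = 'CRITICAL_RISK' THEN 1 ELSE 0 END) > 0
          OR SUM(CASE WHEN pii_risk_assessment IN ('CRITICAL_RISK', 'HIGH_RISK') THEN 1 ELSE 0 END) * 100.0 / COUNT(*) > 20
        THEN 'HIGH'
        WHEN SUM(CASE WHEN pii_risk_assessment IN ('HIGH_RISK', 'MEDIUM_RISK') THEN 1 ELSE 0 END) > 0
        THEN 'MEDIUM'
        ELSE 'LOW'
    END AS sensitivity_level
FROM #CompleteColumnAnalysis
GROUP BY schema_name
ORDER BY trust_score DESC;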
-- ============================================================================
-- DATA TYPE ANALYSIS DASHBOARD
-- ============================================================================
-- Analysis of data type usage patterns and appropriateness
SELECT
'=== DATA TYPE DISTRIBUTION AND QUALITY ANALYSIS ===' AS data_type_section,
data_type_category,
-- ====================================================================
-- DATA TYPE USAGE STATISTICS
-- ====================================================================
COUNT(*) AS total_columns, -- How many columns use this data type category
COUNT(DISTINCT schema_name) AS schemas_using, -- How many schemas use this type
COUNT(DISTINCT schema_name + '.' + table_name) AS tables_using, -- How many tables use this type
-- ====================================================================
-- QUALITY METRICS BY DATA TYPE
-- ====================================================================
CAST(AVG(overall_column_score) AS DECIMAL(5,2)) AS avg_overall_score,
CAST(AVG(technical_quality_score) AS DECIMAL(5,2)) AS avg_technical_score,
CAST(AVG(structural_integrity_score) AS DECIMAL(5,2)) AS avg_structural_score,
CAST(AVG(naming_quality_score) AS DECIMAL(5,2)) AS avg_naming_score,
-- ====================================================================
-- DATA TYPE APPROPRIATENESS
-- ====================================================================
-- Shows how well data types are chosen for their purpose, approximated from
-- the categorical levels (the raw appropriateness score is not persisted)
CAST(AVG(CASE WHEN technical_quality_level = 'EXCELLENT' THEN 100
WHEN technical_quality_level = 'GOOD' THEN 85
WHEN technical_quality_level = 'FAIR' THEN 70
ELSE 40 END) AS DECIMAL(5,2)) AS avg_appropriateness_score,
-- ====================================================================
-- RISK ANALYSIS BY DATA TYPE
-- ====================================================================
SUM(CASE WHEN pii_risk_assessment IN ('CRITICAL_RISK', 'HIGH_RISK') THEN 1 ELSE 0 END) AS high_risk_columns,
CAST(SUM(CASE WHEN pii_risk_assessment IN ('CRITICAL_RISK', 'HIGH_RISK') THEN 1 ELSE 0 END) * 100.0 / COUNT(*) AS DECIMAL(5,2)) AS high_risk_percentage,
-- ====================================================================
-- CONSTRAINT COVERAGE BY DATA TYPE
-- ====================================================================
SUM(CASE WHEN is_nullable = 0 THEN 1 ELSE 0 END) AS not_null_columns,
SUM(CASE WHEN is_primary_key = 1 THEN 1 ELSE 0 END) AS primary_key_columns,
SUM(CASE WHEN is_foreign_key = 1 THEN 1 ELSE 0 END) AS foreign_key_columns,
SUM(CASE WHEN is_indexed = 1 THEN 1 ELSE 0 END) AS indexed_columns,
-- Constraint coverage percentage for this data type
CAST(SUM(CASE WHEN is_nullable = 0 THEN 1 ELSE 0 END) * 100.0 / COUNT(*) AS DECIMAL(5,2)) AS not_null_coverage_pct
FROM #CompleteColumnAnalysis
GROUP BY data_type_category
-- Order by most common data types first, then by risk level
ORDER BY
COUNT(*) DESC, -- Most common data types first
high_risk_percentage DESC; -- Then by risk level
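-- ============================================================================
-- OPTIONAL DRILL-DOWN: LIKELY DATA TYPE MISMATCHES (ILLUSTRATIVE)
-- ============================================================================
-- A sketch that follows up on the appropriateness metrics above by surfacing
-- columns whose names suggest one category but whose storage says another,
-- e.g. date-like names stored as text. The name patterns are heuristic
-- assumptions; extend them to match your own conventions.
SELECT schema_name, table_name, column_name, formatted_data_type, technical_quality_level
FROM #CompleteColumnAnalysis
WHERE (column_name LIKE '%date%' AND data_type_category = 'TEXT')
   OR (column_name LIKE '%amount%' AND data_type_category = 'TEXT')
   OR (column_name LIKE '%flag%' AND data_type_category NOT IN ('BOOLEAN', 'INTEGER'))
ORDER BY schema_name, table_name, ordinal_position;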
-- ============================================================================
-- COMPLETE ALL COLUMNS REPORT - EVERY COLUMN IN DATABASE
-- ============================================================================
-- PURPOSE: This report displays EVERY column analyzed in the database
-- SCOPE: Universal coverage - no filtering applied, all columns included
-- USE CASE: Complete governance review, comprehensive audit, systematic analysis
-- OUTPUT: Detailed analysis of every single column with full metadata and recommendations
--
-- REPORT FEATURES:
-- - Every column in every table across all schemas
-- - Complete technical metadata and properties
-- - All quality scores and classifications
-- - Full risk and sensitivity analysis
-- - Comprehensive governance recommendations
-- - Business context and compliance information
-- - Sorted systematically for review (schema → table → column position)
--
-- INTENDED USERS:
-- - Data governance teams conducting comprehensive reviews
-- - Database administrators performing complete audits
-- - Compliance officers reviewing all data elements
-- - Data architects analyzing entire database structure
-- ============================================================================
SELECT
'=== COMPLETE ALL COLUMNS ANALYSIS - EVERY COLUMN ===' AS complete_section,
-- ====================================================================
-- COLUMN IDENTIFICATION SECTION
-- ====================================================================
-- Provides unique identification and positioning information for each column
-- Enables easy reference and navigation through the complete dataset
ROW_NUMBER() OVER (ORDER BY schema_name, table_name, ordinal_position) AS row_number, -- Sequential numbering for reference
schema_name, -- Database schema containing the table
table_name, -- Table containing the column
column_name, -- Name of the column
ordinal_position, -- Position of column within table (1, 2, 3, etc.)
full_column_name, -- Fully qualified name (database.schema.table.column)
-- ====================================================================
-- TECHNICAL DETAILS SECTION
-- ====================================================================
-- Complete technical metadata about the column's data type and characteristics
-- Essential for understanding storage, performance, and compatibility implications
data_type_category, -- High-level data type category (TEXT, INTEGER, NUMERIC, etc.)
base_data_type, -- SQL Server base data type (varchar, int, datetime, etc.)
formatted_data_type, -- Full type specification with length/precision (varchar(50), decimal(10,2))
character_set_name, -- Character encoding for text types (ASCII, UNICODE, or NULL)
-- ====================================================================
-- COLUMN PROPERTIES SECTION
-- ====================================================================
-- Binary properties converted to human-readable YES/NO format
-- Shows structural characteristics and constraints applied to each column
-- NULL CONSTRAINT ANALYSIS
-- Indicates whether column allows NULL values (important for data quality)
CASE WHEN is_nullable = 1 THEN 'YES' ELSE 'NO' END AS allows_nulls,
-- SPECIAL COLUMN TYPES
-- Identity columns are auto-incrementing, computed columns are formula-based
CASE WHEN is_identity = 1 THEN 'YES' ELSE 'NO' END AS is_identity,
CASE WHEN is_computed = 1 THEN 'YES' ELSE 'NO' END AS is_computed,
-- CONSTRAINT PARTICIPATION
-- Shows which types of constraints this column participates in
CASE WHEN is_primary_key = 1 THEN 'YES' ELSE 'NO' END AS is_primary_key, -- Uniquely identifies table rows
CASE WHEN is_foreign_key = 1 THEN 'YES' ELSE 'NO' END AS is_foreign_key, -- References another table
CASE WHEN is_unique_key = 1 THEN 'YES' ELSE 'NO' END AS is_unique_key, -- Unique values (non-PK)
-- PERFORMANCE AND INTEGRITY FEATURES
-- Indexes improve query performance, constraints ensure data quality
CASE WHEN is_indexed = 1 THEN 'YES' ELSE 'NO' END AS has_index, -- Has index for performance
CASE WHEN has_default_constraint = 1 THEN 'YES' ELSE 'NO' END AS has_default, -- Has default value
CASE WHEN has_check_constraint = 1 THEN 'YES' ELSE 'NO' END AS has_check_constraint, -- Has business rule validation
-- ====================================================================
-- QUALITY SCORES SECTION (ALL COLUMNS)
-- ====================================================================
-- Comprehensive quality assessment across multiple dimensions
-- All scores are on 0-100 scale for consistency and comparability
-- OVERALL QUALITY METRICS
-- Composite score combining all quality factors with weighted importance
overall_column_score, -- Primary quality indicator (weighted composite)
overall_quality_level, -- Categorical level: EXCELLENT/GOOD/FAIR/POOR
-- TECHNICAL QUALITY METRICS
-- Focuses on technical implementation and data type appropriateness
technical_quality_score, -- Technical implementation quality (0-100)
technical_quality_level, -- Technical quality category
-- STRUCTURAL QUALITY METRICS
-- Evaluates data integrity controls and relationship management
structural_integrity_score, -- Constraint coverage and relationships (0-100)
structural_quality_level, -- Structural quality category
-- NAMING QUALITY METRICS
-- Assesses naming convention adherence and descriptiveness
naming_quality_score, -- Naming convention quality (0-100)
naming_quality_level, -- Naming quality category
-- BUSINESS VALUE METRICS
-- Measures inferred business importance and operational relevance
business_value_score, -- Business importance and value (0-100)
-- ====================================================================
-- SENSITIVITY AND RISK ANALYSIS SECTION (ALL COLUMNS)
-- ====================================================================
-- Security, privacy, and compliance risk assessment for every column
-- Critical for determining protection requirements and access controls
-- DATA SENSITIVITY CLASSIFICATION
-- Overall sensitivity level based on content type and business context
data_sensitivity_category, -- HIGHLY_SENSITIVE/SENSITIVE/MODERATELY_SENSITIVE/LOW_SENSITIVITY/PUBLIC
-- PII (PERSONALLY IDENTIFIABLE INFORMATION) ANALYSIS
-- Automated detection of potential personal data for privacy compliance
potential_pii_indicator, -- Detected PII category (PERSONAL_NAME, CONTACT_INFO, etc.)
pii_risk_assessment, -- PII risk level (CRITICAL_RISK/HIGH_RISK/MEDIUM_RISK/LOW_RISK/MINIMAL_RISK)
-- BUSINESS AND COMPLIANCE CRITICALITY
-- Assessment of business importance and regulatory compliance requirements
business_criticality, -- Business importance (CRITICAL/HIGH/MEDIUM/STANDARD)
compliance_risk_level, -- Regulatory compliance risk (VERY_HIGH/HIGH/MEDIUM/LOW)
-- ====================================================================
-- GOVERNANCE RECOMMENDATIONS SECTION (ALL COLUMNS)
-- ====================================================================
-- Specific governance controls and requirements based on risk assessment
-- Provides actionable guidance for data protection and compliance
-- DATA CLASSIFICATION AND ACCESS CONTROL
-- Formal classification level determining handling requirements
data_classification, -- RESTRICTED/CONFIDENTIAL/SENSITIVE/BUSINESS_CRITICAL/INTERNAL
access_control_recommendation, -- HIGHLY_RESTRICTED/RESTRICTED/CONTROLLED/MANAGED/STANDARD
-- TECHNICAL PROTECTION REQUIREMENTS
-- Specific technical controls needed based on sensitivity and risk
encryption_recommendation, -- MANDATORY/REQUIRED/RECOMMENDED/NOT_REQUIRED
masking_recommendation, -- Data masking for non-production (REQUIRED/RECOMMENDED/NOT_REQUIRED)
monitoring_priority, -- Access monitoring level (CONTINUOUS/HIGH/MEDIUM/STANDARD)
-- GOVERNANCE PRIORITY AND WORKFLOW
-- Priority level for governance team attention and resource allocation
governance_priority, -- URGENT/HIGH/MEDIUM/LOW (determines work prioritization)
-- ====================================================================
-- BUSINESS CONTEXT SECTION (ALL COLUMNS)
-- ====================================================================
-- Business domain classification and operational context
-- Helps organize governance responsibilities and stewardship assignments
-- BUSINESS DOMAIN AND PURPOSE
-- Classification of business function and operational purpose
domain_category, -- SALES_REVENUE/CUSTOMER_DATA/EMPLOYEE_DATA/OPERATIONS/etc.
inferred_purpose, -- Likely business purpose based on analysis
data_lifecycle_stage, -- ACTIVE_TRANSACTIONAL/ACTIVE_REFERENCE/HISTORICAL_ARCHIVE/TEMPORARY_PROCESSING
-- DATA STEWARDSHIP ASSIGNMENT
-- Recommended steward type based on domain and sensitivity
steward_assignment, -- SALES_DATA_STEWARD/CUSTOMER_DATA_STEWARD/PRIVACY_OFFICER/etc.
-- ====================================================================
-- COMPLIANCE AND REGULATORY SECTION (ALL COLUMNS)
-- ====================================================================
-- Legal and regulatory framework applicability
-- Essential for compliance planning and risk management
-- REGULATORY FRAMEWORK MAPPING
-- Applicable laws and regulations based on data type and business context
regulatory_scope, -- GDPR, CCPA, HIPAA, PCI-DSS, Employment Laws, etc.
retention_category, -- Data retention classification and requirements
-- ====================================================================
-- RECOMMENDATIONS SECTION (ALL COLUMNS)
-- ====================================================================
-- Specific, actionable guidance for each column
-- Drives governance activities and improvement initiatives
-- IMMEDIATE ACTION REQUIREMENTS
-- Urgent actions needed for high-risk or problematic columns
-- NULL indicates no immediate action required
ISNULL(immediate_actions, 'No immediate actions required') AS immediate_actions,
-- IMPROVEMENT RECOMMENDATIONS
-- Suggested enhancements for better governance, quality, or performance
recommended_improvements, -- Specific improvement suggestions
-- ====================================================================
-- ANALYSIS METADATA
-- ====================================================================
-- Information about when and how this analysis was performed
-- Useful for tracking analysis history and determining when refresh is needed
analysis_timestamp -- When this analysis was performed
FROM #CompleteColumnAnalysis
-- ====================================================================
-- NO FILTERING APPLIED - COMPLETE COVERAGE GUARANTEED
-- ====================================================================
-- IMPORTANT: This query includes EVERY column analyzed with no WHERE clause
-- This ensures complete coverage and comprehensive governance review
-- Use this report for:
-- - Complete database audits
-- - Comprehensive governance assessments
-- - Systematic quality reviews
-- - Full compliance evaluations
-- - Complete stewardship planning
-- ====================================================================
-- ORDERING STRATEGY FOR SYSTEMATIC REVIEW
-- ====================================================================
-- Results are ordered to facilitate systematic, comprehensive review
-- Grouping by schema and table enables logical progression through database
ORDER BY
schema_name, -- Group by schema for domain-based review
table_name, -- Then by table for entity-based analysis
ordinal_position; -- Finally by column position for table structure review
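-- ============================================================================
-- OPTIONAL: PAGING THROUGH THE COMPLETE REPORT (ILLUSTRATIVE)
-- ============================================================================
-- The full report above can run to hundreds of rows. A sketch using
-- OFFSET/FETCH to review it in fixed-size pages; increase the OFFSET value to
-- move through the result set (the page size of 50 is an arbitrary choice).
SELECT schema_name, table_name, column_name, overall_quality_level,
       pii_risk_assessment, governance_priority
FROM #CompleteColumnAnalysis
ORDER BY schema_name, table_name, ordinal_position
OFFSET 0 ROWS FETCH NEXT 50 ROWS ONLY;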
-- ============================================================================
-- TOP PRIORITY COLUMNS REQUIRING IMMEDIATE ATTENTION
-- ============================================================================
-- PURPOSE: Filtered subset focusing on columns requiring urgent governance attention
-- SCOPE: High-priority items only - columns with quality issues, risk concerns, or urgent needs
-- USE CASE: Action-oriented governance work, immediate attention items, resource prioritization
-- AUDIENCE: Data stewards, governance teams, administrators needing focused action lists
--
-- FILTERING CRITERIA:
-- - Governance priority: URGENT or HIGH
-- - Quality level: POOR (needs improvement)
-- - Risk level: CRITICAL_RISK or HIGH_RISK (security/privacy concerns)
-- - Business criticality: CRITICAL (business-critical data)
-- - Action required: Has immediate actions specified
--
-- REPORT PURPOSE:
-- - Prioritize governance work and resource allocation
-- - Focus attention on most critical data governance needs
-- - Provide action-oriented recommendations for immediate implementation
-- - Enable efficient triaging of governance workload
-- ============================================================================
SELECT
'=== TOP PRIORITY COLUMNS - IMMEDIATE ATTENTION REQUIRED ===' AS priority_section,
-- ====================================================================
-- COLUMN IDENTIFICATION
-- ====================================================================
schema_name,
table_name,
column_name,
data_type_category,
formatted_data_type,
ordinal_position, -- Position in table for context
-- ====================================================================
-- KEY QUALITY AND RISK METRICS
-- ====================================================================
overall_column_score,
overall_quality_level,
pii_risk_assessment,
governance_priority,
business_criticality,
-- ====================================================================
-- DETAILED RISK INDICATORS
-- ====================================================================
potential_pii_indicator, -- What type of PII detected
data_sensitivity_category, -- Sensitivity classification
compliance_risk_level, -- Regulatory risk level
-- ====================================================================
-- TECHNICAL QUALITY INDICATORS
-- ====================================================================
technical_quality_level,
structural_quality_level,
naming_quality_level,
-- ====================================================================
-- CONSTRAINT AND STRUCTURE ANALYSIS
-- ====================================================================
CASE WHEN is_nullable = 1 THEN 'YES' ELSE 'NO' END AS allows_nulls,
CASE WHEN is_primary_key = 1 THEN 'YES' ELSE 'NO' END AS is_primary_key,
CASE WHEN is_foreign_key = 1 THEN 'YES' ELSE 'NO' END AS is_foreign_key,
CASE WHEN is_indexed = 1 THEN 'YES' ELSE 'NO' END AS has_index,
CASE WHEN has_default_constraint = 1 THEN 'YES' ELSE 'NO' END AS has_default,
-- ====================================================================
-- GOVERNANCE RECOMMENDATIONS
-- ====================================================================
data_classification, -- Required data classification level
access_control_recommendation, -- Access control requirements
encryption_recommendation, -- Encryption requirements
masking_recommendation, -- Data masking requirements
monitoring_priority, -- Monitoring requirements
-- ====================================================================
-- BUSINESS CONTEXT
-- ====================================================================
domain_category, -- Business domain
inferred_purpose, -- Likely business purpose
steward_assignment, -- Suggested data steward
-- ====================================================================
-- ACTIONABLE RECOMMENDATIONS
-- ====================================================================
immediate_actions, -- Urgent actions needed
recommended_improvements, -- Suggested improvements
regulatory_scope -- Applicable regulations
FROM #CompleteColumnAnalysis
-- ====================================================================
-- FILTERING CRITERIA FOR HIGH PRIORITY ITEMS
-- ====================================================================
-- Only show columns that require governance attention or have quality issues
WHERE
governance_priority IN ('URGENT', 'HIGH') -- High governance priority items
OR overall_quality_level = 'POOR' -- Poor quality columns
OR pii_risk_assessment IN ('CRITICAL_RISK', 'HIGH_RISK') -- High-risk PII columns
OR business_criticality = 'CRITICAL' -- Business-critical columns
OR immediate_actions IS NOT NULL -- Columns needing immediate action
-- ====================================================================
-- ORDERING FOR MAXIMUM IMPACT
-- ====================================================================
-- Order by priority so most urgent items appear first
ORDER BY
-- Primary sort: Governance priority (most urgent first)
CASE governance_priority
WHEN 'URGENT' THEN 1
WHEN 'HIGH' THEN 2
WHEN 'MEDIUM' THEN 3
ELSE 4
END,
-- Secondary sort: PII risk level (most risky first)
CASE pii_risk_assessment
WHEN 'CRITICAL_RISK' THEN 1
WHEN 'HIGH_RISK' THEN 2
WHEN 'MEDIUM_RISK' THEN 3
ELSE 4
END,
-- Tertiary sort: Overall quality (worst first)
overall_column_score ASC,
-- Final sorts: Schema, table, column for consistency
schema_name,
table_name,
ordinal_position;
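-- ============================================================================
-- OPTIONAL: PER-STEWARD ACTION LIST (ILLUSTRATIVE)
-- ============================================================================
-- A sketch that turns the priority report above into a working list grouped by
-- the suggested steward, so each role sees only the items assigned to it,
-- most urgent first.
SELECT steward_assignment, schema_name, table_name, column_name,
       governance_priority, immediate_actions
FROM #CompleteColumnAnalysis
WHERE immediate_actions IS NOT NULL
ORDER BY steward_assignment,
         CASE governance_priority WHEN 'URGENT' THEN 1 WHEN 'HIGH' THEN 2 ELSE 3 END,
         schema_name, table_name;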
-- ============================================================================
-- STEWARD ASSIGNMENT SUMMARY
-- ============================================================================
-- Workload distribution for data steward teams
SELECT
'=== DATA STEWARD WORKLOAD DISTRIBUTION ===' AS steward_section,
steward_assignment,
-- Workload metrics
COUNT(*) AS total_columns_assigned,
SUM(CASE WHEN governance_priority IN ('URGENT', 'HIGH') THEN 1 ELSE 0 END) AS high_priority_items,
SUM(CASE WHEN pii_risk_assessment IN ('CRITICAL_RISK', 'HIGH_RISK') THEN 1 ELSE 0 END) AS high_risk_items,
SUM(CASE WHEN overall_quality_level = 'POOR' THEN 1 ELSE 0 END) AS poor_quality_items,
-- Schema distribution for each steward type
COUNT(DISTINCT schema_name) AS schemas_involved,
COUNT(DISTINCT schema_name + '.' + table_name) AS tables_involved,
-- Average quality scores for steward areas
CAST(AVG(overall_column_score) AS DECIMAL(5,2)) AS avg_quality_score,
CAST(AVG(business_value_score) AS DECIMAL(5,2)) AS avg_business_value_score
FROM #CompleteColumnAnalysis
GROUP BY steward_assignment
ORDER BY
high_priority_items DESC, -- Stewards with most urgent work first
total_columns_assigned DESC; -- Then by total workload
PRINT '============================================================================';
PRINT 'COMPLETE ANALYSIS AVAILABLE IN #CompleteColumnAnalysis TABLE';
PRINT 'Use SELECT * FROM #CompleteColumnAnalysis for full details';
PRINT 'Use WHERE clauses to filter by specific criteria (schema, quality, risk, etc.)';
PRINT '============================================================================';
Documentation
These documents demonstrate due diligence in data classification, support regulatory compliance obligations, and drive risk-appropriate controls based on the sensitivity analysis results:
Executive Summary
Analysis Overview Report - High-level metrics showing total columns analyzed, quality distribution, risk assessment summary, and key findings for leadership review.
Risk Assessment Dashboard - Summary of PII detection results, sensitivity classifications, and compliance risk levels across all schemas and tables.
Quality Metrics Summary - Overall database health indicators including average trust scores, constraint coverage, and areas needing improvement.
Classification Documentation
Data Classification Matrix - Formal classification levels assigned to each column (Restricted, Confidential, Sensitive, Internal, Public) with automated detection methodology and manual review processes; a query sketch for generating this matrix follows this list.
PII Detection Report - Detailed findings of personally identifiable information including detection patterns, confidence scores, risk classifications, and false positive/negative analysis.
Sensitivity Assessment Results - Column-by-column sensitivity categorization with business context, risk justification, and escalation criteria for high-sensitivity data.
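A minimal sketch of how the Data Classification Matrix could be generated from the workshop's #CompleteColumnAnalysis table while the session is still open. The column selection is an assumption about what the matrix should contain; adapt it to your documentation template:
SELECT schema_name, table_name, column_name,
       data_classification, data_sensitivity_category,
       access_control_recommendation, encryption_recommendation, regulatory_scope
FROM #CompleteColumnAnalysis
ORDER BY data_classification, schema_name, table_name, ordinal_position;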
Risk & Compliance Documentation
Data Exposure Risk Report - Summary of data exposure risks by sensitivity level, including critical/high-risk data locations and vulnerability assessment.
Regulatory Mapping Report - Links between classified data elements and specific regulatory requirements (GDPR Article 6 vs Article 9, CCPA personal information categories, HIPAA PHI classifications).
Data Protection Impact Assessment (DPIA) - Required documentation for high-risk personal data processing under GDPR, based on classification results.
Access & Security Documentation
Access Control Requirements Matrix - Detailed access restrictions, encryption mandates, and technical controls required for each classification level.
Data Masking Specifications - Requirements for anonymizing/pseudonymizing sensitive data in non-production environments based on classification; see the masking sketch after this list.
Monitoring and Audit Requirements - Specific logging, access monitoring, and audit trail requirements for each sensitivity category.
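As one way a masking_recommendation of REQUIRED could be acted on, here is a minimal sketch using SQL Server Dynamic Data Masking (SQL Server 2016 and later). Note that DDM is a presentation-layer control, not encryption, and the table and column below are just one AdventureWorks example of a contact-information field flagged by the analysis:
-- Masks the email for users without UNMASK permission (example only)
ALTER TABLE Person.EmailAddress
ALTER COLUMN EmailAddress ADD MASKED WITH (FUNCTION = 'email()');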
Validation & Quality Documentation
Classification Accuracy Report - Validation of automated classification results including manual review findings, accuracy metrics, and refinement recommendations.
Sensitivity Escalation Procedures - Process documentation for handling misclassified data, sensitivity level changes, and classification disputes.
False Positive/Negative Analysis - Assessment of classification errors with pattern refinement and algorithm improvement recommendations.