Data Discovery
Download SchemaCrawler:
Install SDKMAN:
curl -s "https://get.sdkman.io" | bash
source "/home/pdc/.sdkman/bin/sdkman-init.sh"
To test the installation, run:
sdk help
Install SchemaCrawler:
sdk install schemacrawler
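Before moving on, you can confirm what SDKMAN installed; the launcher name below matches the schemacrawler.sh commands used throughout this guide:
# Show the SchemaCrawler version SDKMAN has activated
sdk current schemacrawler
# Confirm the launcher used in the examples below is on the PATH
which schemacrawler.sh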
Run SchemaCrawler and connect to the AdventureWorks database to generate a detailed schema diagram covering all schemas:
schemacrawler.sh \
--server=sqlserver \
--host=pdc.pentaho.lab \
--port=1433 \
--database=AdventureWorks2022 \
--user=sa \
--password=StrongPassword123 \
--info-level=maximum \
--command=schema \
--schemas=".*\.(HumanResources|Person|Production|Purchasing|Sales)" \
--output-format=png \
--output-file=adventureworks2022_all_schemas.png
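Note: SchemaCrawler's PNG, SVG, and PDF diagram output generally relies on Graphviz. If the command above complains about diagram rendering, check for the dot binary (the package name below assumes a Debian/Ubuntu host):
# Check whether Graphviz's dot binary is available
dot -V
# Install Graphviz if it is missing (Debian/Ubuntu package name)
sudo apt-get install -y graphviz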
You may want to split this into individual schema diagrams.
For the HumanResources schema:
schemacrawler.sh \
--server=sqlserver \
--host=pdc.pentaho.lab \
--port=1433 \
--database=AdventureWorks2022 \
--user=sa \
--password=StrongPassword123 \
--info-level=maximum \
--command=schema \
--schemas=".*\.HumanResources" \
--output-format=png \
--output-file=adventureworks2022_hr_schema.png
Other Useful Commands
Schema filters (a loop covering all of them follows this list):
# Person schema only
--schemas=".*\.Person"
--output-file=adventureworks2022_person_schema.png
# Production schema only
--schemas=".*\.Production"
--output-file=adventureworks2022_production_schema.png
# Sales schema only
--schemas=".*\.Sales"
--output-file=adventureworks2022_sales_schema.png
# Purchasing schema only
--schemas=".*\.Purchasing"
--output-file=adventureworks2022_purchasing_schema.png
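Rather than running each variant by hand, a short bash loop (a sketch reusing the connection settings from the commands above) renders one diagram per schema; the output filenames simply lowercase each schema name:
# Render one PNG diagram per schema, reusing the connection settings above
for SCHEMA in HumanResources Person Production Purchasing Sales; do
  OUT="adventureworks2022_$(echo "$SCHEMA" | tr '[:upper:]' '[:lower:]')_schema.png"
  schemacrawler.sh \
    --server=sqlserver \
    --host=pdc.pentaho.lab \
    --port=1433 \
    --database=AdventureWorks2022 \
    --user=sa \
    --password=StrongPassword123 \
    --info-level=maximum \
    --command=schema \
    --schemas=".*\.${SCHEMA}" \
    --output-format=png \
    --output-file="$OUT"
done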
Output formats (an SVG example follows this list):
# HTML report
--output-format=html --output-file=schema.html
# JSON output
--output-format=json --output-file=schema.json
# CSV data
--output-format=csv --output-file=schema.csv
# Text report
--output-format=text --output-file=schema.txt
# SVG diagram (vector graphics)
--output-format=svg --output-file=schema.svg
# PDF diagram
--output-format=pdf --output-file=schema.pdf
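For example, reusing the HumanResources command from above with the SVG flags produces a scalable diagram that stays sharp when zoomed:
# Same connection and schema filter as before; only the output flags change
schemacrawler.sh \
--server=sqlserver \
--host=pdc.pentaho.lab \
--port=1433 \
--database=AdventureWorks2022 \
--user=sa \
--password=StrongPassword123 \
--info-level=maximum \
--command=schema \
--schemas=".*\.HumanResources" \
--output-format=svg \
--output-file=adventureworks2022_hr_schema.svg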
Schema Analysis:
# Basic schema info
--command=brief
# Detailed schema with all metadata
--command=details
# List all objects
--command=list
Data Analysis:
# Count rows in all tables
--command=count --output-file=row_counts.txt
# Dump sample data from tables
--command=dump --output-file=sample_data.txt
# Quick data dump (faster, no ordering)
--command=quickdump --output-file=quick_data.txt
Schema Quality Analysis:
# Find schema design issues
--command=lint --output-file=schema_lint_report.txt
Complete Documentation Package:
# Generate comprehensive HTML documentation
schemacrawler.sh \
--server=sqlserver \
--host=pdc.pentaho.lab \
--port=1433 \
--database=AdventureWorks2022 \
--user=sa \
--password=StrongPassword123 \
--info-level=maximum \
--command=details \
--schemas=".*\.(HumanResources|Person|Production|Purchasing|Sales)" \
--output-format=html \
--output-file=adventureworks_complete_docs.html
Data Quality Report:
# Find potential schema issues
schemacrawler.sh \
--server=sqlserver \
--host=pdc.pentaho.lab \
--port=1433 \
--database=AdventureWorks2022 \
--user=sa \
--password=StrongPassword123 \
--info-level=maximum \
--command=lint \
--schemas=".*\.(HumanResources|Person|Production|Purchasing|Sales)" \
--output-file=schema_quality_report.txt
DBeaver
DBeaver Community is a free cross-platform database tool for developers, database administrators, analysts, and everyone working with data. It supports all popular SQL databases, including MySQL, MariaDB, PostgreSQL, SQLite, the Apache family, and more.
From the drop-down list, select SQL Server

Enter the connection details:
Host: localhost / pdc.pentaho.lab
Port: 1433
Database / Schema: AdventureWorks2022
Username: sa
Password: StrongPassword123
Show All Schemas: ✅ ON
Trust Server Certificate: ✅ ON
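For reference, the same settings expressed as a JDBC URL (the format used by Microsoft's SQL Server driver, and by the optional jdbc_url field in config/config.yaml later in this guide) would look roughly like this; treat it as a sketch rather than the exact string DBeaver generates:
# Equivalent JDBC URL (sketch; encrypt/trustServerCertificate mirror the settings above)
JDBC_URL="jdbc:sqlserver://pdc.pentaho.lab:1433;databaseName=AdventureWorks2022;encrypt=true;trustServerCertificate=true"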

Click Test Connection.
If prompted, download the database driver.

Successful connection!

Expand the Schemas.

Project Installation
The workflow is:
1. Copy: Workshop → Projects (using setup.sh)
2. Configure: Update .env, download drivers, configure settings
3. Install: Run install.sh to set up the Python environment
4. Test: Activate the venv and test the connection
If you cloned the Workshop--Data-Catalog repository, check that Projects/data-discovery/setup.sh is executable:
chmod +x /home/pdc/Workshop--Data-Catalog/Projects/data-discovery/setup.sh
Run setup.sh:
# Run the setup script
cd
cd ~/Workshop--Data-Catalog/Projects/data-discovery
./setup.sh
The script will create all the directories and copy over all the supporting application files.
Configure the application configuration files.
cd
cd ~/Projects/Data_Discovery
# Configure environment
nano .env # Update database credentials
# Download JDBC drivers
# Place them in ./drivers/ directory
# Update configuration (if needed)
nano config/config.yaml
example .env:
# Environment Variables for Data Discovery
# Copy this file to .env and update with your actual values
# AdventureWorks Database Connection
AW_HOST=localhost
AW_PORT=1433
AW_DATABASE=AdventureWorks2022
AW_USERNAME=sa
AW_PASSWORD=StrongPassword123
# General Database Connection (for other databases)
DB_HOST=localhost
DB_PORT=1433
DB_DATABASE=AdventureWorks2022
DB_USERNAME=sa
DB_PASSWORD=StrongPassword123
DB_DRIVER=sqlserver
# JDBC Driver Path (optional)
JDBC_DRIVER_PATH=./drivers/
# Output Configuration
OUTPUT_DIR=./reports
LOG_DIR=./logs
# API Keys (if needed for external services)
# ENCRYPTION_KEY=your_encryption_key
# API_KEY=your_api_key
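Since .env stores credentials in plain text, a quick permissions tighten-up after editing is worthwhile (a small sketch; adjust to your own security practice):
# Restrict the credentials file to your user only
chmod 600 .env
# Confirm the expected AdventureWorks variables are present (prints a count, not values)
grep -c '^AW_' .env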
example config/config.yaml:
# =============================================================================
# Data Discovery Configuration File
# =============================================================================
# Complete configuration matching README.md documentation
# Update with your specific database and environment settings
#
# This file controls all aspects of the Data Discovery application including:
# - Database connectivity and performance
# - PII classification rules and thresholds
# - Trust scoring methodology (Industry standard)
# - Dashboard behavior and caching
# - Business domain mapping and analytics
# =============================================================================
# =============================================================================
# DATABASE CONFIGURATION
# =============================================================================
# Controls how the application connects to and interacts with your database
database:
  # Basic connection settings - Update these for your database
  host: "localhost"                  # Database server hostname or IP
  port: 1433                         # Database port (1433 for SQL Server)
  database: "AdventureWorks2022"     # Database name to analyze
  username: "sa"                     # Database username
  password: "${DB_PASSWORD}"         # Use environment variable for security
  driver: "sqlserver"                # Database type: sqlserver, postgresql, mysql, oracle

  # Connection management and timeouts
  connection_timeout: 30             # Seconds to wait for initial connection
  query_timeout: 300                 # Seconds to wait for query completion (5 minutes)
  max_connections: 10                # Maximum concurrent database connections

  # Performance optimization for large databases
  batch_size: 1000                   # Number of tables to process in each batch
  parallel_workers: 4                # Number of parallel processing threads

  # JDBC-specific connection parameters
  jdbc_url: null                     # Custom JDBC URL (overrides host/port if set)
  additional_params:
    TrustServerCertificate: "true"   # Accept self-signed certificates
    IntegratedSecurity: "false"      # Use SQL Server authentication

# =============================================================================
# DISCOVERY CONFIGURATION
# =============================================================================
# Controls what data is discovered and how the analysis is performed
discovery:
  # Database schemas to include in discovery
  # Leave empty ([]) to discover all schemas, or specify specific ones
  schemas:
    - "Sales"            # Sales and customer transactions
    - "Production"       # Manufacturing and inventory data
    - "HumanResources"   # Employee and HR information
    - "Purchasing"       # Vendor and procurement data
    - "Person"           # Customer and contact information

  # Data sampling for analysis (affects performance vs. accuracy)
  sample_size: 100             # Number of rows to sample per table for pattern analysis

  # Usage pattern analysis timeframe
  usage_analysis_days: 30      # Days of historical usage data to analyze

# =============================================================================
# CLASSIFICATION CONFIGURATION
# =============================================================================
# Controls PII detection and data classification using Pentaho methodology
classification:
  # Classification accuracy vs. performance trade-off
  confidence_threshold: 0.7            # Minimum confidence score (0.0-1.0) to classify as PII

  # Rule engine settings
  enable_custom_rules: true            # Enable user-defined classification patterns
  enable_adventureworks_rules: true    # Enable AdventureWorks-specific classification rules

  # Pentaho standard PII categories to detect
  # These align with industry-standard privacy frameworks (GDPR, CCPA, etc.)
  pii_categories:
    - "GOVERNMENT_ID"    # SSN, passport, national ID, tax ID
    - "PERSONAL_NAME"    # First, last, full names, display names
    - "CONTACT_INFO"     # Email addresses, phone numbers
    - "ADDRESS_INFO"     # Street, city, zip, postal codes
    - "DEMOGRAPHIC"      # Birth date, gender, age, ethnicity
    - "FINANCIAL"        # Salary, credit cards, bank accounts
    - "HEALTH"           # Medical records, diagnosis, treatment
    - "AUTHENTICATION"   # Passwords, secrets, tokens
    - "BIOMETRIC"        # Fingerprints, photos, biometric data

  # Performance optimization
  cache_results: true                  # Cache classification results to improve performance

  # Custom classification patterns
  custom_patterns:
    - name: "Custom SSN Pattern"
      category: "GOVERNMENT_ID"
      sensitivity: "HIGH"
      column_patterns: ["ssn", "social_security"]
      data_patterns: ["^\\d{3}-?\\d{2}-?\\d{4}$"]
    - name: "Email Pattern"
      category: "CONTACT_INFO"
      sensitivity: "MEDIUM"
      column_patterns: ["email", "mail"]
      data_patterns: ["^[\\w\\.-]+@[\\w\\.-]+\\.[a-zA-Z]{2,}$"]

# =============================================================================
# TRUST SCORING CONFIGURATION (Industry Methodology)
# =============================================================================
# Implements Industry-standard trust scoring methodology
# Reference: https://academy.pentaho.com/pentaho-data-catalog/use-case/adventure-works/
trust_scoring:
  # Industry methodology component weights (must sum to 1.0)
  # These weights determine how different factors contribute to the overall trust score
  business_value_weight: 0.30    # Business importance and usage (30%)
  data_type_weight: 0.25         # Data type appropriateness and consistency (25%)
  constraints_weight: 0.25       # Structural integrity and constraints (25%)
  naming_weight: 0.20            # Naming conventions and descriptiveness (20%)

  # Trust score classification thresholds (0-100 scale)
  excellent_threshold: 75        # Excellent governance (75-100) - Green
  good_threshold: 65             # Good governance (65-74) - Light Green
  fair_threshold: 50             # Fair governance (50-64) - Yellow
                                 # Below 50 = Poor governance - Red

  # Risk level assessment thresholds
  critical_threshold: 40         # Critical risk (0-39) - Immediate attention required
  high_threshold: 60             # High risk (40-59) - Priority remediation
  medium_threshold: 75           # Medium risk (60-74) - Monitor and improve
                                 # Above 75 = Low risk - Well governed

# =============================================================================
# DASHBOARD CONFIGURATION
# =============================================================================
# Controls the web dashboard behavior, performance, and user experience
dashboard:
  # Caching settings for improved performance
  enable_caching: true           # Enable server-side caching of dashboard data
  cache_duration: 300            # Cache lifetime in seconds (5 minutes)

  # User interface display limits
  max_results_per_page: 100      # Maximum items per page in filtered views
  max_records_display: 10000     # Maximum total records to display in UI

  # Performance optimization
  enable_compression: true       # Compress API responses to reduce bandwidth

# =============================================================================
# COMPLIANCE CONFIGURATION
# =============================================================================
# Defines regulatory frameworks and compliance assessment parameters
compliance:
  # Regulatory frameworks to assess against
  # The system will evaluate data handling practices against these standards
  regulations:
    - "GDPR"      # General Data Protection Regulation (EU)
    - "CCPA"      # California Consumer Privacy Act (US)
    - "HIPAA"     # Health Insurance Portability and Accountability Act (US)
    - "SOX"       # Sarbanes-Oxley Act (US)
    - "PCI_DSS"   # Payment Card Industry Data Security Standard

  # Compliance monitoring schedule
  assessment_frequency: 90       # Days between automated compliance assessments

# =============================================================================
# OUTPUT CONFIGURATION
# =============================================================================
# Controls report generation, file formats, and export behavior
output:
  # File system settings
  directory: "./reports"         # Directory for generated reports and exports

  # Supported export formats
  formats:
    - "csv"    # Comma-separated values (data analysis)
    - "json"   # JavaScript Object Notation (API integration)
    - "xlsx"   # Excel workbook (business users)
    - "pdf"    # Portable Document Format (executive reports)

  # Data inclusion settings
  include_samples: true          # Include sample data values in classification reports

  # File naming convention
  timestamp_format: "%Y%m%d_%H%M%S"   # Format: YYYYMMDD_HHMMSS

# =============================================================================
# LOGGING CONFIGURATION
# =============================================================================
# Controls application logging, debugging, and troubleshooting capabilities
logging:
  # Log verbosity level
  #   DEBUG: Detailed diagnostic information
  #   INFO: General operational messages (recommended)
  #   WARNING: Important warnings and issues
  #   ERROR: Only error conditions
  level: "INFO"

  # Log file location
  file: "./logs/data_discovery.log"   # Main application log file

  # Log rotation to prevent disk space issues
  max_file_size: "10MB"          # Maximum size before rotation
  backup_count: 5                # Number of backup log files to keep

# =============================================================================
# BUSINESS DOMAIN MAPPING
# =============================================================================
# Maps database schemas to business domains and defines their relative importance
business_domains:
  # Schema to business domain classification
  # This mapping helps contextualize data within business functions
  schema_mappings:
    Sales: "SALES"                  # Sales transactions and customer relationships
    Production: "MANUFACTURING"     # Manufacturing processes and inventory
    HumanResources: "HR"            # Employee data and human resources
    Purchasing: "PROCUREMENT"       # Vendor relationships and procurement
    Person: "CUSTOMER_DATA"         # Customer profiles and contact information

  # Business criticality weights (1-10 scale)
  # Higher weights indicate more business-critical data requiring stronger governance
  importance_weights:
    SALES: 10.0            # Highest priority - revenue generating
    CUSTOMER_DATA: 9.0     # High priority - customer relationships
    HR: 9.0                # High priority - employee information
    MANUFACTURING: 8.0     # Important - operational data
    PROCUREMENT: 7.0       # Important - vendor relationships
    DEFAULT: 6.0           # Standard priority for unmapped schemas

# =============================================================================
# ADVANCED ANALYTICS CONFIGURATION
# =============================================================================
# Controls advanced data quality analysis and usage pattern tracking
analytics:
  # Feature enablement
  enable_advanced_analytics: true    # Enable comprehensive data quality analysis

  # Data quality assessment thresholds (0.0-1.0 scale)
  quality_threshold: 0.8             # Minimum quality score for "good" data
  completeness_threshold: 0.9        # Minimum completeness for "complete" data

  # Usage pattern tracking and analysis
  track_usage_patterns: true         # Monitor and analyze data access patterns
  usage_retention_days: 365          # Days of usage history to retain for analysis
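After editing, a quick parse check catches indentation mistakes before they surface as a failed startup. This one-liner assumes PyYAML is available in the project's Python environment:
# Validate that config/config.yaml parses as YAML (requires PyYAML)
python3 -c "import yaml; yaml.safe_load(open('config/config.yaml')); print('config.yaml parses OK')"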
Run install.sh to set up the UV-managed Python environment.
cd
cd ~/Projects/Data_Discovery
./install.sh
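When install.sh completes, activate the environment it created before testing anything (a sketch; the .venv directory name is an assumption about what the script sets up):
# Activate the project virtual environment (directory name assumed)
source .venv/bin/activate
# Confirm the shell now resolves Python from the project environment
which python3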
Troubleshooting
Executive Overview
Server health check
curl -s http://localhost:6001/api/health | python3 -c "import json, sys; data = json.load(sys.stdin); print('Dashboard Status:', data.get('status', 'Unknown'))"
echo "=== COMPREHENSIVE SYSTEM VERIFICATION ===" && curl -s http://localhost:6001/api/analytics | python3 -c "import json, sys; data = json.load(sys.stdin); print(' Analytics API: ✅ Working'); print(' Total Tables:', data.get('inventory', {}).get('total_tables', 'N/A')); print(' Total Columns:', data.get('classification', {}).get('total_columns', 'N/A')); print(' Classified Columns:', data.get('classification', {}).get('classified_columns', 'N/A')); print('⭐ Database Trust Score:', data.get('database_trust_score', 'N/A')); print(' Governance Level:', data.get('governance_level', 'N/A')); print(' Trust Score Levels:', len(data.get('trust_scores', {}))); print('️ Classification Categories:', len(data.get('classification', {}).get('category_distribution', {}))); print('✅ DAMA-DMBOK Integration: COMPLETE')"
Start server
cd
cd ~/Projects/Data_Discovery
python3 dashboard/server.py --port 6001
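If the dashboard should keep running after the terminal closes, a background variant of the same command (a sketch) is:
# Run the dashboard in the background and capture its output for troubleshooting
mkdir -p logs
nohup python3 dashboard/server.py --port 6001 > logs/dashboard_server.log 2>&1 &
echo "Dashboard started with PID $!"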
Wait for the server to come up:
bash -lc 'for i in {1..20}; do STATUS=$(curl -s -o /dev/null -w "%{http_code}" http://127.0.0.1:6001/healthz || true); if [ "$STATUS" = "200" ]; then echo READY; exit 0; fi; sleep 0.5; done; echo NOT_READY'
Restart server: first stop any process listening on port 6001, then start the dashboard again.
PID=$(lsof -iTCP:6001 -sTCP:LISTEN -t 2>/dev/null || true); if [ -n "$PID" ]; then echo "Killing PID $PID (port 6001)"; kill $PID || true; sleep 1; fi; echo "Done."
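Putting the pieces together, a full restart can be sketched as: free the port, relaunch the dashboard, then poll the health endpoint used earlier until it responds (paths and endpoints are the ones shown above):
# Stop any listener on port 6001, relaunch the dashboard, and wait for it to answer
PID=$(lsof -iTCP:6001 -sTCP:LISTEN -t 2>/dev/null || true)
[ -n "$PID" ] && kill "$PID" && sleep 1
cd ~/Projects/Data_Discovery
nohup python3 dashboard/server.py --port 6001 > logs/dashboard_server.log 2>&1 &
for i in {1..20}; do
  curl -fs http://localhost:6001/api/health > /dev/null && { echo "Dashboard is up"; break; }
  sleep 0.5
done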