MLflow

MLflow is an open-source platform for managing the machine learning lifecycle. It's designed to help data scientists and ML engineers track, reproduce, and deploy machine learning models more effectively.

MLflow contains the following components.

MLflow Tracking - An engineer will use this feature the most. It allows experiments to be recorded and queried. It also keeps track of the code, data, configuration and results for each experiment.
MLflow Projects - Allows experiments to be reproduced by packaging the code into a platform agnostic format.
MLflow Models - Deploys machine learning models to an environment where they can be served.
MLflow Repositories - Allows for the storage, annotation, discovery, and management of models in a central repository.

Setup

The setup.sh script will get the ball rolling ..

Execute the following script - setup.sh:

cd
cd Workshop--Data-Catalog/MLflow
./setup.sh

Directory at runtime:

Take a look at the configuration files:

Docker & Docker-Compose

Review the docker-compose.yml.

Here's the docker-compose.yml file:

# MLflow Docker Compose Configuration with Pentaho Data Catalog Integration
# This setup provides a complete MLflow environment with:
# - PostgreSQL backend for metadata storage and Model Registry
# - MinIO S3-compatible storage for artifacts
# - MLflow tracking server with Model Registry enabled
# - Pentaho Data Catalog integration support

services:
  # ===================================================================================
  # PostgreSQL Database Service
  # Stores MLflow metadata, experiments, runs, parameters, metrics, and Model Registry
  # ===================================================================================
  db:
    restart: always                              # Always restart container if it stops
    image: postgres:13                           # PostgreSQL 13 - stable version compatible with MLflow
    container_name: mlflow_db                    # Fixed container name for easy reference
    
    # Network exposure configuration
    expose:
      - "${PG_PORT}"                            # Expose port to other containers in network
    ports:
      - "${PG_PORT}:5432"                       # Map external port (from .env) to internal port 5432
                                                # External: for host/PDC connections, Internal: standard PostgreSQL port
    
    networks:
      - backend                                 # Connect to backend network (isolated from external access)
    
    # PostgreSQL configuration via environment variables
    environment:
      - POSTGRES_USER=${PG_USER}                # Database username (from .env file)
      - POSTGRES_PASSWORD=${PG_PASSWORD}        # Database password (from .env file) 
      - POSTGRES_DB=${PG_DATABASE}              # Initial database name to create (from .env file)
    
    # Data persistence and initialization
    volumes:
      - db_data:/var/lib/postgresql/data/       # Persistent volume for database data
      - ./init-db.sql:/docker-entrypoint-initdb.d/init-db.sql  # Initialize MLflow registry tables and permissions
    
    # Health monitoring - checks if PostgreSQL is ready to accept connections
    healthcheck:
      test: ["CMD", "pg_isready", "-p", "5432", "-U", "${PG_USER}"]  # Uses internal port 5432
      interval: 5s                              # Check every 5 seconds
      timeout: 5s                               # Wait max 5 seconds for response
      retries: 3                                # Retry 3 times before marking unhealthy

  # ===================================================================================
  # MinIO S3-Compatible Object Storage Service
  # Stores MLflow artifacts (models, plots, files, datasets)
  # ===================================================================================
  s3:
    restart: always                              # Always restart container if it stops
    image: minio/minio:RELEASE.2025-04-22T22-12-26Z  # Specific MinIO version with full admin UI
    container_name: mlflow_minio                 # Fixed container name for easy reference
    
    # Data persistence
    volumes:
      - minio_data:/data                        # Persistent volume for object storage data
    
    # Network exposure configuration
    ports:
      - "${MINIO_PORT}:9000"                    # MinIO API port (S3-compatible interface)
      - "${MINIO_CONSOLE_PORT}:9001"            # MinIO web console port (admin UI)
    
    networks:
      - frontend                                # Connect to frontend (for web console access)
      - backend                                 # Connect to backend (for MLflow server access)
    
    # MinIO configuration via environment variables
    environment:
      - MINIO_ROOT_USER=${MINIO_ROOT_USER}      # MinIO admin username (from .env file)
      - MINIO_ROOT_PASSWORD=${MINIO_ROOT_PASSWORD}  # MinIO admin password (from .env file)
      - MINIO_ADDRESS=${MINIO_ADDRESS}          # Internal MinIO server address
      - MINIO_PORT=${MINIO_PORT}                # MinIO API port
      - MINIO_STORAGE_USE_HTTPS=${MINIO_STORAGE_USE_HTTPS}  # Enable/disable HTTPS
      - MINIO_CONSOLE_ADDRESS=${MINIO_CONSOLE_ADDRESS}      # Console web interface address
    
    # MinIO server startup command with console configuration
    command: server /data --console-address ":9001"  # Start MinIO server with web console on port 9001
    
    # Health monitoring - checks if MinIO API is responding
    healthcheck:
      test: ["CMD", "curl", "-f", "http://localhost:9000/minio/health/live"]  # Internal health endpoint
      interval: 30s                             # Check every 30 seconds
      timeout: 20s                              # Wait max 20 seconds for response
      retries: 3                                # Retry 3 times before marking unhealthy

  # ===================================================================================
  # MinIO Bucket Initialization Service
  # One-time service that creates the MLflow bucket and sets permissions
  # ===================================================================================
  createbuckets:
    image: minio/mc                             # MinIO client for bucket management
    depends_on:
      - s3                                      # Wait for MinIO service to start
    networks:
      - backend                                 # Connect to backend network to access MinIO
    
    # Initialization script that runs once and exits
    entrypoint: >
      /bin/sh -c "
      /usr/bin/mc alias set myminio http://s3:9000 ${MINIO_ROOT_USER} ${MINIO_ROOT_PASSWORD};
      /usr/bin/mc mb myminio/${MLFLOW_BUCKET_NAME};
      /usr/bin/mc policy set public myminio/${MLFLOW_BUCKET_NAME};
      exit 0;
      "
    # Script breakdown:
    # 1. Create alias 'myminio' pointing to MinIO server with credentials
    # 2. Create bucket with name from MLFLOW_BUCKET_NAME environment variable
    # 3. Set bucket policy to public (allows MLflow to access artifacts)
    # 4. Exit successfully (container will show as 'Exited 0')

  # ===================================================================================
  # MLflow Tracking Server
  # Main MLflow service with Model Registry, artifact serving, and web UI
  # Configured for Pentaho Data Catalog integration
  # ===================================================================================
  tracking_server:
    restart: always                              # Always restart container if it stops
    build: ./mlflow                             # Build custom image from ./mlflow directory (includes Pentaho integration)
    image: mlflow_server                        # Tag for the built image
    container_name: mlflow_server               # Fixed container name for easy reference
    
    # Service dependencies with health checks
    depends_on:
      db:
        condition: service_healthy              # Wait for PostgreSQL to be healthy
      s3:
        condition: service_healthy              # Wait for MinIO to be healthy  
      createbuckets:
        condition: service_completed_successfully  # Wait for bucket creation to complete
    
    # Network exposure
    ports:
      - "${MLFLOW_PORT}:5000"                   # Map external port (from .env) to internal MLflow port 5000
    
    networks:
      - frontend                                # Connect to frontend (for web UI access)
      - backend                                 # Connect to backend (for database and MinIO access)
    
    # Environment variables for MLflow server configuration
    environment:
      # Database connection variables (passed to entrypoint script)
      - PG_USER=${PG_USER}                      # PostgreSQL username
      - PG_PASSWORD=${PG_PASSWORD}              # PostgreSQL password
      - PG_DATABASE=${PG_DATABASE}              # PostgreSQL database name
      - PG_PORT=${PG_PORT}                      # PostgreSQL external port (for reference)
      
      # MinIO/S3 configuration for artifact storage
      - AWS_ACCESS_KEY_ID=${MINIO_ACCESS_KEY}   # MinIO access key (S3-compatible)
      - AWS_SECRET_ACCESS_KEY=${MINIO_SECRET_ACCESS_KEY}  # MinIO secret key (S3-compatible)
      - MLFLOW_S3_ENDPOINT_URL=http://s3:${MINIO_PORT}    # Internal MinIO endpoint URL
      - MLFLOW_S3_IGNORE_TLS=true               # Disable TLS for internal MinIO communication
      
      # MLflow-specific configuration
      - MLFLOW_REGISTRY_URI=${MLFLOW_REGISTRY_URI}         # Model Registry database connection
      - MLFLOW_DEFAULT_ARTIFACT_ROOT=${MLFLOW_DEFAULT_ARTIFACT_ROOT}  # Default artifact storage location
      
      # Pentaho Data Catalog integration configuration (optional - for logging/monitoring)
      # PDC connects TO MLflow, not the other way around
      - PENTAHO_DATA_CATALOG_URL=${PENTAHO_DATA_CATALOG_URL}           # PDC server URL
      - PENTAHO_DATA_CATALOG_USERNAME=${PENTAHO_DATA_CATALOG_USERNAME} # PDC username
      - PENTAHO_DATA_CATALOG_PASSWORD=${PENTAHO_DATA_CATALOG_PASSWORD} # PDC password
    
    # MLflow server startup command with full configuration
    command: >
      mlflow server
      --backend-store-uri postgresql://${PG_USER}:${PG_PASSWORD}@db:5432/${PG_DATABASE}
      --registry-store-uri postgresql://${PG_USER}:${PG_PASSWORD}@db:5432/${PG_DATABASE}
      --default-artifact-root s3://${MLFLOW_BUCKET_NAME}
      --host 0.0.0.0
      --port 5000
      --serve-artifacts
      --artifacts-destination s3://${MLFLOW_BUCKET_NAME}
    # Command breakdown:
    # --backend-store-uri: PostgreSQL connection for experiment metadata (uses internal port 5432)
    # --registry-store-uri: PostgreSQL connection for Model Registry (same as backend for simplicity)
    # --default-artifact-root: Default S3 bucket for storing artifacts
    # --host 0.0.0.0: Bind to all interfaces (allows external connections)
    # --port 5000: Internal MLflow server port
    # --serve-artifacts: Enable artifact serving through MLflow server
    # --artifacts-destination: S3 bucket for artifact uploads
    
    # Health monitoring - checks if MLflow web server is responding
    healthcheck:
      test: ["CMD", "curl", "-f", "http://localhost:5000/health"]  # Internal health endpoint
      interval: 30s                             # Check every 30 seconds
      timeout: 10s                              # Wait max 10 seconds for response
      retries: 3                                # Retry 3 times before marking unhealthy

# ===================================================================================
# Persistent Data Volumes
# These volumes persist data across container restarts and recreations
# ===================================================================================
volumes:
  db_data:                                      # PostgreSQL data persistence
    # Stores database files, ensures data survives container recreation
  minio_data:                                   # MinIO object storage persistence  
    # Stores artifact files, ensures artifacts survive container recreation

# ===================================================================================
# Docker Networks
# Separate networks for security and traffic management
# ===================================================================================
networks:
  frontend:                                     # External-facing network
    driver: bridge                              # Standard Docker bridge network
    # Used for: Web UI access, MinIO console access
    
  backend:                                      # Internal network for service communication
    driver: bridge                              # Standard Docker bridge network  
    # Used for: Database connections, MinIO API access, internal service communication
    # More secure as it's isolated from external access

# ===================================================================================
# Architecture Summary:
# 
# External Access:
# - MLflow UI: http://host:5000
# - MinIO Console: http://host:9001  
# - PostgreSQL: host:5435 (for external tools like Pentaho Data Catalog)
#
# Internal Communication (Docker network):
# - MLflow ↔ PostgreSQL: db:5432
# - MLflow ↔ MinIO: s3:9000
# - Bucket creation ↔ MinIO: s3:9000
#
# Data Flow:
# 1. MLflow stores metadata (experiments, runs, params, metrics) in PostgreSQL
# 2. MLflow stores artifacts (models, files, plots) in MinIO S3 buckets
# 3. Model Registry metadata stored in same PostgreSQL database
# 4. Pentaho Data Catalog connects externally to discover and catalog ML assets
#
# Pentaho Data Catalog Integration:
# - PDC connects to MLflow via REST API (http://host:5000)
# - PDC can access PostgreSQL directly for enhanced metadata queries (host:5435)
# - Model Registry enables PDC to discover and catalog ML models, versions, experiments
# ===================================================================================

both MinIO and PostgreSQL are using the local file system to store data.

.env

Docker Compose automatically reads a file named .env in the same directory.

You will need to update this file once you've generated your minIO keys.

Here's the .env file:

# ===================================================================================
# MLflow Environment Configuration File (.env)
# This file contains all environment variables for the MLflow Docker Compose setup
# with Pentaho Data Catalog integration support
# ===================================================================================

# ===================================================================================
# PostgreSQL Database Configuration
# Used for MLflow metadata storage and Model Registry backend
# ===================================================================================
PG_USER=mlflow                              # PostgreSQL database username
PG_PASSWORD=mlflow                          # PostgreSQL database password  
PG_DATABASE=mlflow                          # PostgreSQL database name for MLflow
PG_PORT=5435                               # External port for PostgreSQL (host access)
                                           # Internal port is always 5432
                                           # External port used by: host connections, Pentaho Data Catalog
                                           # Internal port used by: MLflow server container connections

# ===================================================================================
# MLflow Server Configuration  
# Core MLflow tracking server settings
# ===================================================================================
MLFLOW_PORT=5000                           # External port for MLflow web UI and API
                                           # Access MLflow at: http://your-host:5000
                                           # Used by: web browsers, API clients, Pentaho Data Catalog

MLFLOW_BUCKET_NAME=mlflow-artefacts        # S3 bucket name for storing MLflow artifacts
                                           # Contains: models, plots, datasets, experiment files
                                           # Note: hyphen in name is intentional (S3 naming convention)

# ===================================================================================
# MinIO Object Storage Configuration (S3-Compatible)
# Used for MLflow artifact storage (models, datasets, plots, files)
# ===================================================================================

# MinIO Authentication - Used by MLflow to access artifact storage
MINIO_ACCESS_KEY=YuA9IVoTrkIAgiLb8Mfv     # S3-compatible access key for MLflow
MINIO_SECRET_ACCESS_KEY=WlR5eRpiIhZGLVSTLPtyzJakMWRUNf1KvqMhvLAA  # S3-compatible secret key for MLflow
                                           # These keys are used by MLflow server to store/retrieve artifacts
                                           # Different from root user credentials below

# MinIO Root Administrator Credentials - Used for initial setup and admin access
MINIO_ROOT_USER=minio_user                 # MinIO administrator username (for console login)
MINIO_ROOT_PASSWORD=minio_pwd              # MinIO administrator password (for console login)
                                           # Used by: MinIO web console, bucket creation script

# MinIO Server Configuration
MINIO_ADDRESS=:9000                        # Internal MinIO server address (all interfaces, port 9000)
MINIO_STORAGE_USE_HTTPS=false              # Disable HTTPS for internal communication (faster)
MINIO_CONSOLE_ADDRESS=:9001                # Internal MinIO web console address (all interfaces, port 9001)
MINIO_PORT=9000                           # External port for MinIO S3 API
MINIO_CONSOLE_PORT=9001                   # External port for MinIO web console
                                           # Access MinIO console at: http://your-host:9001

# ===================================================================================
# MLflow Advanced Configuration
# Settings for Model Registry and artifact handling
# ===================================================================================

# Model Registry Database Connection (REQUIRED for Pentaho Data Catalog)
MLFLOW_REGISTRY_URI=postgresql://${PG_USER}:${PG_PASSWORD}@db:${PG_PORT}/${PG_DATABASE}
                                           # Database connection string for MLflow Model Registry
                                           # Uses variables defined above for consistency
                                           # Note: Uses internal container name 'db' and external port for flexibility

# Default Artifact Storage Location  
MLFLOW_DEFAULT_ARTIFACT_ROOT=s3://${MLFLOW_BUCKET_NAME}
                                           # Default S3 location for storing all MLflow artifacts
                                           # Uses bucket name variable defined above
                                           # Format: s3://bucket-name

# ===================================================================================
# Pentaho Data Catalog Integration Configuration
# These settings are OPTIONAL - PDC connects TO MLflow, not the other way around
# Only used for logging and monitoring purposes in MLflow container
# ===================================================================================

# Pentaho Data Catalog Server Details (for reference/logging only)
PENTAHO_DATA_CATALOG_URL=http://pdc-pentaho.lab:80    # Your PDC server URL
[email protected]            # PDC username (for logging reference)
PENTAHO_DATA_CATALOG_PASSWORD=Welcome123!             # PDC password (for logging reference)
                                           # These credentials are NOT used by MLflow to connect to PDC
                                           # PDC connects to MLflow using MLflow's REST API
                                           # These are stored here for documentation/reference

# ===================================================================================
# MLflow Model Registry Configuration (REQUIRED for Pentaho Data Catalog Integration)
# Enables MLflow Model Registry which PDC uses to discover and catalog ML models
# ===================================================================================

MLFLOW_ENABLE_REGISTRY=true               # Enable MLflow Model Registry feature
                                           # REQUIRED: Pentaho Data Catalog needs Model Registry to function
                                           # Allows: model versioning, stage management, model discovery

MLFLOW_REGISTRY_STORE_URI=postgresql://${PG_USER}:${PG_PASSWORD}@db:${PG_PORT}/${PG_DATABASE}
                                           # Database connection for Model Registry metadata
                                           # Same as MLFLOW_REGISTRY_URI but used in different contexts
                                           # Stores: registered models, model versions, model stages, model tags

# ===================================================================================
# Configuration Summary for Pentaho Data Catalog Integration:
# 
# What PDC Connects To:
# - MLflow REST API: http://your-host:5000
# - MLflow Database: postgresql://mlflow:mlflow@your-host:5435/mlflow
# 
# What PDC Discovers:
# - Experiments and runs (from tracking database)
# - Registered models and versions (from Model Registry)  
# - Model artifacts (served through MLflow from MinIO)
# - Parameters, metrics, and tags (from tracking database)
#
# PDC Configuration (on PDC server):
# - Server Type: MlFlow
# - Server URL: http://your-host:5000  
# - No authentication required (MLflow server is open)
# ===================================================================================

# ===================================================================================
# Security Notes:
# 
# Current Setup (Development/Internal):
# - No authentication on MLflow server (open access)
# - MinIO accessible within Docker network only
# - PostgreSQL accessible externally on port 5435
# 
# Production Recommendations:
# - Enable MLflow authentication (--auth-db or external auth)
# - Use stronger passwords and rotate credentials regularly  
# - Consider TLS/SSL for external connections
# - Restrict network access using firewall rules
# - Use Docker secrets instead of plain text passwords
# ===================================================================================

init-db.sql

Here's the init-db.sql:

-- ===================================================================================
-- PostgreSQL Database Initialization Script for MLflow with Pentaho Data Catalog
-- 
-- This script is automatically executed when the PostgreSQL container starts
-- for the first time. It prepares the database for MLflow usage including:
-- - Model Registry support (required for Pentaho Data Catalog integration)
-- - Proper user permissions for all MLflow operations
-- - Database extensions needed for advanced features
-- 
-- Execution Context:
-- - Runs as PostgreSQL superuser (postgres) during container initialization
-- - Executed via Docker volume mount: ./init-db.sql:/docker-entrypoint-initdb.d/init-db.sql
-- - Only runs on first container startup (when database is empty)
-- - Must complete successfully for container to start
-- ===================================================================================

-- ===================================================================================
-- PostgreSQL Extensions Setup
-- Install necessary extensions for MLflow advanced functionality
-- ===================================================================================

-- UUID Extension for unique identifier generation
CREATE EXTENSION IF NOT EXISTS "uuid-ossp";
-- Purpose: Provides functions for generating UUIDs (Universally Unique Identifiers)
-- Used by: MLflow for generating unique IDs for experiments, runs, and models
-- Functions provided:
--   - uuid_generate_v1(): MAC address-based UUIDs
--   - uuid_generate_v4(): Random UUIDs (most commonly used by MLflow)
-- Benefits:
--   - Ensures globally unique identifiers across distributed systems
--   - Required for Model Registry unique model version IDs
--   - Prevents ID collisions in multi-instance deployments
-- Note: IF NOT EXISTS prevents errors if extension is already installed

-- ===================================================================================
-- MLflow Table Creation Strategy
-- MLflow will automatically create its own tables, but we prepare the environment
-- ===================================================================================

-- MLflow Automatic Table Creation:
-- When MLflow server starts, it automatically creates these tables:
--
-- Experiment Tracking Tables:
-- - experiments: Experiment metadata and configuration  
-- - runs: Individual experiment runs with start/end times
-- - metrics: Numeric metrics logged during runs (accuracy, loss, etc.)
-- - params: Parameters/hyperparameters for runs (learning_rate, epochs, etc.)
-- - tags: Key-value metadata tags for experiments and runs
-- - latest_metrics: Optimized view of most recent metric values
--
-- Model Registry Tables (REQUIRED for Pentaho Data Catalog):
-- - registered_models: Model registry entries with names and descriptions
-- - model_versions: Specific versions of registered models
-- - model_version_tags: Tags associated with model versions  
-- - registered_model_tags: Tags associated with registered models
-- - model_version_aliases: Aliases for model versions (e.g., "champion", "challenger")
--
-- Additional Tables:
-- - experiments_tags: Tags associated with experiments
-- - dataset_inputs: Dataset inputs for runs (data lineage)
-- - input_tags: Tags for dataset inputs
-- - trace_info: Tracing information for MLflow deployments
-- - trace_data: Detailed trace data
--
-- Database Migration:
-- MLflow uses Alembic for database schema migrations
-- Schema automatically upgrades when MLflow server starts
-- Migrations are logged during container startup

-- ===================================================================================
-- Database-Level Permissions  
-- Grant comprehensive access to the MLflow database for the mlflow user
-- ===================================================================================

-- Full database access for MLflow operations
GRANT ALL PRIVILEGES ON DATABASE mlflow TO mlflow;
-- Permissions granted:
-- - CONNECT: Connect to the database
-- - CREATE: Create new schemas and objects
-- - TEMPORARY: Create temporary tables and objects
-- - ALL: Complete database-level access
-- 
-- Why necessary:
-- - MLflow needs to create/modify tables during startup
-- - Database migrations require DDL (Data Definition Language) privileges
-- - Experiment logging requires DML (Data Manipulation Language) privileges
-- - Model Registry operations need full CRUD access

-- ===================================================================================
-- Schema-Level Permissions
-- Grant access to the public schema where MLflow creates its tables
-- ===================================================================================

-- Public schema access for MLflow user
GRANT ALL PRIVILEGES ON SCHEMA public TO mlflow;
-- Permissions granted:
-- - USAGE: Access objects within the schema
-- - CREATE: Create new objects in the schema
-- - ALL: Complete schema-level access
--
-- Public schema context:
-- - Default schema in PostgreSQL databases
-- - MLflow creates all tables in public schema by default
-- - Required for table creation and access

-- ===================================================================================
-- Table-Level Permissions (Existing Objects)
-- Grant access to any tables that already exist in the public schema
-- ===================================================================================

-- Access to all existing tables in public schema
GRANT ALL PRIVILEGES ON ALL TABLES IN SCHEMA public TO mlflow;
-- Permissions granted:
-- - SELECT: Read data from tables
-- - INSERT: Add new rows to tables  
-- - UPDATE: Modify existing rows in tables
-- - DELETE: Remove rows from tables
-- - TRUNCATE: Remove all rows from tables
-- - REFERENCES: Create foreign key constraints
-- - TRIGGER: Create triggers on tables
-- - ALL: Complete table-level access
--
-- Scope: Applies to tables that exist at the time this command runs
-- Note: This mainly covers edge cases since MLflow creates its own tables

-- ===================================================================================
-- Sequence-Level Permissions (Existing Objects)  
-- Grant access to any sequences that already exist in the public schema
-- ===================================================================================

-- Access to all existing sequences in public schema  
GRANT ALL PRIVILEGES ON ALL SEQUENCES IN SCHEMA public TO mlflow;
-- Permissions granted:
-- - USAGE: Use NEXTVAL, CURRVAL functions on sequences
-- - SELECT: Read sequence values
-- - UPDATE: Modify sequence values (SETVAL)
-- - ALL: Complete sequence-level access
--
-- Sequences usage:
-- - PostgreSQL uses sequences for auto-incrementing columns (SERIAL, BIGSERIAL)
-- - MLflow tables use sequences for primary key generation
-- - Required for INSERT operations on tables with auto-increment IDs

-- ===================================================================================
-- Default Privileges for Future Objects
-- Ensure MLflow user has access to objects created in the future
-- ===================================================================================

-- Automatic permissions for future tables
ALTER DEFAULT PRIVILEGES IN SCHEMA public GRANT ALL ON TABLES TO mlflow;
-- Purpose: Automatically grant permissions when new tables are created
-- Applies to: Any table created in the public schema after this command
-- Permissions: Same as GRANT ALL PRIVILEGES ON ALL TABLES (see above)
-- 
-- Why critical for MLflow:
-- - MLflow creates tables dynamically during operation
-- - Database migrations add new tables and columns
-- - Without this, newly created tables would be inaccessible to mlflow user
-- - Prevents permission-related failures during MLflow operation

-- Automatic permissions for future sequences
ALTER DEFAULT PRIVILEGES IN SCHEMA public GRANT ALL ON SEQUENCES TO mlflow;  
-- Purpose: Automatically grant permissions when new sequences are created
-- Applies to: Any sequence created in the public schema after this command
-- Permissions: Same as GRANT ALL PRIVILEGES ON ALL SEQUENCES (see above)
--
-- Why critical for MLflow:
-- - New tables often include auto-incrementing columns with sequences
-- - Database migrations may create new sequences
-- - Ensures INSERT operations work on new tables with SERIAL columns

-- ===================================================================================
-- Pentaho Data Catalog Integration Considerations
-- ===================================================================================

-- Model Registry Requirements:
-- - Pentaho Data Catalog requires MLflow Model Registry to be enabled
-- - Model Registry tables store registered models, versions, and stages  
-- - These permissions ensure PDC can discover and catalog ML models
-- - PDC connects to MLflow via REST API, which queries these tables
--
-- Database Access Patterns:
-- - MLflow REST API: Queries all tables for experiment and model data
-- - Direct PDC Access: PDC may query database directly for enhanced metadata
-- - Model Governance: Model Registry tables support PDC governance workflows
--
-- Performance Considerations:
-- - Full privileges enable MLflow to create indexes for performance
-- - Allows creation of materialized views for complex queries
-- - Supports database-level optimizations for large datasets

-- ===================================================================================
-- Security and Production Considerations
-- ===================================================================================

-- Current Configuration (Development/Internal):
-- - MLflow user has full database privileges (suitable for single-tenant)
-- - No row-level security or column-level restrictions
-- - Appropriate for internal/development environments
--
-- Production Recommendations:
-- - Consider more granular permissions if sharing database with other applications
-- - Implement connection pooling for better resource management
-- - Regular database backups to prevent data loss
-- - Monitor database performance and storage usage
-- - Consider read replicas for analytics workloads (like Pentaho Data Catalog queries)
--
-- Multi-tenant Considerations:
-- - If hosting multiple MLflow instances, consider separate databases or schemas
-- - Implement role-based access control for different user types
-- - Consider audit logging for compliance requirements

-- ===================================================================================
-- Troubleshooting Common Issues
-- ===================================================================================

-- Permission Errors:
-- - If MLflow fails with "permission denied" errors, verify these grants completed
-- - Check MLflow logs for specific permission issues
-- - Ensure database connection string uses correct username/password
--
-- Table Creation Failures:
-- - If MLflow can't create tables, verify GRANT ALL ON DATABASE succeeded
-- - Check disk space on PostgreSQL data volume
-- - Verify PostgreSQL container has sufficient memory
--
-- Model Registry Issues:
-- - If Model Registry features don't work, ensure uuid-ossp extension loaded
-- - Verify registered_models and model_versions tables exist after MLflow startup
-- - Check MLflow server logs for Model Registry initialization messages
-- ===================================================================================

Dockerfile

Used to build & configure your MLflow instance ..

Here's the Dockerfile:

# ===================================================================================
# MLflow Server Dockerfile with Pentaho Data Catalog Integration
# 
# This Dockerfile creates a custom MLflow server image that includes:
# - MLflow tracking server with Model Registry support
# - Database connectivity (PostgreSQL)
# - S3-compatible storage support (MinIO/AWS S3)
# - Machine learning libraries for compatibility
# - Pentaho Data Catalog integration helpers
# - Service dependency management (wait for DB/MinIO)
# ===================================================================================

# ===================================================================================
# Base Image Selection
# Using Python 3.10 slim image for optimal balance of features and size
# ===================================================================================
FROM python:3.10-slim
# python:3.10-slim provides:
# - Python 3.10 runtime (compatible with latest MLflow)
# - Debian-based Linux (stable, well-supported)
# - Minimal footprint (excludes unnecessary packages)
# - Good security posture (fewer attack vectors)

# ===================================================================================
# Build Environment Configuration
# Suppress interactive package configuration prompts during build
# ===================================================================================
ENV DEBIAN_FRONTEND=noninteractive
# This prevents package installation from hanging on configuration prompts
# Only affects the Docker build process, not runtime behavior
# Reset to default after package installation to maintain normal behavior

# ===================================================================================
# System Dependencies Installation
# Install essential system packages required for MLflow and integrations
# ===================================================================================
RUN apt-get update && apt-get install -y \
    curl \
    gcc \
    g++ \
    netcat-traditional \
    && rm -rf /var/lib/apt/lists/*
# Package breakdown:
# - curl: HTTP client for health checks and API calls
# - gcc: C compiler for building Python packages with native extensions
# - g++: C++ compiler for building packages like psycopg2
# - netcat-traditional: Network utility for checking service availability
# - rm -rf /var/lib/apt/lists/*: Cleanup to reduce image size

# Reset environment to default for normal runtime behavior
ENV DEBIAN_FRONTEND=

# ===================================================================================
# Python Dependencies Installation  
# Install MLflow and all required Python packages for full functionality
# ===================================================================================
RUN pip install --no-cache-dir \
    mlflow[extras] \
    scikit-learn \
    pandas \
    numpy \
    matplotlib \
    psycopg2-binary \
    boto3 \
    cryptography \
    pymysql \
    requests \
    sqlalchemy \
    alembic
# Package categories and purposes:
#
# Core MLflow:
# - mlflow[extras]: MLflow with additional plugins and integrations
#
# Machine Learning Libraries (for broad model compatibility):
# - scikit-learn: Popular ML library, commonly used with MLflow
# - pandas: Data manipulation and analysis library
# - numpy: Numerical computing library (dependency for most ML packages)
# - matplotlib: Plotting library for generating charts/visualizations
#
# Database Connectivity:
# - psycopg2-binary: PostgreSQL adapter for Python (for Model Registry)
# - pymysql: MySQL adapter for Python (alternative database support)
# - sqlalchemy: SQL toolkit and ORM (used by MLflow for database operations)
# - alembic: Database migration tool (used by MLflow for schema management)
#
# Cloud Storage & Security:
# - boto3: AWS SDK for Python (S3 and MinIO compatibility)
# - cryptography: Cryptographic libraries for secure connections
#
# HTTP & Integration:
# - requests: HTTP library for API calls (Pentaho integration)
#
# Build optimization:
# - --no-cache-dir: Don't cache downloaded packages (reduces image size)

# ===================================================================================
# Application Directory Setup
# Create dedicated directory for MLflow application and scripts
# ===================================================================================
WORKDIR /app
# Sets /app as the working directory for subsequent commands
# All relative paths in COPY, RUN commands will be relative to /app
# MLflow server will run from this directory

# ===================================================================================
# Application Files Integration
# Copy custom integration scripts into the container
# ===================================================================================
COPY pentaho_integration.py /app/
# Pentaho Data Catalog integration helper script
# Contains: PDC connection logging, integration setup functions
# Purpose: Initialize PDC integration and provide connection information

COPY entrypoint.sh /app/
# Custom startup script that orchestrates service initialization
# Functions:
# - Wait for PostgreSQL database to be ready
# - Wait for MinIO object storage to be ready  
# - Initialize Pentaho Data Catalog integration
# - Start MLflow server with proper configuration

# ===================================================================================
# File Permissions Configuration
# Ensure startup script has execute permissions
# ===================================================================================
RUN chmod +x /app/entrypoint.sh
# Makes entrypoint.sh executable so it can be run as the container's main process
# Required for proper container startup

# ===================================================================================
# Network Port Configuration
# Expose MLflow server port for external access
# ===================================================================================
EXPOSE 5000
# Documents that the container listens on port 5000
# MLflow tracking server default port
# Used by:
# - Web UI access (http://host:5000)
# - REST API calls (for logging experiments, querying data)
# - Pentaho Data Catalog connections
# Note: This is documentation only; actual port mapping happens in docker-compose.yml

# ===================================================================================
# Container Startup Configuration
# Set custom entrypoint script as the main container process
# ===================================================================================
ENTRYPOINT ["/app/entrypoint.sh"]
# Starts the custom entrypoint script when container launches
# The entrypoint script will:
# 1. Wait for dependent services (PostgreSQL, MinIO)
# 2. Initialize Pentaho Data Catalog integration
# 3. Start MLflow server with full configuration
# 4. Handle graceful shutdown signals
#
# Alternative approaches not used:
# - CMD ["mlflow", "server", ...]: Would start MLflow directly without dependency checks
# - ENTRYPOINT ["mlflow"] + CMD [...]: Would bypass custom initialization logic

# ===================================================================================
# Runtime Behavior Summary
# 
# When this container starts:
# 1. entrypoint.sh executes as PID 1
# 2. Script waits for database and MinIO to be ready
# 3. Pentaho integration is initialized (logging/setup)
# 4. MLflow server starts with full configuration:
#    - Backend store: PostgreSQL (experiments, runs, Model Registry)
#    - Artifact store: MinIO S3-compatible storage
#    - Model Registry: Enabled (required for Pentaho Data Catalog)
#    - REST API: Available on port 5000
#    - Web UI: Available on port 5000
# 5. Container remains running until stopped
#
# Health checks and monitoring:
# - MLflow provides /health endpoint for monitoring
# - Service dependencies managed through Docker Compose
# - Logs available via docker compose logs tracking_server
# ===================================================================================

# ===================================================================================
# Integration Architecture Summary
#
# This container provides:
# - MLflow Tracking Server (experiments, runs, parameters, metrics)
# - MLflow Model Registry (model versions, stages, tags)
# - REST API endpoints for all MLflow functionality
# - Artifact serving (models, plots, datasets from MinIO)
# - Database connectivity (PostgreSQL for metadata)
# - S3-compatible storage (MinIO for artifacts)
#
# Pentaho Data Catalog Integration:
# - PDC connects to this container's REST API (port 5000)
# - PDC discovers experiments, models, runs through MLflow API
# - PDC can access Model Registry for governance workflows
# - No authentication required (container designed for internal network)
# - All ML metadata available for cataloging and discovery
# ===================================================================================

entrypoint.sh

Here's the entrypoint.sh:

#!/bin/bash

# ===================================================================================
# MLflow Server Entrypoint Script with Pentaho Data Catalog Integration
# 
# This script orchestrates the startup of MLflow server within a Docker container
# ensuring all dependencies are ready before starting the main service.
#
# Responsibilities:
# - Wait for PostgreSQL database to be ready
# - Wait for MinIO object storage to be ready
# - Initialize Pentaho Data Catalog integration
# - Start MLflow server with full configuration
# - Handle graceful shutdown (via exec)
# ===================================================================================

# ===================================================================================
# Startup Announcement
# Provide clear indication that the container is starting with PDC integration
# ===================================================================================
echo "Starting MLflow server with Pentaho Data Catalog integration..."
# This message appears in container logs and helps identify the custom startup process
# Useful for debugging and monitoring container initialization

# ===================================================================================
# Database Dependency Check
# Wait for PostgreSQL to be ready before proceeding
# ===================================================================================
echo "Waiting for database to be ready..."

# Service availability check using netcat (nc)
while ! nc -z db 5432; do
    sleep 1
done
# Loop breakdown:
# - nc -z db 5432: Check if port 5432 is open on host 'db' (zero I/O mode)
# - db: Docker container hostname for PostgreSQL service
# - 5432: Standard PostgreSQL port (internal container port)
# - while ! [...]: Continue looping while the command fails (port not open)
# - sleep 1: Wait 1 second between checks to avoid overwhelming the network
# 
# Why this is necessary:
# - PostgreSQL needs time to initialize database files
# - MLflow requires database connection immediately on startup
# - Without this check, MLflow would fail with connection errors
# - Docker depends_on only waits for container start, not service readiness

echo "Database is ready!"
# Confirmation message for successful database connectivity
# Appears in logs when PostgreSQL is accepting connections

# ===================================================================================
# MinIO Object Storage Dependency Check  
# Wait for MinIO S3-compatible storage to be ready
# ===================================================================================
echo "Waiting for MinIO to be ready..."

# Service availability check for MinIO API
while ! nc -z s3 9000; do
    sleep 1
done
# Loop breakdown:
# - nc -z s3 9000: Check if port 9000 is open on host 's3' (zero I/O mode)  
# - s3: Docker container hostname for MinIO service
# - 9000: MinIO S3 API port (internal container port)
# - Same loop logic as database check above
#
# Why this is necessary:
# - MinIO needs time to initialize storage backend
# - MLflow artifact logging requires immediate S3 API access
# - Without this check, artifact storage would fail
# - Bucket creation service also depends on MinIO being ready

echo "MinIO is ready!"
# Confirmation message for successful MinIO connectivity
# Appears in logs when MinIO S3 API is accepting requests

# ===================================================================================
# Pentaho Data Catalog Integration Initialization
# Run custom Python script to set up PDC integration
# ===================================================================================
echo "Initializing Pentaho Data Catalog integration..."

# Execute Pentaho integration helper script
python /app/pentaho_integration.py
# Script functions:
# - Log PDC connection information for reference
# - Display configuration instructions for PDC setup
# - Validate integration environment variables
# - Prepare any PDC-specific initialization
#
# Note: This script runs synchronously and must complete before MLflow starts
# If the script fails, the container startup will be interrupted
# The script is non-blocking for MLflow functionality (PDC connects TO MLflow)

# ===================================================================================
# MLflow Server Startup
# Launch MLflow tracking server with complete configuration
# ===================================================================================
echo "Starting MLflow tracking server..."

# Start MLflow server with comprehensive configuration
exec mlflow server \
    --backend-store-uri postgresql://${PG_USER}:${PG_PASSWORD}@db:5432/${PG_DATABASE} \
    --registry-store-uri postgresql://${PG_USER}:${PG_PASSWORD}@db:5432/${PG_DATABASE} \
    --default-artifact-root s3://${MLFLOW_BUCKET_NAME} \
    --host 0.0.0.0 \
    --port 5000 \
    --serve-artifacts \
    --artifacts-destination s3://${MLFLOW_BUCKET_NAME}

# ===================================================================================
# MLflow Server Configuration Breakdown
# ===================================================================================

# Database Configuration:
# --backend-store-uri postgresql://${PG_USER}:${PG_PASSWORD}@db:5432/${PG_DATABASE}
# Purpose: Stores experiment metadata (runs, parameters, metrics, tags)
# Connection details:
# - ${PG_USER}: PostgreSQL username from environment variable
# - ${PG_PASSWORD}: PostgreSQL password from environment variable  
# - db: Docker container hostname for PostgreSQL service
# - 5432: Internal PostgreSQL port (standard)
# - ${PG_DATABASE}: Database name from environment variable
# 
# Data stored: experiments, runs, parameters, metrics, tags, experiment metadata

# Model Registry Configuration:
# --registry-store-uri postgresql://${PG_USER}:${PG_PASSWORD}@db:5432/${PG_DATABASE}  
# Purpose: Stores Model Registry metadata (registered models, versions, stages)
# Connection: Same database as backend store (common configuration)
# Data stored: registered models, model versions, model stages, model tags
# CRITICAL: Required for Pentaho Data Catalog integration

# Artifact Storage Configuration:
# --default-artifact-root s3://${MLFLOW_BUCKET_NAME}
# Purpose: Default location for storing run artifacts
# Format: S3 URI pointing to MinIO bucket
# ${MLFLOW_BUCKET_NAME}: Bucket name from environment variable
# Artifacts stored: models, plots, datasets, logs, any logged files

# Network Configuration:  
# --host 0.0.0.0
# Purpose: Bind MLflow server to all network interfaces
# Effect: Allows connections from outside the container
# Required for: Docker port mapping, external API access, PDC connections

# --port 5000  
# Purpose: MLflow server listening port
# Standard: MLflow default port
# Mapped to: External port via Docker Compose port configuration

# Artifact Serving Configuration:
# --serve-artifacts
# Purpose: Enable MLflow server to serve artifacts directly
# Benefit: Artifacts accessible through MLflow API without direct S3 access
# Required for: Web UI artifact downloads, API artifact retrieval

# --artifacts-destination s3://${MLFLOW_BUCKET_NAME}
# Purpose: Destination for artifact uploads through MLflow server
# Same as default-artifact-root but used for server-mediated uploads
# Enables: Artifact uploads through MLflow API instead of direct S3 access

# ===================================================================================
# Process Management Notes
# ===================================================================================

# exec command usage:
# - exec replaces the shell process with MLflow server process
# - MLflow server becomes PID 1 in the container
# - Ensures proper signal handling (SIGTERM, SIGINT)
# - Enables graceful shutdown when container is stopped
# - Without exec: shell remains as PID 1, MLflow becomes child process

# Signal handling:
# - Docker stop sends SIGTERM to PID 1
# - MLflow server can handle SIGTERM gracefully (close connections, flush data)
# - Container shutdown is clean and doesn't corrupt data

# ===================================================================================
# Integration Architecture Summary
# ===================================================================================

# Service Dependencies (enforced by this script):
# 1. PostgreSQL must be ready (accepting connections on port 5432)
# 2. MinIO must be ready (accepting API requests on port 9000)  
# 3. Pentaho integration initialization must complete
# 4. Then MLflow server starts with full configuration

# External Integration Points:
# - MLflow REST API: Available on port 5000 for PDC connections
# - PostgreSQL: Direct database access available for PDC queries
# - Artifact Storage: Accessible through MLflow API for PDC artifact discovery
# - Model Registry: Available through MLflow API for PDC model governance

# Pentaho Data Catalog Integration:
# - PDC connects to MLflow REST API (http://container:5000)
# - PDC discovers experiments, runs, models through API calls
# - PDC can query Model Registry for governance workflows
# - PDC can access artifact metadata for cataloging
# - No authentication required (internal network deployment)
# ===================================================================================

#!/usr/bin/env python3

# ===================================================================================
# Pentaho Data Catalog Integration Helper for MLflow
# 
# This script provides integration support between MLflow and Pentaho Data Catalog.
# It runs during MLflow container startup to:
# - Initialize PDC integration environment
# - Log connection information for PDC configuration
# - Provide setup instructions and troubleshooting guidance
# - Validate integration configuration
#
# Architecture: 
# - MLflow serves as the ML metadata source
# - Pentaho Data Catalog connects TO MLflow (not vice versa)
# - This script is informational/setup helper, not a live integration
# ===================================================================================

"""
Pentaho Data Catalog Integration Helper for MLflow

This module provides helper functions and classes for integrating MLflow
with Pentaho Data Catalog (PDC). The integration enables PDC to discover,
catalog, and govern machine learning models, experiments, and artifacts
stored in MLflow.

Key Features:
- Connection validation and logging
- Configuration guidance for PDC setup
- Integration troubleshooting helpers
- Environment validation

Integration Flow:
1. MLflow server starts with Model Registry enabled
2. This script logs connection information
3. PDC administrator configures external data source
4. PDC connects to MLflow REST API for ML asset discovery
"""

# ===================================================================================
# Import Dependencies
# Core Python and third-party libraries for integration functionality
# ===================================================================================

import os                    # Environment variable access for configuration
import requests             # HTTP client for potential API calls (future use)
import logging              # Structured logging for integration events
from typing import Dict, Any, Optional  # Type hints for better code documentation
import mlflow              # MLflow client library (for future integration features)

# ===================================================================================
# Logging Configuration
# Set up structured logging for integration events and troubleshooting
# ===================================================================================

# Configure logging with INFO level for detailed startup information
logging.basicConfig(level=logging.INFO)
# Benefits of INFO level:
# - Shows integration initialization progress
# - Displays connection information for PDC setup
# - Provides troubleshooting information in container logs
# - Visible in docker compose logs tracking_server

# Create module-specific logger for organized log messages
logger = logging.getLogger(__name__)
# Logger naming convention:
# - Uses module name (__name__) for log source identification
# - Enables filtering and routing of integration-specific logs
# - Separates integration logs from MLflow server logs

# ===================================================================================
# Pentaho Data Catalog Integration Client Class
# Main class for managing PDC integration functionality
# ===================================================================================

class PentahoDataCatalogClient:
    """
    Client for integrating with Pentaho Data Catalog
    
    This class manages the connection and integration between MLflow and 
    Pentaho Data Catalog. It handles configuration validation, connection
    logging, and provides guidance for PDC administrators.
    
    Note: This is primarily a helper/informational class. The actual
    integration happens when PDC connects to MLflow's REST API.
    
    Attributes:
        pdc_url (str): Pentaho Data Catalog server URL
        username (str): PDC username (for reference/logging)
        password (str): PDC password (for reference/logging)
        session (requests.Session): HTTP session for potential future API calls
    """
    
    def __init__(self):
        """
        Initialize Pentaho Data Catalog client with environment configuration
        
        Reads configuration from environment variables set in docker-compose.yml.
        These variables are primarily used for logging and reference purposes,
        as PDC connects TO MLflow rather than MLflow connecting to PDC.
        """
        
        # Load PDC server configuration from environment variables
        self.pdc_url = os.getenv('PENTAHO_DATA_CATALOG_URL')
        # Purpose: PDC server URL for reference and logging
        # Example: http://pdc-pentaho.lab:80
        # Usage: Displayed in logs for administrator reference
        
        self.username = os.getenv('PENTAHO_DATA_CATALOG_USERNAME')  
        # Purpose: PDC username for reference and documentation
        # Example: [email protected]
        # Usage: Logged for PDC administrator identification
        
        self.password = os.getenv('PENTAHO_DATA_CATALOG_PASSWORD')
        # Purpose: PDC password for reference (not used for authentication)
        # Example: Welcome123!
        # Usage: Available for future direct PDC API integration
        
        # Initialize HTTP session for potential future PDC API calls
        self.session = requests.Session()
        # Benefits:
        # - Reuses connections for better performance
        # - Maintains session state across requests
        # - Supports authentication and custom headers
        # - Currently unused but available for future enhancements
        
        # Log successful configuration if PDC URL is provided
        if self.pdc_url:
            logger.info(f"Pentaho Data Catalog integration configured for: {self.pdc_url}")
            # This message appears in MLflow container logs during startup
            # Helps administrators verify PDC integration is properly configured
    
    def log_mlflow_connection(self) -> bool:
        """
        Log MLflow connection information for Pentaho Data Catalog setup
        
        This method displays the connection details that PDC administrators
        need to configure the external data source connection to MLflow.
        
        Returns:
            bool: True if logging successful, False if error occurred
            
        Connection Information Provided:
        - MLflow Tracking URI: REST API endpoint for PDC connections
        - MLflow Registry URI: Database connection for direct metadata access
        """
        try:
            # Log integration readiness message
            logger.info("MLflow server is ready for Pentaho Data Catalog integration")
            
            # Display configuration header for PDC administrators
            logger.info("Configure Pentaho Data Catalog with:")
            
            # MLflow REST API endpoint for PDC external data source configuration
            logger.info(f"  - MLflow Tracking URI: http://pdc:5000")
            # Connection details:
            # - Protocol: HTTP (internal Docker network, no TLS needed)
            # - Host: pdc (Docker container hostname or actual server name)
            # - Port: 5000 (MLflow tracking server port)
            # - Usage: PDC uses this URL for REST API calls to discover ML assets
            
            # PostgreSQL database connection for enhanced PDC metadata queries
            logger.info(f"  - MLflow Registry URI: postgresql://mlflow:mlflow@pdc:5435/mlflow")
            # Connection string breakdown:
            # - Protocol: postgresql:// (PostgreSQL database connection)
            # - Username: mlflow (database user with full MLflow permissions)
            # - Password: mlflow (database password from environment)
            # - Host: pdc (Docker container hostname or actual server name)
            # - Port: 5435 (external PostgreSQL port from docker-compose.yml)
            # - Database: mlflow (database name containing MLflow tables)
            # - Usage: PDC can query database directly for enhanced metadata access
            
            return True
            # Success indicator for calling functions
            
        except Exception as e:
            # Error handling for unexpected issues during logging
            logger.error(f"Error logging connection info: {str(e)}")
            # Log any exceptions that occur during connection info display
            # Helps troubleshoot integration setup issues
            return False
            # Failure indicator for calling functions

# ===================================================================================
# Integration Setup Function
# Main orchestration function for Pentaho Data Catalog integration
# ===================================================================================

def setup_pentaho_integration():
    """
    Setup Pentaho Data Catalog integration
    
    This function orchestrates the complete PDC integration setup process:
    1. Initialize PDC client with environment configuration
    2. Log MLflow connection information for PDC setup
    3. Display step-by-step configuration instructions
    4. Return configured client for potential future use
    
    Returns:
        PentahoDataCatalogClient: Configured PDC client instance
        
    Integration Architecture:
    - MLflow provides REST API and database access
    - PDC connects as external client to discover ML assets
    - Model Registry enables PDC governance workflows
    - No authentication required for internal network deployment
    """
    
    # Initialize Pentaho Data Catalog client with environment configuration
    pdc_client = PentahoDataCatalogClient()
    # Creates client instance with PDC server details from environment variables
    # Validates configuration and logs PDC server URL if provided
    
    # Display MLflow connection information for PDC configuration
    pdc_client.log_mlflow_connection()
    # Shows connection URLs and database details that PDC administrators
    # need to configure the external data source in PDC
    
    # Log integration status and readiness
    logger.info("MLflow server configured for Pentaho Data Catalog integration")
    # Confirms that MLflow is properly configured with:
    # - Model Registry enabled (required for PDC)
    # - REST API available (for PDC discovery)
    # - Database accessible (for PDC queries)
    # - Artifact storage configured (for PDC asset management)
    
    # Display step-by-step configuration instructions for PDC administrators
    logger.info("Next steps:")
    logger.info("1. In Pentaho Data Catalog, configure ML Model Server connection")
    # PDC Configuration Step 1:
    # - Navigate to PDC Management → Data Sources
    # - Add new external data source
    # - Select "MLflow" as the server type
    
    logger.info("2. Set MLflow Tracking URI to your Docker host MLflow server")
    # PDC Configuration Step 2:
    # - Use the MLflow Tracking URI displayed above
    # - Example: http://pdc:5000
    # - This enables PDC to discover experiments, runs, and models
    
    logger.info("3. Set Registry Store URI to your PostgreSQL database")
    # PDC Configuration Step 3:
    # - Use the MLflow Registry URI displayed above
    # - Example: postgresql://mlflow:mlflow@pdc:5435/mlflow
    # - This enables PDC to access Model Registry metadata
    
    logger.info("4. Import ML model server components from MLflow")
    # PDC Configuration Step 4:
    # - In PDC Management → Synchronize
    # - Find the configured MLflow server
    # - Click "Import" to sync ML assets
    # - PDC will discover and catalog experiments, models, versions, runs
    
    # Return configured client for potential future use
    return pdc_client
    # Makes client available for:
    # - Additional integration functionality
    # - Future API calls to PDC
    # - Integration monitoring and validation

# ===================================================================================
# Script Entry Point
# Execute integration setup when script is run directly
# ===================================================================================

if __name__ == "__main__":
    """
    Main execution block for direct script execution
    
    This block runs when the script is executed directly (not imported).
    It's called from the Docker container entrypoint during MLflow startup.
    
    Execution Context:
    - Runs inside MLflow Docker container during startup
    - Called from entrypoint.sh after service dependencies are ready
    - Executes before MLflow server starts
    - Logs appear in docker compose logs tracking_server
    """
    
    # Execute complete Pentaho Data Catalog integration setup
    setup_pentaho_integration()
    # This function call:
    # 1. Initializes PDC client with environment configuration
    # 2. Logs connection information for PDC administrators
    # 3. Displays step-by-step setup instructions
    # 4. Prepares MLflow for PDC integration
    
    # Script completion - MLflow container startup continues
    # Next step: entrypoint.sh starts MLflow server with full configuration

# ===================================================================================
# Integration Summary and Architecture Notes
# ===================================================================================

"""
Pentaho Data Catalog Integration Architecture:

MLflow Side (This Container):
├── REST API (port 5000)
│   ├── Experiments and Runs Discovery
│   ├── Model Registry Access  
│   ├── Artifact Metadata
│   └── Parameter and Metrics Queries
├── PostgreSQL Database (port 5435)
│   ├── Direct metadata access for PDC
│   ├── Model Registry tables
│   ├── Experiment tracking tables
│   └── Enhanced query capabilities
└── MinIO Artifact Storage
    ├── Model files and versions
    ├── Experiment artifacts
    ├── Plots and visualizations
    └── Dataset files

Pentaho Data Catalog Side:
├── External Data Source Configuration
│   ├── MLflow Server Type
│   ├── Tracking URI: http://pdc:5000
│   └── Registry URI: postgresql://mlflow:mlflow@pdc:5435/mlflow
├── ML Models Hierarchy
│   ├── Imported MLflow servers
│   ├── Discovered experiments
│   ├── Cataloged models and versions
│   └── Governance workflows
└── Synchronization Process
    ├── Periodic ML asset discovery
    ├── Metadata import from MLflow
    ├── Model lifecycle tracking
    └── Compliance and governance

Integration Benefits:
- Centralized ML asset discovery and cataloging
- Model governance and lifecycle management
- Compliance tracking and audit trails  
- Enhanced ML metadata search and discovery
- Integration with broader data governance workflows
"""

Everything is now in place to build the MLflow tracking_server:

docker compose build tracking_server

minIO

Log into minIO:

https://pdc.pentaho.lab:9001pdc.pentaho.lab

Username: minio_user

Password: minio_pwd

“The community version is now limited to being an object browser only, with deprecated support for accounts & policies management, bucket management, configuration management, lifecycle & tiers management, and site replication.

minio:RELEASE.2025-04-22T22-12-26Z  # Last version with full admin UI

Use a community fork: OpenMaxIO is a community-maintained fork created to preserve the full functionality that was removed from MinIO CE

Create an access key:

Your access key and secret key are not saved until you click the Create button.

Download for import & update the config.env file - save

# MinIO access keys - these are needed by MLflow
MINIO_ACCESS_KEY=YuA9IVoTrkIAgiLb8Mfv
MINIO_SECRET_ACCESS_KEY=WlR5eRpiIhZGLVSTLPtyzJakMWRUNf1KvqMhvLAA

(Optional) Create a bucket: mlflow-artefacts.

The docker-compose.yml will hopefully create the mlflow-artefacts bucket.

MLflow Tracking

The first time you fired up these services, the MLflow Tracking services will not work as you'll need to login into MinIO UI, create your keys and mlflow bucket that the tracking service will use to store artifacts.

Stop & remove your containers

cd
cd ~/MLflow
docker compose down

Restart the containers:

cd
cd ~/MLflow
docker compose up -d --build

You can connect a machine learning (ML) server to Data Catalog and import ML model server components, including ML models, experiments, versions, and runs into the ML Models hierarchy within Data Catalog.

Edit the file external-data-source-config.yml :

cd
sudo nano /opt/pentaho/pdc-docker-deployment/conf/external-datasource/external-data-source-config.yml

Add the MLflow server configuration:

servers:
   - id: {SERVER_ID}
     name: {SERVER_NAME}
     type: {SERVER_TYPE}
     url: {SERVER_URL}
config:
     username: {username}
     password: {password}
     access_token: {access_token}

Parameter

Description

Value

Unique identifier (UUID) for the ML server.

Generate a UUID or

name

Display name for the server. This name appears in the UI.

MLflow Production Server

type

Type of server (enum value). For ML server, use ‘MlFlow’.

MlFlow

URL

The base URL of the ML server.

http://pdc:5000

username

MLflow username

Not required

password

MLflow password

Not required

access_token

Generated access_token

Not required

After configuring the ML server in the YAML file, restart the PDC services to apply the changes.

You have successfully configured the ML server in Data Catalog as an external data source. It appears under the Synchronize card in the Management section of Data Catalog.