MinIO
Access an S3-type object store through VFS.
Workshop - MinIO
MinIO is a high-performance, Kubernetes-native object storage system designed for cloud-native applications. Built from the ground up to be compatible with Amazon S3, MinIO offers a lightweight yet powerful alternative for organizations looking to deploy object storage in their own infrastructure.
At its core, MinIO provides high-performance distributed object storage. It's capable of handling millions of operations per second and can store petabytes of data while maintaining sub-millisecond latency. This performance is achieved through a simplified architecture that eliminates complex dependencies and optimizes for modern hardware capabilities.
One of MinIO's key strengths lies in its versatility. It can be deployed virtually anywhere - from bare metal servers to public, private, and edge cloud environments. Organizations particularly value its seamless integration with Kubernetes, making it an ideal choice for containerized environments. MinIO's open-source nature also provides transparency and flexibility that many enterprises require for their data infrastructure needs.

Please ensure you have completed the following setup: MinIO
To create a new Transformation:
By clicking File > New > Transformation
By using the CTRL+N hotkey
Either action opens a new Transformation tab for you to begin designing your transformation.
Log into MinIO.
Username: minioadmin
Password: minioadmin
If you have completed the MinIO setup, you should have pre-populated buckets containing various data objects in different formats.

New Bucket
If you need to create a Bucket:
Click the 'Create Bucket' link.
Enter the bucket name sales-data and click 'Create Bucket'.

Click on the Upload button.

Upload your data - for example, some sales data:
Windows
C:\Pentaho\design-tools\data-integration\samples\transformations\files
Linux
~/Pentaho/design-tools/data-integration/samples/transformations/files
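Alternatively, if you have the MinIO client (mc) installed, the bucket can be created and populated from the command line. This is only a sketch: the alias name 'local' is an arbitrary choice, and the source path assumes the Linux sample location above.
mc alias set local http://localhost:9000 minioadmin minioadmin
mc mb local/sales-data
mc cp --recursive ~/Pentaho/design-tools/data-integration/samples/transformations/files/ local/sales-data/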

Virtual File Systems
PDI allows you to establish connections to most Virtual File Systems (VFS) through VFS connections. These connections store the necessary properties to access specific file systems, eliminating the need to repeatedly enter configuration details.
Once you've added a VFS connection in PDI, you can reference it whenever you need to work with files or folders on that Virtual File System. This streamlines your workflow by allowing you to reuse connection information across multiple steps.
For instance, if you're working with Hitachi Content Platform (HCP), you can create a single VFS connection and then use it throughout all HCP transformation steps. This approach saves time and ensures consistency by removing the need to re-enter credentials or access information for each data operation.
Start Pentaho Data Integration.
Windows - PowerShell
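Assuming the install path used elsewhere in this guide:
cd C:\Pentaho\design-tools\data-integration
.\Spoon.bat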
Linux
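Assuming the equivalent Linux install path:
cd ~/Pentaho/design-tools/data-integration
./spoon.sh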
Create a VFS connection to the MinIO buckets
Click: 'View' Tab.
Right-click on VFS Connections > New.

Enter the following details:

Connection Name: MinIO
Connection Type: Minio/HCP
Description: Connection to sales-data bucket
S3 Connection Type: Minio/HCP
Access Key: minioadmin
Secret Key: minioadmin
Endpoint: http://localhost:9000 (MinIO API endpoint)
Signature Version: AWSS3V4SignerType
Path Style Access: enabled
Root Folder Path: /
Test the connection.
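Once the test succeeds, objects in MinIO can be addressed with paths of the form pvfs://MinIO/<bucket>/<object>, for example pvfs://MinIO/raw-data/csv/customers.csv, which is used later in this workshop.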
Workshops
Pentaho Data Integration: MinIO Object Storage Workshop Series
Modern organizations increasingly store their data in cloud-native object storage systems like MinIO and Amazon S3, moving away from traditional file servers and databases. This architectural shift enables scalable, cost-effective data lakes but introduces new challenges: data arrives in multiple formats (CSV, JSON, XML, Parquet), exists across distributed buckets, and requires sophisticated transformation pipelines to unlock its analytical value. Learning to efficiently extract, transform, and integrate data from object storage is now essential for any data integration professional working with contemporary data architectures.
In this comprehensive workshop series, you'll build progressively complex transformation pipelines that leverage MinIO object storage as both a source and destination for enterprise data integration scenarios. Starting with fundamental ETL patterns like denormalized fact table creation, you'll advance through intermediate challenges involving multi-format parsing and reconciliation, ultimately mastering advanced techniques like sessionization, anomaly detection, and schema normalization across heterogeneous data sources.
Each workshop introduces real-world business scenarios - from sales dashboards to customer analytics to operational monitoring - demonstrating PDI's versatility in solving diverse integration challenges while maintaining cloud-native architecture principles.
What You'll Accomplish:
Configure VFS (Virtual File System) connections to access S3-compatible MinIO object storage
Build multi-source ETL pipelines using Text File Input steps with S3 paths (s3a://)
Implement Stream Lookup and Merge Join patterns to enrich data from multiple CSV sources
Parse semi-structured formats including XML inventory feeds and JSONL event streams
Apply full outer joins to identify discrepancies between warehouse and catalog systems
Aggregate customer data across transactional, demographic, and behavioral dimensions
Perform sessionization and funnel analysis on clickstream data using Group By and pivoting
Extract structured data from unstructured logs using Regular Expression evaluation
Detect anomalies in time-series data through rolling averages and conditional logic
Normalize schemas across CSV, JSON, and XML sources into unified data lake structures
Implement data validation, deduplication, and quality controls for multi-format ingestion
Calculate derived metrics including customer lifetime value, engagement scores, and conversion rates
Route data dynamically using Switch/Case and Filter Rows for conditional processing
Output transformed data to staging and curated layers following data lake architecture patterns
By the end of this workshop series, you'll have mastered the complete spectrum of cloud-native data integration patterns using Pentaho Data Integration. You'll understand how to handle diverse source formats, implement sophisticated join and aggregation logic, perform advanced text parsing and time-series analysis, and build production-ready pipelines that leverage object storage for scalable, distributed data processing.
Instead of treating each data format as a unique challenge requiring custom scripts, you'll confidently design reusable, visual transformation workflows that automate complex integration scenarios - from operational reconciliation to customer intelligence to real-time anomaly detection.
Prerequisites: MinIO running with sample data populated; basic understanding of transformation concepts (steps, hops, preview); familiarity with joins and aggregations
Estimated Time: 4-6 hours total (individual workshops range from 20-60 minutes based on complexity)
Sales Dashboard: joins, lookups, aggregations
Inventory Reconciliation: XML parsing, outer joins, variance
Customer 360: multi-source, JSONL, calculations
Clickstream Funnel: sessionization, pivoting
Log Parsing: regex, time-series analysis
Data Lake Ingestion: schema normalization, validation
1. Verify that MinIO is running and populated.
2. Start Pentaho Data Integration.
Windows - PowerShell:
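As before, assuming the default workshop install path:
cd C:\Pentaho\design-tools\data-integration
.\Spoon.bat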
Linux:
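And the equivalent on Linux:
cd ~/Pentaho/design-tools/data-integration
./spoon.sh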
Workshops
Sales dashboard
The workshop demonstrates how Pentaho Data Integration enables organizations to rapidly create denormalized fact tables that power real-time business intelligence dashboards. By integrating data from multiple sources (customer data, product catalogs, and sales transactions), business users gain immediate access to actionable insights without waiting for IT to build complex data warehouses.
Scenario: A mid-sized e-commerce company needs to track daily sales performance across products, customer segments, and regions. Currently, sales managers wait 24-48 hours for IT to generate reports from disparate systems. With PDI, they can automate this process and refresh dashboards hourly.
Key Stakeholders:
Sales Directors: Need to identify top-performing products and regions
Marketing Teams: Require customer segmentation for targeted campaigns
Finance: Need accurate revenue reporting by product category
Operations: Must monitor inventory turnover rates

Text File Input
The Text File Input step is used to read data from a variety of different text-file types. The most commonly used formats include Comma Separated Values (CSV files) generated by spreadsheets and fixed width flat files.
The Text File Input step provides you with the ability to specify a list of files to read, or a list of directories with wildcards in the form of regular expressions. In addition, you can accept filenames from a previous step, making filename handling even more generic.

Drag & drop 3 Text File Input Steps onto the canvas.
Save the transformation as sales_dashboard_etl.ktr in your workshop folder.
Customers
Double-click on the first TFI step, and configure the following properties:
Step name: Customers
Filename: pvfs://MinIO/raw-data/csv/customers.csv
Delimiter: ,
Header row present: Yes
Format: mixed

Click: Get Fields to auto-detect columns.

Preview the data.

Business Significance:
customer_id: Primary key for joining to sales
country: Critical for geographic segmentation
status: Identifies churned vs. active customers
registration_date: Enables customer tenure analysis
Products (ERP system)
Double-click on the second TFI step, and configure the following properties:
Step name: Products
Filename: pvfs://MinIO/raw-data/csv/products.csv
Delimiter: ,
Header row present: Yes
Format: mixed

Click: Get Fields to auto-detect columns.

Preview the data.

Business Significance:
category: Enables product performance analysis by segment
price: Base pricing for margin calculations
stock_quantity: Inventory turnover insights
Sales (Order Management)
Double-click on the third TFI step, and configure the following properties:
Step name: Sales
Filename: pvfs://MinIO/raw-data/csv/sales.csv
Delimiter: ,
Header row present: Yes
Format: mixed

Click: Get Fields to auto-detect columns.

Preview data.

Business Significance:
sale_amount: Actual revenue (may include discounts)
quantity: Volume metrics for demand planning
payment_method: Payment preference insights
status: Filter out cancelled/refunded orders
Stream Lookup
A Stream lookup step enriches rows by looking up matching values from another stream.
In a transformation, you feed your main rows into one hop and a reference dataset into the other hop. The step then matches rows using key fields and returns the lookup fields on the output. It’s the in-memory alternative to a database lookup, but the reference stream must be available in the same transformation flow.
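In this workshop, for example, each Sales row's product_id is matched against the Products rows held in memory, and the matching product name, category, and price are appended to the Sales row; rows with no match receive null lookup values.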

Drag & drop 2 Stream Lookup steps onto the canvas.
Save the transformation (sales_dashboard_etl.ktr) in your workshop folder.
Product Lookup
Draw a Hop between the 'Sales' step & the 'Product Lookup' step.
Draw a Hop between the 'Products' step & the 'Product Lookup' step.
The Sales step acts as our fact table. It holds the transaction data for our Products & Customers.
Double-click on the 'Product Lookup' step, and configure the following properties:
General:
Step name: Product Lookup
Lookup step: Products
Keys:
Field (from Sales): product_id
Field (from Products): product_id
In Values to retrieve, add:
product_name
category (rename to product_category)
price (rename to unit_price)

Customers Lookup
Draw a Hop between the 'Product Lookup' step & the 'Customers Lookup' step.
Draw a Hop between the 'Customers' step & the 'Customers Lookup' step.
Double-click on the 'Customers Lookup' step, and configure the following properties:
Step name: Customers Lookup
Lookup step: Customers
Key field (stream): customer_id
Key field (lookup): customer_id
Values to retrieve:
first_name
last_name
country (rename to customer_country)
status (rename to customer_status)


Preview data
Save the transformation.
RUN & Preview the data.


Drag & drop a 'Calculator' step onto the canvas.
Draw a Hop from the 'Customers Lookup' step to the 'Calculator' step.
Double-click on the 'Calculator' step, and configure the following properties:
line_total = A * B, where A = quantity and B = unit_price (value type: Number)
profit_margin = A - B, where A = sale_amount and B = line_total (value type: Number)
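As a quick sanity check with hypothetical values: for a row with quantity 3, unit_price 20.00 and sale_amount 55.00, line_total is 60.00 and profit_margin is -5.00, which would indicate a discounted sale.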

Preview data
Save the transformation.
RUN & Preview the data.

Drag & drop a 'Formula' step onto the canvas.
Draw a Hop from the 'Calculator' step to the 'Formula' step.
Double-click on the 'Formula' step, and configure the following properties:
customer_full_name: CONCATENATE([first_name];" ";[last_name])
is_high_value: IF([sale_amount]>500;"Yes";"No")
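With hypothetical values first_name = 'Jane', last_name = 'Doe' and sale_amount = 750, these formulas return 'Jane Doe' and 'Yes' respectively.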

Preview data
Save the transformation.
RUN & Preview the data.


Drag & drop a 'Select values' step onto the canvas.
Draw a Hop from the 'Get system info' step to the 'Select values' step.
Double-click on the 'Select values' step, and configure the following properties:
On Select & Alter tab, choose fields in order:
sale_id
sale_date
customer_id
customer_full_name
customer_country
customer_status
product_id
product_name
product_category
quantity
unit_price
sale_amount
line_total
profit_margin
is_high_value
payment_method
status (rename to sale_status)
etl_timestamp
data_source

Preview data
Save the transformation.
RUN & Preview the data.


Drag & drop a 'Text file output' step onto the canvas.
Draw a Hop from the 'Select values' step to the 'Write to staging' step.
Double-click on the 'Write to staging' step, and configure the following properties:
Step name: Write to Staging
Filename: pvfs://MinIO/staging/dashboard/sales_fact
Extension: csv
Include date in filename: Yes
Separator: ,
Add header: Yes
Remember to click 'Get Fields' on the Fields tab.
MinIO
Save the transformation.
Log into MinIO:
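Browse to the staging/dashboard/ prefix: a successful run should have produced a CSV object whose name starts with sales_fact and includes the run date, since 'Include date in filename' is enabled.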
