MinIO
Access an S3-type object store through VFS.
Workshop - MinIO
MinIO is a high-performance, Kubernetes-native object storage system designed for cloud-native applications. Built from the ground up to be compatible with Amazon S3, MinIO offers a lightweight yet powerful alternative for organizations looking to deploy object storage in their own infrastructure.
At its core, MinIO provides high-performance distributed object storage. It's capable of handling millions of operations per second and can store petabytes of data while maintaining sub-millisecond latency. This performance is achieved through a simplified architecture that eliminates complex dependencies and optimizes for modern hardware capabilities.
One of MinIO's key strengths lies in its versatility. It can be deployed virtually anywhere - from bare metal servers to public, private, and edge cloud environments. Organizations particularly value its seamless integration with Kubernetes, making it an ideal choice for containerized environments. MinIO's open-source nature also provides transparency and flexibility that many enterprises require for their data infrastructure needs.

Please ensure you have completed the following setup: MinIO
To create a new Transformation:
By clicking File > New > Transformation
By using the CTRL+N hotkey
Either action opens a new Transformation tab for you to begin designing your transformation.
Log into MinIO.
Username: minioadmin
Password: minioadmin
If you have completed the MinIO setup, you should have pre-populated buckets containing various data objects in different formats.

New Bucket
If you need to create a Bucket:
Click the 'Create Bucket' link.
Enter the bucket name sales-data and click 'Create Bucket'.

Click on the Upload button.

Upload your data - for example, some sales data:
Windows
C:\Pentaho\design-tools\data-integration\samples\transformations\files
Linux
~/Pentaho/design-tools/data-integration/samples/transformations/files
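Alternatively, if you have the MinIO client (mc) installed, the bucket can be created and populated from the command line. This is only a sketch: the alias name 'local' is an arbitrary choice, and the source path assumes the Linux sample location above.
mc alias set local http://localhost:9000 minioadmin minioadmin
mc mb local/sales-data
mc cp --recursive ~/Pentaho/design-tools/data-integration/samples/transformations/files/ local/sales-data/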

Virtual File Systems
PDI allows you to establish connections to most Virtual File Systems (VFS) through VFS connections. These connections store the necessary properties to access specific file systems, eliminating the need to repeatedly enter configuration details.
Once you've added a VFS connection in PDI, you can reference it whenever you need to work with files or folders on that Virtual File System. This streamlines your workflow by allowing you to reuse connection information across multiple steps.
For instance, if you're working with Hitachi Content Platform (HCP), you can create a single VFS connection and then use it throughout all HCP transformation steps. This approach saves time and ensures consistency by removing the need to re-enter credentials or access information for each data operation.
Start Pentaho Data Integration.
Windows - PowerShell
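Assuming the install path used elsewhere in this guide:
cd C:\Pentaho\design-tools\data-integration
.\Spoon.bat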
Linux
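Assuming the equivalent Linux install path:
cd ~/Pentaho/design-tools/data-integration
./spoon.sh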
Create a VFS connection to the MinIO buckets
Click: 'View' Tab.
Right-click on VFS Connections > New.

Enter the following details:

Connection Name: MinIO
Connection Type: Minio/HCP
Description: Connection to sales-data bucket
S3 Connection Type: Minio/HCP
Access Key: minioadmin
Secret Key: minioadmin
Endpoint: http://localhost:9000 (MinIO API endpoint)
Signature Version: AWSS3V4SignerType
Path Style Access: enabled
Root Folder Path: /
Test the connection.
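Once the test succeeds, objects in MinIO can be addressed with paths of the form pvfs://MinIO/<bucket>/<object>, for example pvfs://MinIO/raw-data/csv/customers.csv, which is used later in this workshop.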
Workshops
Pentaho Data Integration: MinIO Object Storage Workshop Series
Modern organizations increasingly store their data in cloud-native object storage systems like MinIO and Amazon S3, moving away from traditional file servers and databases. This architectural shift enables scalable, cost-effective data lakes but introduces new challenges: data arrives in multiple formats (CSV, JSON, XML, Parquet), exists across distributed buckets, and requires sophisticated transformation pipelines to unlock its analytical value. Learning to efficiently extract, transform, and integrate data from object storage is now essential for any data integration professional working with contemporary data architectures.
In this comprehensive workshop series, you'll build progressively complex transformation pipelines that leverage MinIO object storage as both a source and destination for enterprise data integration scenarios. Starting with fundamental ETL patterns like denormalized fact table creation, you'll advance through intermediate challenges involving multi-format parsing and reconciliation, ultimately mastering advanced techniques like sessionization, anomaly detection, and schema normalization across heterogeneous data sources.
Each workshop introduces real-world business scenarios - from sales dashboards to customer analytics to operational monitoring - demonstrating PDI's versatility in solving diverse integration challenges while maintaining cloud-native architecture principles.
What You'll Accomplish:
Configure VFS (Virtual File System) connections to access S3-compatible MinIO object storage
Build multi-source ETL pipelines using Text File Input steps with S3 paths (s3a://)
Implement Stream Lookup and Merge Join patterns to enrich data from multiple CSV sources
Parse semi-structured formats including XML inventory feeds and JSONL event streams
Apply full outer joins to identify discrepancies between warehouse and catalog systems
Aggregate customer data across transactional, demographic, and behavioral dimensions
Perform sessionization and funnel analysis on clickstream data using Group By and pivoting
Extract structured data from unstructured logs using Regular Expression evaluation
Detect anomalies in time-series data through rolling averages and conditional logic
Normalize schemas across CSV, JSON, and XML sources into unified data lake structures
Implement data validation, deduplication, and quality controls for multi-format ingestion
Calculate derived metrics including customer lifetime value, engagement scores, and conversion rates
Route data dynamically using Switch/Case and Filter Rows for conditional processing
Output transformed data to staging and curated layers following data lake architecture patterns
By the end of this workshop series, you'll have mastered the complete spectrum of cloud-native data integration patterns using Pentaho Data Integration. You'll understand how to handle diverse source formats, implement sophisticated join and aggregation logic, perform advanced text parsing and time-series analysis, and build production-ready pipelines that leverage object storage for scalable, distributed data processing.
Instead of treating each data format as a unique challenge requiring custom scripts, you'll confidently design reusable, visual transformation workflows that automate complex integration scenarios - from operational reconciliation to customer intelligence to real-time anomaly detection.
Prerequisites: MinIO running with sample data populated; basic understanding of transformation concepts (steps, hops, preview); familiarity with joins and aggregations
Estimated Time: 4-6 hours total (individual workshops range from 20-60 minutes based on complexity)
Sales Dashboard: joins, lookups, aggregations
Inventory Reconciliation: XML parsing, outer joins, variance
Customer 360: multi-source, JSONL, calculations
Clickstream Funnel: sessionization, pivoting
Log Parsing: regex, time-series analysis
Data Lake Ingestion: schema normalization, validation
1. Verify that MinIO is running and populated.
2. Start Pentaho Data Integration.
Windows - PowerShell:
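As before, assuming the default workshop install path:
cd C:\Pentaho\design-tools\data-integration
.\Spoon.bat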
Linux:
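And the equivalent on Linux:
cd ~/Pentaho/design-tools/data-integration
./spoon.sh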
Workshops
Sales dashboard
The workshop demonstrates how Pentaho Data Integration enables organizations to rapidly create denormalized fact tables that power real-time business intelligence dashboards. By integrating data from multiple sources (customer data, product catalogs, and sales transactions), business users gain immediate access to actionable insights without waiting for IT to build complex data warehouses.
Scenario: A mid-sized e-commerce company needs to track daily sales performance across products, customer segments, and regions. Currently, sales managers wait 24-48 hours for IT to generate reports from disparate systems. With PDI, they can automate this process and refresh dashboards hourly.
Key Stakeholders:
Sales Directors: Need to identify top-performing products and regions
Marketing Teams: Require customer segmentation for targeted campaigns
Finance: Need accurate revenue reporting by product category
Operations: Must monitor inventory turnover rates

Text File Input
The Text File Input step is used to read data from a variety of different text-file types. The most commonly used formats include Comma Separated Values (CSV files) generated by spreadsheets and fixed width flat files.
The Text File Input step provides you with the ability to specify a list of files to read, or a list of directories with wildcards in the form of regular expressions. In addition, you can accept filenames from a previous step, making filename handling even more generic.

Drag & drop 3 Text File Input Steps onto the canvas.
Save the transformation as sales_dashboard_etl.ktr in your workshop folder.
Customers
Double-click on the first TFI step, and configure the following properties:
Step name: Customers
Filename: pvfs://MinIO/raw-data/csv/customers.csv
Delimiter: ,
Header row present: Yes
Format: mixed

Click: Get Fields to auto-detect columns.

Preview the data.

Business Significance:
customer_id: Primary key for joining to sales
country: Critical for geographic segmentation
status: Identifies churned vs. active customers
registration_date: Enables customer tenure analysis
Products (ERP system)
Double-click on the second TFI step, and configure the following properties:
Step name: Products
Filename: pvfs://MinIO/raw-data/csv/products.csv
Delimiter: ,
Header row present: Yes
Format: mixed

Click: Get Fields to auto-detect columns.

Preview the data.

Business Significance:
category: Enables product performance analysis by segment
price: Base pricing for margin calculations
stock_quantity: Inventory turnover insights
Sales (Order Management)
Double-click on the third TFI step, and configure the following properties:
Step name: Sales
Filename: pvfs://MinIO/raw-data/csv/sales.csv
Delimiter: ,
Header row present: Yes
Format: mixed

Click: Get Fields to auto-detect columns.

Preview data.

Business Significance:
sale_amount: Actual revenue (may include discounts)
quantity: Volume metrics for demand planning
payment_method: Payment preference insights
status: Filter out cancelled/refunded orders
Stream Lookup
A Stream lookup step enriches rows by looking up matching values from another stream.
In a transformation, you feed your main rows into one hop and a reference dataset into the other hop. The step then matches rows using key fields and returns the lookup fields on the output. It’s the in-memory alternative to a database lookup, but the reference stream must be available in the same transformation flow.
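In this workshop, for example, each Sales row's product_id is matched against the Products rows held in memory, and the matching product name, category, and price are appended to the Sales row; rows with no match receive null lookup values.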

Drag & drop 2 Stream Lookup steps onto the canvas.
Save the transformation (sales_dashboard_etl.ktr) in your workshop folder.
Product Lookup
Draw a Hop between the 'Sales' step & the 'Product Lookup' step.
Draw a Hop between the 'Products' step & the 'Product Lookup' step.
The Sales step acts as our fact table. It holds the transaction data for our Products & Customers.
Double-click on the 'Product Lookup' step, and configure the following properties:
General:
Step name: Product Lookup
Lookup step: Products
Keys:
Field (from Sales): product_id
Field (from Products): product_id
In Values to retrieve, add:
product_name
category (rename to product_category)
price (rename to unit_price)

Customers Lookup
Draw a Hop between the 'Product Lookup' step & the 'Customers Lookup' step.
Draw a Hop between the 'Customers' step & the 'Customers Lookup' step.
Double-click on the 'Customers Lookup' step, and configure the following properties:
Step name: Customers Lookup
Lookup step: Customers
Key field (stream): customer_id
Key field (lookup): customer_id
Values to retrieve:
first_name
last_name
country (rename to customer_country)
status (rename to customer_status)


Preview data
Save the transformation.
RUN & Preview the data.


Drag & drop a 'Calculator' step onto the canvas.
Draw a Hop from the 'Customers Lookup' step to the 'Calculator' step.
Double-click on the 'Calculator' step, and configure the following properties:
line_total = A * B, where A = quantity and B = unit_price (value type: Number)
profit_margin = A - B, where A = sale_amount and B = line_total (value type: Number)
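As a quick sanity check with hypothetical values: for a row with quantity 3, unit_price 20.00 and sale_amount 55.00, line_total is 60.00 and profit_margin is -5.00, which would indicate a discounted sale.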

Preview data
Save the transformation.
RUN & Preview the data.

Drag & drop a 'Formula' step onto the canvas.
Draw a Hop from the 'Calculator' step to the 'Formula' step.
Double-click on the 'Formula' step, and configure the following properties:
customer_full_name: CONCATENATE([first_name];" ";[last_name])
is_high_value: IF([sale_amount]>500;"Yes";"No")
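With hypothetical values first_name = 'Jane', last_name = 'Doe' and sale_amount = 750, these formulas return 'Jane Doe' and 'Yes' respectively.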

Preview data
Save the transformation.
RUN & Preview the data.


Drag & drop a 'Select values' step onto the canvas.
Draw a Hop from the 'Get system info' step to the 'Select values' step.
Double-click on the 'Select values' step, and configure the following properties:
On Select & Alter tab, choose fields in order:
sale_id
sale_date
customer_id
customer_full_name
customer_country
customer_status
product_id
product_name
product_category
quantity
unit_price
sale_amount
line_total
profit_margin
is_high_value
payment_method
status (rename to sale_status)
etl_timestamp
data_source

Preview data
Save the transformation.
RUN & Preview the data.


Drag & drop a 'Text file output' step onto the canvas.
Draw a Hop from the 'Select values' step to the 'Write to staging' step.
Double-click on the 'Write to staging' step, and configure the following properties:
Step name: Write to Staging
Filename: pvfs://MinIO/staging/dashboard/sales_fact
Extension: csv
Include date in filename: Yes
Separator: ,
Add header: Yes
Remember to click 'Get Fields' on the Fields tab.
MinIO
Save the transformation.
Log into MinIO:
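Browse to the staging/dashboard/ prefix: a successful run should have produced a CSV object whose name starts with sales_fact and includes the run date, since 'Include date in filename' is enabled.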
