MinIO
Hands-on workshops using MinIO as an S3-compatible object store.
Workshop series: PDI + MinIO (S3)
Build hands-on Pentaho Data Integration (PDI) transformations that read from and write to MinIO using VFS.
Workshops get harder as you go. Start with CSV joins. Then move into XML/JSON parsing, reconciliation, and multi-format ingestion.
Workshops in this series
Sales Dashboard (CSV inputs + lookups + output)
Inventory Reconciliation (XML + CSV + variance detection)
Customer 360 (multi-source joins + metrics)
Clickstream Funnel (sessionization + pivoting)
Log Parsing (regex + time-series checks)
Data Lake Ingestion (schema normalization + validation)
You’ll practice
Connecting to MinIO buckets with VFS
Reading and writing objects with pvfs://MinIO/... paths
Joining and enriching streams (lookups and joins)
Parsing XML and JSON
Validating and shaping data for a curated layer
Prerequisites: MinIO running with sample data populated; basic transformation concepts; basic joins and aggregations
Estimated time: 4–6 hours total (each workshop is ~20–60 minutes)
Sales Dashboard
joins, lookups, aggregations
Inventory Reconciliation
XML parsing, outer joins, variance
Customer 360
multi-source, JSONL, calculations
Clickstream Funnel
sessionization, pivoting
Log Parsing
regex, time-series analysis
Data Lake Ingestion
schema normalization, validation
Complete the setup first: Storage: MinIO
Verify that MinIO is running and populated.
Start Pentaho Data Integration (Spoon).
Sales Dashboard
The workshop demonstrates how Pentaho Data Integration enables organizations to rapidly create denormalized fact tables that power real-time business intelligence dashboards. By integrating data from multiple sources (customer data, product catalogs, and sales transactions), business users gain immediate access to actionable insights without waiting for IT to build complex data warehouses.
Scenario: A mid-sized e-commerce company needs to track daily sales performance across products, customer segments, and regions. Currently, sales managers wait 24-48 hours for IT to generate reports from disparate systems. With PDI, they can automate this process and refresh dashboards hourly.
Key Stakeholders:
Sales Directors: Need to identify top-performing products and regions
Marketing Teams: Require customer segmentation for targeted campaigns
Finance: Need accurate revenue reporting by product category
Operations: Must monitor inventory turnover rates
Workshop files
These files are already in MinIO:
pvfs://MinIO/raw-data/csv/sales.csv
pvfs://MinIO/raw-data/csv/products.csv
pvfs://MinIO/raw-data/csv/customers.csv
Output path used later: pvfs://MinIO/staging/dashboard/
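Before opening Spoon, it can help to confirm the inputs are actually in the raw-data bucket. A minimal Python sketch; the endpoint and credentials in the comment are placeholder MinIO defaults, not values from this workshop:

```python
# Expected workshop objects in the raw-data bucket.
EXPECTED = ["csv/sales.csv", "csv/products.csv", "csv/customers.csv"]

def missing_objects(found_keys, expected=EXPECTED):
    """Return the expected object keys that are absent from found_keys."""
    found = set(found_keys)
    return [key for key in expected if key not in found]

# With the third-party 'minio' client (pip install minio), assuming the
# default local endpoint and credentials -- adjust to your deployment:
#   from minio import Minio
#   client = Minio("localhost:9000", access_key="minioadmin",
#                  secret_key="minioadmin", secure=False)
#   keys = [o.object_name for o in client.list_objects("raw-data", recursive=True)]
#   print(missing_objects(keys))   # [] when everything is in place
```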

Create a new transformation.
Use any of these options:
Select File > New > Transformation
Use Ctrl+N (Windows/Linux) or Cmd+N (macOS)
Follow the steps to create the transformation:
Text File Input
The Text File Input step reads data from a variety of text-file types. The most commonly used formats include comma-separated values (CSV) files generated by spreadsheets and fixed-width flat files.
The Text File Input step lets you specify a list of files to read, or a list of directories with wildcards in the form of regular expressions. You can also accept filenames from a previous step, making filename handling even more generic.

VFS connection names are case-sensitive. These examples assume your connection name is MinIO.
Drag & drop 3 Text File Input Steps onto the canvas.
Save the transformation as sales_dashboard_etl.ktr in your workshop folder.
Sales (Order Management)
Double-click on the first TFI step, and configure with the following properties:
Step name
Sales
Filename
pvfs://MinIO/raw-data/csv/sales.csv
Delimiter
,
Header row present
✅
Format
mixed

Click: Get Fields to auto-detect columns.
Business Logic: Note that sale_amount may differ from price * quantity due to:
Volume discounts
Promotional pricing
Customer-specific pricing tiers
Currency conversion (for international sales)

Preview data.

Business Significance:
sale_amount: Actual revenue (may include discounts)
quantity: Volume metrics for demand planning
payment_method: Payment preference insights
status: Filter out cancelled/refunded orders
Products (ERP system)
Double-click on the second TFI step, and configure with the following properties:
Step name
Products
Filename
pvfs://MinIO/raw-data/csv/products.csv
Delimiter
,
Header row present
✅
Format
mixed

Click: Get Fields to auto-detect columns.

Preview the data.

Business Significance:
category: Enables product performance analysis by segment
price: Base pricing for margin calculations
stock_quantity: Inventory turnover insights
Customers (CRM System)
Double-click on the third TFI step, and configure with the following properties:
Step name
Customers
Filename
pvfs://MinIO/raw-data/csv/customers.csv
Delimiter
,
Header row present
✅
Format
mixed

Click: Get Fields to auto-detect columns.

Preview the data.

Business Significance:
customer_id: Primary key for joining to sales
country: Critical for geographic segmentation
status: Identifies churned vs. active customers
registration_date: Enables customer tenure analysis
Stream Lookup
A Stream lookup step enriches rows by looking up matching values from another stream.
In a transformation, you feed your main rows into one hop and a reference dataset into the other hop. The step then matches rows using key fields and returns the lookup fields on the output. It’s the in-memory alternative to a database lookup, but the reference stream must be available in the same transformation flow.
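The steps below are easier to follow if you picture what Stream lookup does internally: it loads the reference stream into an in-memory index, then enriches each main-stream row by key. A minimal Python sketch (field names mirror this workshop; the data is illustrative):

```python
# Sketch of Stream lookup: build an in-memory index from the reference
# stream, then enrich each main-stream row by key.
def stream_lookup(main_rows, lookup_rows, key, fields):
    """Enrich main_rows with selected fields from lookup_rows, matched on key.

    fields maps source field name -> output (possibly renamed) field name.
    """
    index = {row[key]: row for row in lookup_rows}   # reference stream in memory
    out = []
    for row in main_rows:
        match = index.get(row[key], {})
        enriched = dict(row)
        for src, dst in fields.items():              # retrieve + rename fields
            enriched[dst] = match.get(src)           # None when no match found
        out.append(enriched)
    return out

# Illustrative rows shaped like sales.csv and products.csv:
sales = [{"sale_id": 1, "product_id": "PROD-001", "quantity": 2}]
products = [{"product_id": "PROD-001", "product_name": "Widget",
             "category": "Tools", "price": 9.99}]
rows = stream_lookup(sales, products, "product_id",
                     {"product_name": "product_name",
                      "category": "product_category",
                      "price": "unit_price"})
print(rows[0]["product_category"])  # → Tools
```

Because the whole reference stream is held in memory, Stream lookup suits small-to-medium dimension data; a database lookup is the alternative for very large reference sets.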

Drag & drop 2 Stream lookup steps onto the canvas.
Save transformation as:
sales_dashboard_etl.ktrin your workshop folder.
Product Lookup
Draw a hop between Sales and Product Lookup.
Draw a hop between Products and Product Lookup.
The Sales stream acts as our fact table. It holds the transaction data for our products and customers.
Double-click on the 'Product Lookup' step, and configure with the following properties:
General
Step name
Product Lookup
General
Lookup step
Products
Keys
Field (from Sales)
product_id
Keys
Field (from Products)
product_id
In Values to retrieve, add:
product_name (rename to product_name)
category (rename to product_category)
price (rename to unit_price)

Customers Lookup
Draw a hop between Product Lookup and Customers Lookup.
Draw a hop between Customers and Customers Lookup.
Double-click Customers Lookup, and configure the following properties:
Step name
Customers Lookup
Lookup step
Customers
Key field (stream)
customer_id
Key field (lookup)
customer_id
Values to retrieve:
first_name
last_name
country (rename to customer_country)
status (rename to customer_status)

Preview data
Save the transformation.
RUN & Preview the data.

Calculator
The Calculator step provides predefined functions that you can run on input field values. Use Calculator as a quick alternative to custom JavaScript for common calculations.
To use Calculator, specify the input fields and the calculation type, and then write results to new fields. You can also remove temporary fields from the output after all values are calculated.

Drag & drop a 'Calculator' step onto the canvas.
Draw a Hop from the 'Customers Lookup' step to the 'Calculator' step.
Double-click on the 'Calculator' step, and configure the following properties:
New field | Calculation | Field A | Field B | Value type
line_total | A * B | quantity | unit_price | Number
discount_amount | A - B | line_total | sale_amount | Number

Preview data
Save the transformation.
RUN & Preview the data.

Business Insight Enabled:
Positive discount_amount: Customer received a discount (common)
Negative discount_amount: Customer paid more than list price (expedite, premium, etc.)
Zero discount_amount: Sold at list price
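The Calculator rows can be sketched in plain Python. This assumes the sign convention above, where a positive discount_amount means the customer paid less than list price; the row values are illustrative:

```python
# Illustrative row; field names match the stream at this point.
row = {"quantity": 3, "unit_price": 25.00, "sale_amount": 67.50}

row["line_total"] = row["quantity"] * row["unit_price"]           # A * B
# Assumed convention: list-price total minus actual sale amount, so a
# positive result means the customer received a discount.
row["discount_amount"] = row["line_total"] - row["sale_amount"]   # A - B

print(row["line_total"], row["discount_amount"])  # → 75.0 7.5
```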
Formula
The Formula step can calculate Formula Expressions within a data stream. It can be used to create simple calculations like [A]+[B] or more complex business logic with a lot of nested if / then logic.

Drag & drop a 'Formula' step onto the canvas.
Draw a Hop from the 'Calculator' step to the 'Formula' step.
Double-click on the 'Formula' step, and configure the following properties:
customer_full_name
CONCATENATE([first_name];" ";[last_name])
is_high_value
IF([sale_amount]>500;"Yes";"No")
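The two Formula expressions map directly onto ordinary string and conditional logic. A sketch with illustrative values:

```python
# Illustrative field values for one row of the stream.
first_name, last_name, sale_amount = "Jane", "Doe", 725.00

# CONCATENATE([first_name];" ";[last_name])
customer_full_name = f"{first_name} {last_name}"
# IF([sale_amount]>500;"Yes";"No")
is_high_value = "Yes" if sale_amount > 500 else "No"

print(customer_full_name, is_high_value)  # → Jane Doe Yes
```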

Preview data
Save the transformation.
RUN & Preview the data.

Business Applications:
is_high_value: Trigger VIP customer service workflows
Add Constants
The Add constant values step is a simple and high performance way to add constant values to the stream.

Drag & drop 'Add constants' step onto the canvas.
Draw a Hop from the 'Formula' step to the 'Add constants' step.
Double-click on the 'Add constants' step, and configure the following properties:
data_source
String
minio_workshop

Get system info
This step retrieves system information from the Kettle environment. The step includes a table where you can designate a name and assign it to any available system info type you want to retrieve. This step generates a single row with the fields containing the requested information.
It can also accept any number of input streams, aggregate any fields defined by this step, and send the combined results to the output stream.

Drag & drop 'Get system info' step onto the canvas.
Draw a Hop from the 'Add constants' step to the 'Get system info' step.
Double-click on the Get system info step, and configure the following properties:
etl_timestamp
system date (variable)

Select Values
The Select Values step can perform all the following actions on fields in the PDI stream:
Select fields - Use this tab to select, reorder, and rename the fields sent to the output stream.
Remove fields - Use this tab to remove fields from the input stream.
Meta-data - Use this tab to change field types, lengths, and formats.

Drag & drop a 'Select values' step onto the canvas.
Draw a Hop from the 'Get system info' step to the 'Select values' step.
Double-click on the 'Select values' step, and configure the following properties:
On Select & Alter tab, choose fields in order:
sale_id
sale_date
customer_id
customer_full_name
customer_country
customer_status
product_id
product_name
product_category
quantity
unit_price
sale_amount
line_total
discount_amount
is_high_value
payment_method
status (rename to sale_status)
etl_timestamp
data_source

Preview data
Save the transformation.
RUN & Preview the data.

Text file output
The Text File Output step exports rows to a text file.
This step is commonly used to generate delimited files (for example, CSV) that can be read by spreadsheet applications, and it can also generate fixed-length output.
You can’t run this step in parallel to write to the same file.
If you need to run multiple copies, select Include stepnr in filename and merge the resulting files afterward.

Drag & drop a Text file output step onto the canvas.
Draw a Hop from the 'Select values' step to the 'Write to staging' step.
Double-click on the 'Write to staging' step, and configure with the following properties:
Step name
Write to Staging
Filename
pvfs://MinIO/staging/dashboard/sales_fact
Extension
csv
Include date/time in filename
✅
Separator
,
Add header
✅
Select Get fields to populate the output fields.
Business Benefit: Timestamped files enable:
Historical tracking: "What did the data look like last Tuesday?"
Incremental processing: Process the latest file without overwriting history
Rollback capability: "The 3pm run had bad data, revert to 2pm version"
MinIO
Save the transformation.
Log into MinIO and verify the timestamped output file appears under staging/dashboard/:

Checklist
Inventory Reconciliation - XML + CSV Integration
This workshop demonstrates how Pentaho Data Integration eliminates costly inventory discrepancies by automatically reconciling data between warehouse management systems (XML feeds) and ERP product catalogs (CSV files). Organizations lose millions annually due to inventory inaccuracies, stockouts, and overstocking. PDI's ability to parse complex XML and perform full outer joins enables real-time discrepancy detection that would require hours of manual spreadsheet work.
Business Value Delivered:
Cost Reduction: Eliminate manual reconciliation labor ($75K-150K annually per analyst)
Inventory Optimization: Reduce excess inventory carrying costs by 15-25%
Stockout Prevention: Identify missing items before customers notice
Compliance: Audit trail for SOX, ISO 9001, and supply chain regulations
Real-Time Visibility: Know your actual inventory position within minutes, not days
Scenario: A manufacturing company operates 12 distribution warehouses. Each warehouse uses a legacy WMS (Warehouse Management System) that exports XML inventory files nightly. The corporate ERP system maintains a CSV product master catalog. Discrepancies cause:
Phantom stock: ERP shows item in stock, warehouse says it's not → Lost sales
Ghost inventory: Warehouse has items ERP doesn't recognize → Dead capital
Quantity variances: Mismatches of 10+ units trigger expensive physical counts
Key Stakeholders:
Supply Chain Directors: Need accurate inventory positions across all locations
Warehouse Managers: Require daily reconciliation reports to prioritize cycle counts
Finance Teams: Must report accurate inventory valuations for financial statements
Procurement: Need to identify slow-moving items and prevent overstocking
Workshop files
These files are already in MinIO:
pvfs://MinIO/raw-data/xml/inventory.xml
pvfs://MinIO/raw-data/csv/products.csv
Planned output path: pvfs://MinIO/staging/inventory/reconciliation/

Create a new transformation.
Use any of these options:
Select File > New > Transformation
Use Ctrl+N (Windows/Linux) or Cmd+N (macOS)
Follow the steps to create the transformation:
Drag & drop 'Get data from XML' onto the canvas.
Save the transformation as inventory_reconciliation.ktr in your workshop folder.
Double-click on the 'Get data from XML' step, and configure with the following properties:
Step name
Read Warehouse XML
File or directory
pvfs://MinIO/raw-data/xml/inventory.xml
Loop XPath
/inventory/items/item
Encoding
UTF-8
Ignore comments
✅
Validate XML
No
Ignore empty file
✅
XPath Explanation:
/inventory = Start at the root element
/items = Navigate to the items container
/item = Loop over each item element
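For intuition, the same loop can be reproduced with Python's stdlib XML parser. The two-item document below is a made-up sample shaped like the workshop file, and the warehouse_ field names follow the remapping convention used in this workshop:

```python
# Sketch: what the Loop XPath /inventory/items/item does -- iterate over
# each <item> and read its child elements as fields.
import xml.etree.ElementTree as ET

xml_doc = """
<inventory>
  <items>
    <item><sku>PROD-001</sku><name>Widget</name>
          <quantity>40</quantity><location>A-12</location></item>
    <item><sku>PROD-002</sku><name>Gadget</name>
          <quantity>15</quantity><location>B-07</location></item>
  </items>
</inventory>
"""

root = ET.fromstring(xml_doc)
rows = []
for item in root.findall("./items/item"):   # loop over each <item> element
    rows.append({
        "warehouse_item_name": item.findtext("name"),
        "warehouse_quantity": int(item.findtext("quantity")),
        "warehouse_location": item.findtext("location"),
    })
print(len(rows), rows[0]["warehouse_item_name"])  # → 2 Widget
```

Note that field XPaths in the PDI step are relative to the loop node, just as `findtext("name")` is relative to each `<item>` here.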
Browse & Add the path to the inventory.xml
Click on the Content tab

Click on the Fields tab & Get Fields.
Remap the fields & Preview rows.
Business Field Naming:
Prefix with warehouse_ to distinguish from ERP fields later.
warehouse_quantity vs. stock_quantity makes joins clearer.
Keep original field names in a data dictionary for auditing.
warehouse_item_name
name
warehouse_quantity
quantity
warehouse_location
location
last_physical_count
last_checked

Next: configure the product catalog input, then join the two streams.
Status: Draft. Add a Text file input step for pvfs://MinIO/raw-data/csv/products.csv.
Status: Draft. Join warehouse items to the ERP product master using a full outer join.
Status: Draft. Write discrepancy rows to pvfs://MinIO/staging/inventory/reconciliation/.
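The draft join-and-variance steps can be sketched ahead of time. This pure-Python stand-in models the full outer join and the three discrepancy classes described above (PDI's Merge join step would do the actual join); the quantities and the 10-unit threshold are illustrative:

```python
# Sketch of the reconciliation: full outer join on product_id, then
# classify each key into a discrepancy category.
warehouse = {"PROD-001": 40, "PROD-003": 5}    # id -> warehouse_quantity
erp       = {"PROD-001": 52, "PROD-002": 10}   # id -> stock_quantity

report = []
for pid in sorted(set(warehouse) | set(erp)):  # union of keys = full outer join
    w, e = warehouse.get(pid), erp.get(pid)
    if w is None:
        status = "phantom_stock"      # ERP only: lost-sales risk
    elif e is None:
        status = "ghost_inventory"    # warehouse only: dead capital
    elif abs(w - e) >= 10:
        status = "variance"           # mismatch triggers a physical count
    else:
        status = "ok"
    report.append((pid, w, e, status))

for entry in report:
    print(entry)
```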
Customer 360
Create unified customer profiles combining demographic data, purchase history, and behavioral events.
Skills: Multiple joins, JSONL parsing, aggregations, calculated metrics

Workshop files
Current draft inputs (already in MinIO):
pvfs://MinIO/raw-data/csv/customers.csv
pvfs://MinIO/raw-data/csv/sales.csv
Status: Draft. This workshop is incomplete.
Create a new transformation.
Use any of these options:
Select File > New > Transformation
Use Ctrl+N (Windows/Linux) or Cmd+N (macOS)
Drag & drop 'Text file input' steps onto the canvas.
Save the transformation as customer_360.ktr in your workshop folder.
Sales (Order Management)
Double-click on the first TFI step, and configure with the following properties:
Step name
Sales
Filename
pvfs://MinIO/raw-data/csv/sales.csv
Delimiter
,
Header row present
✅
Format
mixed

Click: Get Fields to auto-detect columns.
Business Logic: Note that sale_amount may differ from price * quantity due to:
Volume discounts
Promotional pricing
Customer-specific pricing tiers
Currency conversion (for international sales)

Preview data.

Business Significance:
sale_amount: Actual revenue (may include discounts)
quantity: Volume metrics for demand planning
payment_method: Payment preference insights
status: Filter out cancelled/refunded orders
Drag & drop 'Sort rows' steps onto the canvas.
Create a Hop between the 'Sales' step & 'Sort rows'.
Double-click on 'Sort rows' and configure the sort keys.
Status: Draft. Define sales-level aggregations (for example, total spend per customer).
Status: Draft. Join the customer, sales, and user event streams.
Status: Draft. Create one row per customer and write to pvfs://MinIO/staging/customer360/.
Transactions & Fraud Detection
Objective: Process credit card transactions, enrich with account and merchant data, calculate transaction metrics, and detect suspicious patterns using rule-based fraud detection.
Skills: Financial data processing, multi-table joins, running totals, rule-based fraud detection, transaction velocity analysis
Business Context: A payment processor needs to analyze transaction data in real-time to detect potentially fraudulent activity before authorizing transactions. The system must flag high-risk transactions based on amount thresholds, unusual merchant activity, account balance checks, and transaction velocity patterns.
Workshop files
Status: Draft. Sample files are not published yet.
Status: Draft. Steps coming soon.
Data Lake Ingestion
Modern data lakes often receive the same entities (products, customers, orders) from multiple sources in different formats. This workshop demonstrates how to ingest, normalize, validate, and deduplicate multi-format data into a unified schema - a common data engineering pattern.
Objective: Combine data from CSV, JSON, and XML into a unified product schema.
Skills: Multi-format parsing, schema normalization, data validation, deduplication
Workshop files
These files are already in MinIO:
pvfs://MinIO/raw-data/csv/products.csvpvfs://MinIO/raw-data/json/api_response.jsonpvfs://MinIO/raw-data/xml/inventory.xml
Create a new transformation.
Use any of these options:
Select File > New > Transformation
Use Ctrl+N (Windows/Linux) or Cmd+N (macOS)
Define Target Schema
Objective: Design a unified schema that accommodates all source formats.
Why Important: Before ingesting data, you need a clear target schema. This ensures consistency across all sources and makes downstream analytics easier.
Field | Type | Length | Description | Source mapping
product_id | String | 50 | Unique product identifier | CSV: product_id, JSON: product_id, XML: sku
product_name | String | 200 | Product display name | CSV: product_name, JSON: product_name, XML: name
category | String | 100 | Product category | CSV: category, JSON: (derived from order type), XML: category
price | Number | 15,2 | Unit price in USD | CSV: price, JSON: unit_price, XML: null (not available)
quantity | Integer | 10 | Available stock quantity | CSV: stock_quantity, JSON: quantity, XML: quantity
source_system | String | 10 | Origin system identifier | Constant: 'csv', 'json', or 'xml'
ingestion_time | Timestamp | - | When record was ingested | System timestamp
Schema Discovery & Analysis
Objective: Understand each source structure before you design the target schema.
Why it matters: You can’t normalize what you haven’t inspected.
Inspect each Data Source
Use real samples. Avoid guessing field names.
Findings (CSV)
Has product_id, product_name, category, price, stock_quantity.
Completeness looks high.
Naming is consistent and explicit.
Findings (JSON)
Has product_id and product_name.
Uses unit_price instead of price.
quantity is order quantity, not stock.
category is missing.
Path is $.data.orders[*].items[*].
Findings (XML)
Uses sku for product_id.
Uses name for product_name.
Has category and warehouse quantity.
price is missing.
location is extra for a product master.
Build a field mapping matrix
This shows name differences and missing fields.
Field | CSV | JSON | XML | Notes
Identifier | product_id | product_id | sku | Same meaning. Different name in XML.
Name | product_name | product_name | name | Same meaning. Different name in XML.
Category | category | ❌ | category | Missing in JSON.
Price | price | unit_price | ❌ | Different name in JSON. Missing in XML.
Stock quantity | stock_quantity | quantity | quantity | JSON quantity is not stock.
What to watch
Missing data is normal in multi-source ingestion.
Same name can mean different things.
Make schema decisions
Write these down. You will forget them later.
Field names
Use CSV naming as the standard.
Map XML sku → product_id and name → product_name.
Map JSON unit_price → price.
Missing fields
Missing category in JSON: set a default like E-commerce.
Missing price in XML: leave NULL.
Data types
product_id: string. It contains a PROD- prefix.
product_name: string. Allow up to 200 chars.
category: string. Allow up to 100 chars.
price: decimal(15,2).
quantity: integer.
Metadata
Add source_system for lineage.
Add ingestion_time for auditability.
Define a deduplication rule
Same product_id can appear in multiple sources.
Recommended rule
Prefer CSV.
Then JSON.
Then XML.
Implement this with source_priority (CSV=1, JSON=2, XML=3).
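The priority rule can be sketched as a stable keep-first dedup: sort by source_priority, then keep the first row seen per product_id. The rows here are illustrative:

```python
# Dedup rule: prefer CSV, then JSON, then XML, per product_id.
PRIORITY = {"csv": 1, "json": 2, "xml": 3}   # source_priority values

rows = [
    {"product_id": "PROD-001", "source_system": "xml"},
    {"product_id": "PROD-001", "source_system": "csv"},
    {"product_id": "PROD-002", "source_system": "json"},
]

best = {}
for row in sorted(rows, key=lambda r: PRIORITY[r["source_system"]]):
    best.setdefault(row["product_id"], row)   # first (highest-priority) wins

deduped = list(best.values())
print(sorted((r["product_id"], r["source_system"]) for r in deduped))
# → [('PROD-001', 'csv'), ('PROD-002', 'json')]
```

In PDI the same effect comes from adding a source_priority field, sorting on product_id and source_priority, and keeping the first row per key with Unique rows.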
Checklist
You inspected real records for each source.
You captured paths for nested formats.
You documented mappings and type choices.
You decided how to handle missing data.
You decided how to dedupe collisions.
Path convention used below: pvfs://MinIO/...
MinIO is the VFS connection name. It must match your connection exactly.
Ingest CSV products
Goal: Read products.csv and map it to the unified schema.
Path: pvfs://MinIO/raw-data/csv/products.csv
Add a Text file input step.
Step name: Read CSV Products
File/directory: pvfs://MinIO/raw-data/csv/products.csv
Separator: ,
Enclosure: " (double quote)
Header row present: enabled
On Fields, select Get Fields.
Add a Select values step.
Step name: Map CSV to Target Schema
Rename stock_quantity → quantity
Add Add constants.
Step name: Add CSV Metadata
Add field source_system = csv
Add Get System Info.
Step name: Add Ingestion Timestamp
Add field ingestion_time = system date (variable)
Preview check
Expected rows: 12
product_id, product_name, category should be populated.
Ingest JSON order items
Goal: Extract product fields from nested JSON order items.
Path: pvfs://MinIO/raw-data/json/api_response.json
quantity in JSON is order quantity, not stock quantity.
Keep it as quantity only if that’s what you want to model.
Add a JSON Input step.
Step name: Read JSON Products
File: pvfs://MinIO/raw-data/json/api_response.json
Ignore empty file: enabled
On Fields, use explicit JSONPaths (recommended):
product_id: $.data.orders[*].items[*].product_id
product_name: $.data.orders[*].items[*].product_name
unit_price: $.data.orders[*].items[*].unit_price
quantity: $.data.orders[*].items[*].quantity
Alternative approach (base path + relative field paths)
If your PDI build supports a base "Path" for the JSON Input step, set:
Base path: $.data.orders[*].items[*]
Then set field paths relative to the base:
product_id: product_id
product_name: product_name
unit_price: unit_price
quantity: quantity
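Whichever path style you use, the extraction flattens every item of every order into one row stream. A stdlib sketch with a made-up two-order document shaped like the assumed source layout (data → orders → items):

```python
# Sketch: what $.data.orders[*].items[*] produces -- one row per item,
# across all orders.
import json

doc = json.loads("""
{"data": {"orders": [
  {"order_id": "ORD-1",
   "items": [{"product_id": "PROD-001", "product_name": "Widget",
              "unit_price": 9.99, "quantity": 2}]},
  {"order_id": "ORD-2",
   "items": [{"product_id": "PROD-002", "product_name": "Gadget",
              "unit_price": 4.50, "quantity": 1}]}
]}}
""")

# Flatten the nested orders/items arrays into a single row stream.
items = [item for order in doc["data"]["orders"] for item in order["items"]]
print(len(items), items[0]["unit_price"])  # → 2 9.99
```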
Add a Select values step.
Step name: Map JSON to Target Schema
Rename unit_price → price
Add Add constants.
Step name: Add JSON Metadata
source_system = json
category = E-commerce (default)
Add Get System Info.
Step name: Add JSON Ingestion Timestamp
ingestion_time = system date (variable)
Preview check
Expected rows: ~10–15 (can vary with the sample file).
product_name should not be NULL.
Ingest XML inventory items
Goal: Extract inventory items from XML using XPath.
Path: pvfs://MinIO/raw-data/xml/inventory.xml
Add Get data from XML.
Step name: Read XML Products
File: pvfs://MinIO/raw-data/xml/inventory.xml
Loop XPath: /inventory/items/item
On Fields, add:
sku (String)
name (String)
category (String)
quantity (Integer)
Field XPaths are relative to the loop node.
Example: sku means “read the <sku> element under each <item>”.
Add a Select values step.
Step name: Map XML to Target Schema
Rename sku → product_id
Rename name → product_name
Add a new field price in Meta-data (type Number). Leave it empty (NULL).
Add Add constants.
Step name: Add XML Metadata
source_system = xml
Add Get System Info.
Step name: Add XML Ingestion Timestamp
ingestion_time = system date (variable)
Preview check
Expected rows: ~8–10
If you get 0 rows, re-check the Loop XPath.
Merge Streams
Objective: Merge all three data streams (CSV, JSON, XML) into one unified stream.
Why Append Streams: This step stacks all rows from different sources vertically - like a SQL UNION ALL.
Configuration:
Add Append streams step
Name: "Combine All Products"
Connect all three streams to this step:
"Add Ingestion Timestamp" (CSV branch) → Append streams
"Add JSON Ingestion Timestamp" (JSON branch) → Append streams
"Add XML Ingestion Timestamp" (XML branch) → Append streams
Important: All input streams MUST have the same fields with the same names and types:
product_id (String)
product_name (String)
category (String)
price (Number) - can be null
quantity (Integer)
source_system (String)
ingestion_time (Timestamp)
Expected Output:
Row count: ~30-35 rows (12 CSV + 10-15 JSON + 8-10 XML)
All products from all sources combined
Some products will appear multiple times (duplicates to be handled in Step 7)
Preview Check:
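Semantically, Append streams is a plain concatenation, which is why the field layouts must match exactly. A sketch with one illustrative row per branch:

```python
# Sketch of Append streams: UNION ALL semantics -- rows stacked
# vertically, no deduplication, identical field names on every branch.
csv_rows  = [{"product_id": "PROD-001", "price": 9.99, "source_system": "csv"}]
json_rows = [{"product_id": "PROD-002", "price": 4.50, "source_system": "json"}]
xml_rows  = [{"product_id": "PROD-003", "price": None, "source_system": "xml"}]

combined = csv_rows + json_rows + xml_rows   # UNION ALL: duplicates survive

# Every row must expose the same field set, or the append is invalid.
assert all(set(r) == {"product_id", "price", "source_system"} for r in combined)
print(len(combined))  # → 3
```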
Data Validation
Objective: Validate data quality and route bad records to error handling.
Why Important: Multi-source data often has quality issues. Better to catch and handle them explicitly than have them cause downstream failures.
Configuration:
Add Data Validator step
Name: "Validate Product Data"
Validations tab - Add validation rules:
Fieldname | Validation Type | Configuration | Error Message
product_id | NOT NULL | | Product ID is required
product_id | NOT EMPTY STRING | | Product ID cannot be empty
product_name | NOT NULL | | Product name is required
product_name | NOT EMPTY STRING | | Product name cannot be empty
price | NUMERIC RANGE | Min: 0, Max: 999999 | Price must be >= 0 (if present)
quantity | NUMERIC RANGE | Min: 0, Max: 999999 | Quantity must be >= 0
Options tab:
☑ Concatenate errors: Shows all validation errors for a row
Separator: , (comma-space)
☑ Output all errors as one field: validation_errors
Add Filter rows step after Data Validator
Name: "Route Valid vs Invalid"
Condition:
True (valid records) → Continue to deduplication
False (invalid records) → Error output
Add Text file output for errors (connect from False branch):
Name: "Write Error Records"
Filename:
pvfs://MinIO/curated/products/errors/validation_errors_${Internal.Job.Start.Date.yyyyMMdd}.csv
Include date in filename: Helps track when errors occurred
Fields to output: All fields + validation_errors
Expected Output:
Valid records: ~95-100% should pass (25-35 rows)
Invalid records: 0-5% to error file (0-2 rows)
Common Validation Failures:
Empty product_id or product_name
Negative price or quantity values
Non-numeric values in numeric fields
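The validator rules plus the Filter rows split can be sketched as one function that returns the concatenated error list per row. The sample rows and exact error texts follow the table above; the data is illustrative:

```python
# Sketch of Data Validator + Filter rows: collect all rule violations
# per row, then route rows with any errors to the error output.
def validate(row):
    errors = []
    if not row.get("product_id"):
        errors.append("Product ID is required")
    if not row.get("product_name"):
        errors.append("Product name is required")
    price = row.get("price")
    if price is not None and not (0 <= price <= 999999):   # NULL price allowed
        errors.append("Price must be >= 0 (if present)")
    if not (0 <= row.get("quantity", 0) <= 999999):
        errors.append("Quantity must be >= 0")
    return errors

rows = [
    {"product_id": "PROD-001", "product_name": "Widget", "price": 9.99, "quantity": 5},
    {"product_id": "", "product_name": "Gadget", "price": -1.0, "quantity": 2},
]

valid   = [r for r in rows if not validate(r)]
invalid = [dict(r, validation_errors=", ".join(validate(r)))
           for r in rows if validate(r)]
print(len(valid), invalid[0]["validation_errors"])
# → 1 Product ID is required, Price must be >= 0 (if present)
```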