MinIO
Hands-on workshops using MinIO as an S3-compatible object store.
Workshop series: PDI + MinIO (S3)
Build hands-on Pentaho Data Integration (PDI) transformations that read from and write to MinIO using VFS.
Workshops get harder as you go. Start with CSV joins. Then move into XML/JSON parsing, reconciliation, and multi-format ingestion.
Workshops in this series
Sales Dashboard (CSV inputs + lookups + output)
Inventory Reconciliation (XML + CSV + variance detection)
Customer 360 (multi-source joins + metrics)
Clickstream Funnel (sessionization + pivoting)
Log Parsing (regex + time-series checks)
Data Lake Ingestion (schema normalization + validation)
You’ll practice
Connecting to MinIO buckets with VFS
Reading and writing objects with pvfs://MinIO/... paths
Joining and enriching streams (lookups and joins)
Parsing XML and JSON
Validating and shaping data for a curated layer
Prerequisites: MinIO running with sample data populated; basic transformation concepts; basic joins and aggregations
Estimated time: 4–6 hours total (each workshop is ~20–60 minutes)
Sales Dashboard
joins, lookups, aggregations
Inventory Reconciliation
XML parsing, outer joins, variance
Customer 360
multi-source, JSONL, calculations
Clickstream Funnel
sessionization, pivoting
Log Parsing
regex, time-series analysis
Data Lake Ingestion
schema normalization, validation
Complete the setup first: Storage: MinIO
Verify that MinIO is running and populated.
Start Pentaho Data Integration (Spoon).
Sales Dashboard
The workshop demonstrates how Pentaho Data Integration enables organizations to rapidly create denormalized fact tables that power real-time business intelligence dashboards. By integrating data from multiple sources (customer data, product catalogs, and sales transactions), business users gain immediate access to actionable insights without waiting for IT to build complex data warehouses.
Scenario: A mid-sized e-commerce company needs to track daily sales performance across products, customer segments, and regions. Currently, sales managers wait 24-48 hours for IT to generate reports from disparate systems. With PDI, they can automate this process and refresh dashboards hourly.
Key Stakeholders:
Sales Directors: Need to identify top-performing products and regions
Marketing Teams: Require customer segmentation for targeted campaigns
Finance: Need accurate revenue reporting by product category
Operations: Must monitor inventory turnover rates
Workshop files
These files are already in MinIO:
pvfs://MinIO/raw-data/csv/sales.csv
pvfs://MinIO/raw-data/csv/products.csv
pvfs://MinIO/raw-data/csv/customers.csv
Output path used later: pvfs://MinIO/staging/dashboard/
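Before opening Spoon, it can help to confirm the inputs are actually in the raw-data bucket. A minimal Python sketch; the endpoint and credentials in the comment are placeholder MinIO defaults, not values from this workshop:

```python
# Expected workshop objects in the raw-data bucket.
EXPECTED = ["csv/sales.csv", "csv/products.csv", "csv/customers.csv"]

def missing_objects(found_keys, expected=EXPECTED):
    """Return the expected object keys that are absent from found_keys."""
    found = set(found_keys)
    return [key for key in expected if key not in found]

# With the third-party 'minio' client (pip install minio), assuming the
# default local endpoint and credentials -- adjust to your deployment:
#   from minio import Minio
#   client = Minio("localhost:9000", access_key="minioadmin",
#                  secret_key="minioadmin", secure=False)
#   keys = [o.object_name for o in client.list_objects("raw-data", recursive=True)]
#   print(missing_objects(keys))   # [] when everything is in place
```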

Create a new transformation.
Use any of these options:
Select File > New > Transformation
Use Ctrl+N (Windows/Linux) or Cmd+N (macOS)
Follow the steps to create the transformation:
Text File Input
The Text File Input step reads data from a variety of text-file types. The most commonly used formats include comma-separated values (CSV) files generated by spreadsheets and fixed-width flat files.
The Text File Input step lets you specify a list of files to read, or a list of directories with wildcards in the form of regular expressions. You can also accept filenames from a previous step, making filename handling even more generic.

VFS connection names are case-sensitive. These examples assume your connection name is MinIO.
Drag & drop 3 Text File Input Steps onto the canvas.
Save the transformation as sales_dashboard_etl.ktr in your workshop folder.
Sales (Order Management)
Double-click on the first TFI step, and configure with the following properties:
Step name
Sales
Filename
pvfs://MinIO/raw-data/csv/sales.csv
Delimiter
,
Header row present
✅
Format
mixed

Click: Get Fields to auto-detect columns.
Business Logic: Note that sale_amount may differ from price * quantity due to:
Volume discounts
Promotional pricing
Customer-specific pricing tiers
Currency conversion (for international sales)

Preview data.

Business Significance:
sale_amount: Actual revenue (may include discounts)
quantity: Volume metrics for demand planning
payment_method: Payment preference insights
status: Filter out cancelled/refunded orders
Products (ERP system)
Double-click on the second TFI step, and configure with the following properties:
Step name
Products
Filename
pvfs://MinIO/raw-data/csv/products.csv
Delimiter
,
Header row present
✅
Format
mixed

Click: Get Fields to auto-detect columns.

Preview the data.

Business Significance:
category: Enables product performance analysis by segment
price: Base pricing for margin calculations
stock_quantity: Inventory turnover insights
Customers (CRM System)
Double-click on the third TFI step, and configure with the following properties:
Step name
Customers
Filename
pvfs://MinIO/raw-data/csv/customers.csv
Delimiter
,
Header row present
✅
Format
mixed

Click: Get Fields to auto-detect columns.

Preview the data.

Business Significance:
customer_id: Primary key for joining to sales
country: Critical for geographic segmentation
status: Identifies churned vs. active customers
registration_date: Enables customer tenure analysis
Stream Lookup
A Stream lookup step enriches rows by looking up matching values from another stream.
In a transformation, you feed your main rows into one hop and a reference dataset into the other hop. The step then matches rows using key fields and returns the lookup fields on the output. It’s the in-memory alternative to a database lookup, but the reference stream must be available in the same transformation flow.
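The steps below are easier to follow if you picture what Stream lookup does internally: it loads the reference stream into an in-memory index, then enriches each main-stream row by key. A minimal Python sketch (field names mirror this workshop; the data is illustrative):

```python
# Sketch of Stream lookup: build an in-memory index from the reference
# stream, then enrich each main-stream row by key.
def stream_lookup(main_rows, lookup_rows, key, fields):
    """Enrich main_rows with selected fields from lookup_rows, matched on key.

    fields maps source field name -> output (possibly renamed) field name.
    """
    index = {row[key]: row for row in lookup_rows}   # reference stream in memory
    out = []
    for row in main_rows:
        match = index.get(row[key], {})
        enriched = dict(row)
        for src, dst in fields.items():              # retrieve + rename fields
            enriched[dst] = match.get(src)           # None when no match found
        out.append(enriched)
    return out

# Illustrative rows shaped like sales.csv and products.csv:
sales = [{"sale_id": 1, "product_id": "PROD-001", "quantity": 2}]
products = [{"product_id": "PROD-001", "product_name": "Widget",
             "category": "Tools", "price": 9.99}]
rows = stream_lookup(sales, products, "product_id",
                     {"product_name": "product_name",
                      "category": "product_category",
                      "price": "unit_price"})
print(rows[0]["product_category"])  # → Tools
```

Because the whole reference stream is held in memory, Stream lookup suits small-to-medium dimension data; a database lookup is the alternative for very large reference sets.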

Drag & drop 2 Stream lookup steps onto the canvas.
Save transformation as:
sales_dashboard_etl.ktrin your workshop folder.
Product Lookup
Draw a hop between Sales and Product Lookup.
Draw a hop between Products and Product Lookup.
The Sales stream acts as our fact table. It holds the transaction data for our products and customers.
Double-click on the 'Product Lookup' step, and configure with the following properties:
General
Step name
Product Lookup
General
Lookup step
Products
Keys
Field (from Sales)
product_id
Keys
Field (from Products)
product_id
In Values to retrieve, add:
product_name (rename to product_name)
category (rename to product_category)
price (rename to unit_price)

Customers Lookup
Draw a hop between Product Lookup and Customers Lookup.
Draw a hop between Customers and Customers Lookup.
Double-click Customers Lookup, and configure the following properties:
Step name
Customers Lookup
Lookup step
Customers
Key field (stream)
customer_id
Key field (lookup)
customer_id
Values to retrieve:
first_name
last_name
country (rename to customer_country)
status (rename to customer_status)

Preview data
Save the transformation.
RUN & Preview the data.

Calculator
The Calculator step provides predefined functions that you can run on input field values. Use Calculator as a quick alternative to custom JavaScript for common calculations.
To use Calculator, specify the input fields and the calculation type, and then write results to new fields. You can also remove temporary fields from the output after all values are calculated.

Drag & drop a 'Calculator' step onto the canvas.
Draw a Hop from the 'Customers Lookup' step to the 'Calculator' step.
Double-click on the 'Calculator' step, and configure the following properties:
New field | Calculation | Field A | Field B | Value type
line_total | A * B | quantity | unit_price | Number
discount_amount | A - B | line_total | sale_amount | Number

Preview data
Save the transformation.
RUN & Preview the data.

Business Insight Enabled:
Positive discount_amount: Customer received a discount (common)
Negative discount_amount: Customer paid more than list price (expedite, premium, etc.)
Zero discount_amount: Sold at list price
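The Calculator rows can be sketched in plain Python. This assumes the sign convention above, where a positive discount_amount means the customer paid less than list price; the row values are illustrative:

```python
# Illustrative row; field names match the stream at this point.
row = {"quantity": 3, "unit_price": 25.00, "sale_amount": 67.50}

row["line_total"] = row["quantity"] * row["unit_price"]           # A * B
# Assumed convention: list-price total minus actual sale amount, so a
# positive result means the customer received a discount.
row["discount_amount"] = row["line_total"] - row["sale_amount"]   # A - B

print(row["line_total"], row["discount_amount"])  # → 75.0 7.5
```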
Formula
The Formula step can calculate Formula Expressions within a data stream. It can be used to create simple calculations like [A]+[B] or more complex business logic with a lot of nested if / then logic.

Drag & drop a 'Formula' step onto the canvas.
Draw a Hop from the 'Calculator' step to the 'Formula' step.
Double-click on the 'Formula' step, and configure the following properties:
customer_full_name
CONCATENATE([first_name];" ";[last_name])
is_high_value
IF([sale_amount]>500;"Yes";"No")
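The two Formula expressions map directly onto ordinary string and conditional logic. A sketch with illustrative values:

```python
# Illustrative field values for one row of the stream.
first_name, last_name, sale_amount = "Jane", "Doe", 725.00

# CONCATENATE([first_name];" ";[last_name])
customer_full_name = f"{first_name} {last_name}"
# IF([sale_amount]>500;"Yes";"No")
is_high_value = "Yes" if sale_amount > 500 else "No"

print(customer_full_name, is_high_value)  # → Jane Doe Yes
```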

Preview data
Save the transformation.
RUN & Preview the data.

Business Applications:
is_high_value: Trigger VIP customer service workflows
Add Constants
The Add constant values step is a simple and high performance way to add constant values to the stream.

Drag & drop 'Add constants' step onto the canvas.
Draw a Hop from the 'Formula' step to the 'Add constants' step.
Double-click on the 'Add constants' step, and configure the following properties:
data_source
String
minio_workshop

Get system info
This step retrieves system information from the Kettle environment. The step includes a table where you can designate a name and assign it to any available system info type you want to retrieve. This step generates a single row with the fields containing the requested information.
It can also accept any number of input streams, aggregate any fields defined by this step, and send the combined results to the output stream.

Drag & drop 'Get system info' step onto the canvas.
Draw a Hop from the 'Add constants' step to the 'Get system info' step.
Double-click on the Get system info step, and configure the following properties:
etl_timestamp
system date (variable)

Select Values
The Select Values step can perform all the following actions on fields in the PDI stream:
Select fields - Use this tab to select, reorder, and rename the fields sent to the output stream.
Remove fields - Use this tab to remove fields from the input stream.
Meta-data - Use this tab to change field types, lengths, and formats.

Drag & drop a 'Select values' step onto the canvas.
Draw a Hop from the 'Get system info' step to the 'Select values' step.
Double-click on the 'Select values' step, and configure the following properties:
On Select & Alter tab, choose fields in order:
sale_id
sale_date
customer_id
customer_full_name
customer_country
customer_status
product_id
product_name
product_category
quantity
unit_price
sale_amount
line_total
discount_amount
is_high_value
payment_method
status (rename to sale_status)
etl_timestamp
data_source

Preview data
Save the transformation.
RUN & Preview the data.

Text file output
The Text File Output step exports rows to a text file.
This step is commonly used to generate delimited files (for example, CSV) that can be read by spreadsheet applications, and it can also generate fixed-length output.
You can’t run this step in parallel to write to the same file.
If you need to run multiple copies, select Include stepnr in filename and merge the resulting files afterward.

Drag & drop a Text file output step onto the canvas.
Draw a Hop from the 'Select values' step to the 'Write to staging' step.
Double-click on the 'Write to staging' step, and configure with the following properties:
Step name
Write to Staging
Filename
pvfs://MinIO/staging/dashboard/sales_fact
Extension
csv
Include date/time in filename
✅
Separator
,
Add header
✅
Select Get fields to populate the output fields.
Business Benefit: Timestamped files enable:
Historical tracking: "What did the data look like last Tuesday?"
Incremental processing: Process the latest file without overwriting history
Rollback capability: "The 3pm run had bad data, revert to 2pm version"
MinIO
Save the transformation.
Log into MinIO and verify the timestamped output file appears under staging/dashboard/:

Checklist
Inventory Reconciliation - XML + CSV Integration
This workshop demonstrates how Pentaho Data Integration eliminates costly inventory discrepancies by automatically reconciling data between warehouse management systems (XML feeds) and ERP product catalogs (CSV files). Organizations lose millions annually due to inventory inaccuracies, stockouts, and overstocking. PDI's ability to parse complex XML and perform full outer joins enables real-time discrepancy detection that would require hours of manual spreadsheet work.
Business Value Delivered:
Cost Reduction: Eliminate manual reconciliation labor ($75K-150K annually per analyst)
Inventory Optimization: Reduce excess inventory carrying costs by 15-25%
Stockout Prevention: Identify missing items before customers notice
Compliance: Audit trail for SOX, ISO 9001, and supply chain regulations
Real-Time Visibility: Know your actual inventory position within minutes, not days
Scenario: A manufacturing company operates 12 distribution warehouses. Each warehouse uses a legacy WMS (Warehouse Management System) that exports XML inventory files nightly. The corporate ERP system maintains a CSV product master catalog. Discrepancies cause:
Phantom stock: ERP shows item in stock, warehouse says it's not → Lost sales
Ghost inventory: Warehouse has items ERP doesn't recognize → Dead capital
Quantity variances: Mismatches of 10+ units trigger expensive physical counts
Key Stakeholders:
Supply Chain Directors: Need accurate inventory positions across all locations
Warehouse Managers: Require daily reconciliation reports to prioritize cycle counts
Finance Teams: Must report accurate inventory valuations for financial statements
Procurement: Need to identify slow-moving items and prevent overstocking
Workshop files
These files are already in MinIO:
pvfs://MinIO/raw-data/xml/inventory.xml
pvfs://MinIO/raw-data/csv/products.csv
Planned output path: pvfs://MinIO/staging/inventory/reconciliation/

Create a new transformation.
Use any of these options:
Select File > New > Transformation
Use Ctrl+N (Windows/Linux) or Cmd+N (macOS)
Follow the steps to create the transformation:
Drag & drop 'Get data from XML' onto the canvas.
Save the transformation as inventory_reconciliation.ktr in your workshop folder.
Double-click on the 'Get data from XML' step, and configure with the following properties:
Step name
Read Warehouse XML
File or directory
pvfs://MinIO/raw-data/xml/inventory.xml
Loop XPath
/inventory/items/item
Encoding
UTF-8
Ignore comments
✅
Validate XML
No
Ignore empty file
✅
XPath Explanation:
/inventory = Start at the root element
/items = Navigate to the items container
/item = Loop over each item element
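For intuition, the same loop can be reproduced with Python's stdlib XML parser. The two-item document below is a made-up sample shaped like the workshop file, and the warehouse_ field names follow the remapping convention used in this workshop:

```python
# Sketch: what the Loop XPath /inventory/items/item does -- iterate over
# each <item> and read its child elements as fields.
import xml.etree.ElementTree as ET

xml_doc = """
<inventory>
  <items>
    <item><sku>PROD-001</sku><name>Widget</name>
          <quantity>40</quantity><location>A-12</location></item>
    <item><sku>PROD-002</sku><name>Gadget</name>
          <quantity>15</quantity><location>B-07</location></item>
  </items>
</inventory>
"""

root = ET.fromstring(xml_doc)
rows = []
for item in root.findall("./items/item"):   # loop over each <item> element
    rows.append({
        "warehouse_item_name": item.findtext("name"),
        "warehouse_quantity": int(item.findtext("quantity")),
        "warehouse_location": item.findtext("location"),
    })
print(len(rows), rows[0]["warehouse_item_name"])  # → 2 Widget
```

Note that field XPaths in the PDI step are relative to the loop node, just as `findtext("name")` is relative to each `<item>` here.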
Browse & Add the path to the inventory.xml
Click on the Content tab

Click on the Fields tab & Get Fields.
Remap the fields & Preview rows.
Business Field Naming:
Prefix with warehouse_ to distinguish from ERP fields later.
warehouse_quantity vs. stock_quantity makes joins clearer.
Keep original field names in a data dictionary for auditing.
warehouse_item_name
name
warehouse_quantity
quantity
warehouse_location
location
last_physical_count
last_checked

Next: configure the product catalog input, then join the two streams.
Status: Draft. Add a Text file input step for pvfs://MinIO/raw-data/csv/products.csv.
Status: Draft. Join warehouse items to the ERP product master using a full outer join.
Status: Draft. Write discrepancy rows to pvfs://MinIO/staging/inventory/reconciliation/.
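The draft join-and-variance steps can be sketched ahead of time. This pure-Python stand-in models the full outer join and the three discrepancy classes described above (PDI's Merge join step would do the actual join); the quantities and the 10-unit threshold are illustrative:

```python
# Sketch of the reconciliation: full outer join on product_id, then
# classify each key into a discrepancy category.
warehouse = {"PROD-001": 40, "PROD-003": 5}    # id -> warehouse_quantity
erp       = {"PROD-001": 52, "PROD-002": 10}   # id -> stock_quantity

report = []
for pid in sorted(set(warehouse) | set(erp)):  # union of keys = full outer join
    w, e = warehouse.get(pid), erp.get(pid)
    if w is None:
        status = "phantom_stock"      # ERP only: lost-sales risk
    elif e is None:
        status = "ghost_inventory"    # warehouse only: dead capital
    elif abs(w - e) >= 10:
        status = "variance"           # mismatch triggers a physical count
    else:
        status = "ok"
    report.append((pid, w, e, status))

for entry in report:
    print(entry)
```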
Customer 360
Create unified customer profiles combining demographic data, purchase history, and behavioral events.
Skills: Multiple joins, JSONL parsing, aggregations, calculated metrics

Workshop files
Current draft inputs (already in MinIO):
pvfs://MinIO/raw-data/csv/customers.csv
pvfs://MinIO/raw-data/csv/sales.csv
Status: Draft. This workshop is incomplete.
Create a new transformation.
Use any of these options:
Select File > New > Transformation
Use Ctrl+N (Windows/Linux) or Cmd+N (macOS)
Drag & drop 'Text file input' steps onto the canvas.
Save the transformation as customer_360.ktr in your workshop folder.
Sales (Order Management)
Double-click on the first TFI step, and configure with the following properties:
Step name
Sales
Filename
pvfs://MinIO/raw-data/csv/sales.csv
Delimiter
,
Header row present
✅
Format
mixed

Click: Get Fields to auto-detect columns.
Business Logic: Note that sale_amount may differ from price * quantity due to:
Volume discounts
Promotional pricing
Customer-specific pricing tiers
Currency conversion (for international sales)

Preview data.

Business Significance:
sale_amount: Actual revenue (may include discounts)
quantity: Volume metrics for demand planning
payment_method: Payment preference insights
status: Filter out cancelled/refunded orders
Drag & drop 'Sort rows' steps onto the canvas.
Create a Hop between the 'Sales' step & 'Sort rows'.
Double-click on 'Sort rows' and configure the sort keys.
Status: Draft. Define sales-level aggregations (for example, total spend per customer).
Status: Draft. Join the customer, sales, and user event streams.
Status: Draft. Create one row per customer and write to pvfs://MinIO/staging/customer360/.
Transactions & Fraud Detection
Objective: Process credit card transactions, enrich with account and merchant data, calculate transaction metrics, and detect suspicious patterns using rule-based fraud detection.
Skills: Financial data processing, multi-table joins, running totals, rule-based fraud detection, transaction velocity analysis
Business Context: A payment processor needs to analyze transaction data in real-time to detect potentially fraudulent activity before authorizing transactions. The system must flag high-risk transactions based on amount thresholds, unusual merchant activity, account balance checks, and transaction velocity patterns.
Workshop files
Status: Draft. Sample files are not published yet.
Status: Draft. Steps coming soon.
Data Lake Ingestion
Modern data lakes often receive the same entities (products, customers, orders) from multiple sources in different formats. This workshop demonstrates how to ingest, normalize, validate, and deduplicate multi-format data into a unified schema - a common data engineering pattern.
Objective: Combine data from CSV, JSON, and XML into a unified product schema.
Skills: Multi-format parsing, schema normalization, data validation, deduplication
Workshop files
These files are already in MinIO:
pvfs://MinIO/raw-data/csv/products.csvpvfs://MinIO/raw-data/json/api_response.jsonpvfs://MinIO/raw-data/xml/inventory.xml
Create a new transformation.
Use any of these options:
Select File > New > Transformation
Use Ctrl+N (Windows/Linux) or Cmd+N (macOS)
Define Target Schema
Objective: Design a unified schema that accommodates all source formats.
Why Important: Before ingesting data, you need a clear target schema. This ensures consistency across all sources and makes downstream analytics easier.
Field | Type | Length | Description | Source mapping
product_id | String | 50 | Unique product identifier | CSV: product_id, JSON: product_id, XML: sku
product_name | String | 200 | Product display name | CSV: product_name, JSON: product_name, XML: name
category | String | 100 | Product category | CSV: category, JSON: (derived from order type), XML: category
price | Number | 15,2 | Unit price in USD | CSV: price, JSON: unit_price, XML: null (not available)
quantity | Integer | 10 | Available stock quantity | CSV: stock_quantity, JSON: quantity, XML: quantity
source_system | String | 10 | Origin system identifier | Constant: 'csv', 'json', or 'xml'
ingestion_time | Timestamp | - | When record was ingested | System timestamp
Schema Discovery & Analysis
Objective: Understand each source structure before you design the target schema.
Why it matters: You can’t normalize what you haven’t inspected.
Inspect each Data Source
Use real samples. Avoid guessing field names.
Findings (CSV)
Has product_id, product_name, category, price, stock_quantity.
Completeness looks high.
Naming is consistent and explicit.
Findings (JSON)
Has product_id and product_name.
Uses unit_price instead of price.
quantity is order quantity, not stock.
category is missing.
Path is $.data.orders[*].items[*].
Findings (XML)
Uses sku for product_id.
Uses name for product_name.
Has category and warehouse quantity.
price is missing.
location is extra for a product master.
Build a field mapping matrix
This shows name differences and missing fields.
Field | CSV | JSON | XML | Notes
Identifier | product_id | product_id | sku | Same meaning. Different name in XML.
Name | product_name | product_name | name | Same meaning. Different name in XML.
Category | category | ❌ | category | Missing in JSON.
Price | price | unit_price | ❌ | Different name in JSON. Missing in XML.
Stock quantity | stock_quantity | quantity | quantity | JSON quantity is not stock.
What to watch
Missing data is normal in multi-source ingestion.
Same name can mean different things.
Make schema decisions
Write these down. You will forget them later.
Field names
Use CSV naming as the standard.
Map XML sku → product_id and name → product_name.
Map JSON unit_price → price.
Missing fields
Missing category in JSON: set a default like E-commerce.
Missing price in XML: leave NULL.
Data types
product_id: string. It contains a PROD- prefix.
product_name: string. Allow up to 200 chars.
category: string. Allow up to 100 chars.
price: decimal(15,2).
quantity: integer.
Metadata
Add source_system for lineage.
Add ingestion_time for auditability.
Define a deduplication rule
Same product_id can appear in multiple sources.
Recommended rule
Prefer CSV.
Then JSON.
Then XML.
Implement this with source_priority (CSV=1, JSON=2, XML=3).
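The priority rule can be sketched as a stable keep-first dedup: sort by source_priority, then keep the first row seen per product_id. The rows here are illustrative:

```python
# Dedup rule: prefer CSV, then JSON, then XML, per product_id.
PRIORITY = {"csv": 1, "json": 2, "xml": 3}   # source_priority values

rows = [
    {"product_id": "PROD-001", "source_system": "xml"},
    {"product_id": "PROD-001", "source_system": "csv"},
    {"product_id": "PROD-002", "source_system": "json"},
]

best = {}
for row in sorted(rows, key=lambda r: PRIORITY[r["source_system"]]):
    best.setdefault(row["product_id"], row)   # first (highest-priority) wins

deduped = list(best.values())
print(sorted((r["product_id"], r["source_system"]) for r in deduped))
# → [('PROD-001', 'csv'), ('PROD-002', 'json')]
```

In PDI the same effect comes from adding a source_priority field, sorting on product_id and source_priority, and keeping the first row per key with Unique rows.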
Checklist
You inspected real records for each source.
You captured paths for nested formats.
You documented mappings and type choices.
You decided how to handle missing data.
You decided how to dedupe collisions.
Path convention used below: pvfs://MinIO/...
MinIO is the VFS connection name. It must match your connection exactly.
Ingest CSV products
Goal: Read products.csv and map it to the unified schema.
Path: pvfs://MinIO/raw-data/csv/products.csv
Add a Text file input step.
Step name: Read CSV Products
File/directory: pvfs://MinIO/raw-data/csv/products.csv
Separator: ,
Enclosure: " (double quote)
Header row present: enabled
On Fields, select Get Fields.
Add a Select values step.
Step name: Map CSV to Target Schema
Rename stock_quantity → quantity
Add Add constants.
Step name: Add CSV Metadata
Add field source_system = csv
Add Get System Info.
Step name: Add Ingestion Timestamp
Add field ingestion_time = system date (variable)
Preview check
Expected rows: 12
product_id, product_name, category should be populated.
Ingest JSON order items
Goal: Extract product fields from nested JSON order items.
Path: pvfs://MinIO/raw-data/json/api_response.json
quantity in JSON is order quantity, not stock quantity.
Keep it as quantity only if that’s what you want to model.
Add a JSON Input step.
Step name: Read JSON Products
File: pvfs://MinIO/raw-data/json/api_response.json
Ignore empty file: enabled
On Fields, use explicit JSONPaths (recommended):
product_id: $.data.orders[*].items[*].product_id
product_name: $.data.orders[*].items[*].product_name
unit_price: $.data.orders[*].items[*].unit_price
quantity: $.data.orders[*].items[*].quantity
Alternative approach (base path + relative field paths)
If your PDI build supports a base "Path" for the JSON Input step, set:
Base path: $.data.orders[*].items[*]
Then set field paths relative to the base:
product_id: product_id
product_name: product_name
unit_price: unit_price
quantity: quantity
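Whichever path style you use, the extraction flattens every item of every order into one row stream. A stdlib sketch with a made-up two-order document shaped like the assumed source layout (data → orders → items):

```python
# Sketch: what $.data.orders[*].items[*] produces -- one row per item,
# across all orders.
import json

doc = json.loads("""
{"data": {"orders": [
  {"order_id": "ORD-1",
   "items": [{"product_id": "PROD-001", "product_name": "Widget",
              "unit_price": 9.99, "quantity": 2}]},
  {"order_id": "ORD-2",
   "items": [{"product_id": "PROD-002", "product_name": "Gadget",
              "unit_price": 4.50, "quantity": 1}]}
]}}
""")

# Flatten the nested orders/items arrays into a single row stream.
items = [item for order in doc["data"]["orders"] for item in order["items"]]
print(len(items), items[0]["unit_price"])  # → 2 9.99
```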
Add a Select values step.
Step name: Map JSON to Target Schema
Rename unit_price → price
Add Add constants.
Step name: Add JSON Metadata
source_system = json
category = E-commerce (default)
Add Get System Info.
Step name: Add JSON Ingestion Timestamp
ingestion_time = system date (variable)
Preview check
Expected rows: ~10–15 (can vary with the sample file).
product_name should not be NULL.
Ingest XML inventory items
Goal: Extract inventory items from XML using XPath.
Path: pvfs://MinIO/raw-data/xml/inventory.xml
Add Get data from XML.
Step name: Read XML Products
File: pvfs://MinIO/raw-data/xml/inventory.xml
Loop XPath: /inventory/items/item
On Fields, add:
sku (String)
name (String)
category (String)
quantity (Integer)
Field XPaths are relative to the loop node.
Example: sku means “read the <sku> element under each <item>”.
Add a Select values step.
Step name: Map XML to Target Schema
Rename sku → product_id
Rename name → product_name
Add a new field price in Meta-data (type Number). Leave it empty (NULL).
Add Add constants.
Step name: Add XML Metadata
source_system = xml
Add Get System Info.
Step name: Add XML Ingestion Timestamp
ingestion_time = system date (variable)
Preview check
Expected rows: ~8–10
If you get 0 rows, re-check the Loop XPath.
Merge Streams
Objective: Merge all three data streams (CSV, JSON, XML) into one unified stream.
Why Append Streams: This step stacks all rows from different sources vertically - like a SQL UNION ALL.
Configuration:
Add Append streams step
Name: "Combine All Products"
Connect all three streams to this step:
"Add Ingestion Timestamp" (CSV branch) → Append streams
"Add JSON Ingestion Timestamp" (JSON branch) → Append streams
"Add XML Ingestion Timestamp" (XML branch) → Append streams
Important: All input streams MUST have the same fields with the same names and types:
product_id (String)
product_name (String)
category (String)
price (Number) - can be null
quantity (Integer)
source_system (String)
ingestion_time (Timestamp)
Expected Output:
Row count: ~30-35 rows (12 CSV + 10-15 JSON + 8-10 XML)
All products from all sources combined
Some products will appear multiple times (duplicates to be handled in Step 7)
Preview Check:
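Semantically, Append streams is a plain concatenation, which is why the field layouts must match exactly. A sketch with one illustrative row per branch:

```python
# Sketch of Append streams: UNION ALL semantics -- rows stacked
# vertically, no deduplication, identical field names on every branch.
csv_rows  = [{"product_id": "PROD-001", "price": 9.99, "source_system": "csv"}]
json_rows = [{"product_id": "PROD-002", "price": 4.50, "source_system": "json"}]
xml_rows  = [{"product_id": "PROD-003", "price": None, "source_system": "xml"}]

combined = csv_rows + json_rows + xml_rows   # UNION ALL: duplicates survive

# Every row must expose the same field set, or the append is invalid.
assert all(set(r) == {"product_id", "price", "source_system"} for r in combined)
print(len(combined))  # → 3
```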
Data Validation
Objective: Validate data quality and route bad records to error handling.
Why Important: Multi-source data often has quality issues. Better to catch and handle them explicitly than have them cause downstream failures.
Configuration:
Add Data Validator step
Name: "Validate Product Data"
Validations tab - Add validation rules:
Fieldname | Validation Type | Configuration | Error Message
product_id | NOT NULL | | Product ID is required
product_id | NOT EMPTY STRING | | Product ID cannot be empty
product_name | NOT NULL | | Product name is required
product_name | NOT EMPTY STRING | | Product name cannot be empty
price | NUMERIC RANGE | Min: 0, Max: 999999 | Price must be >= 0 (if present)
quantity | NUMERIC RANGE | Min: 0, Max: 999999 | Quantity must be >= 0
Options tab:
☑ Concatenate errors: Shows all validation errors for a row
Separator: , (comma-space)
☑ Output all errors as one field: validation_errors
Add Filter rows step after Data Validator
Name: "Route Valid vs Invalid"
Condition:
True (valid records) → Continue to deduplication
False (invalid records) → Error output
Add Text file output for errors (connect from False branch):
Name: "Write Error Records"
Filename:
pvfs://MinIO/curated/products/errors/validation_errors_${Internal.Job.Start.Date.yyyyMMdd}.csv
Include date in filename: Helps track when errors occurred
Fields to output: All fields + validation_errors
Expected Output:
Valid records: ~95-100% should pass (25-35 rows)
Invalid records: 0-5% to error file (0-2 rows)
Common Validation Failures:
Empty product_id or product_name
Negative price or quantity values
Non-numeric values in numeric fields
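The validator rules plus the Filter rows split can be sketched as one function that returns the concatenated error list per row. The sample rows and exact error texts follow the table above; the data is illustrative:

```python
# Sketch of Data Validator + Filter rows: collect all rule violations
# per row, then route rows with any errors to the error output.
def validate(row):
    errors = []
    if not row.get("product_id"):
        errors.append("Product ID is required")
    if not row.get("product_name"):
        errors.append("Product name is required")
    price = row.get("price")
    if price is not None and not (0 <= price <= 999999):   # NULL price allowed
        errors.append("Price must be >= 0 (if present)")
    if not (0 <= row.get("quantity", 0) <= 999999):
        errors.append("Quantity must be >= 0")
    return errors

rows = [
    {"product_id": "PROD-001", "product_name": "Widget", "price": 9.99, "quantity": 5},
    {"product_id": "", "product_name": "Gadget", "price": -1.0, "quantity": 2},
]

valid   = [r for r in rows if not validate(r)]
invalid = [dict(r, validation_errors=", ".join(validate(r)))
           for r in rows if validate(r)]
print(len(valid), invalid[0]["validation_errors"])
# → 1 Product ID is required, Price must be >= 0 (if present)
```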