Storage
Set up Object Stores & SMB
Object Stores
Object storage systems like Amazon S3 and MinIO provide a way to store and retrieve large amounts of unstructured data such as files, images, videos, and backups through a simple web-based API. Unlike traditional file systems that organize data in hierarchical folders, object stores use a flat namespace where each piece of data (called an object) is stored in containers called buckets and accessed via unique keys or URLs.
Amazon S3 is AWS's flagship object storage service that offers virtually unlimited scalability, multiple storage classes for different use cases, and integration with other AWS services.
MinIO is an open-source alternative that provides S3-compatible APIs and can be deployed on-premises or in private clouds, making it popular for organizations that want object storage capabilities without vendor lock-in.
Both systems are designed for high durability, availability, and can handle massive scale while providing simple REST API access for applications to store and retrieve data programmatically.
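A quick sketch of the flat-namespace idea: a bucket behaves like a key-to-object map, and "folders" are simulated purely by key prefixes (the bucket contents below are invented for illustration).

```python
# A bucket is a flat key -> object map; "folders" exist only as key prefixes.
bucket = {
    "raw-data/csv/sales.csv": b"sale_id,sale_date,...",
    "raw-data/csv/products.csv": b"product_id,product_name,...",
    "staging/dashboard/sales_fact.csv": b"...",
}

def list_objects(bucket, prefix):
    # Mimics an S3 ListObjects call with a Prefix: plain string matching,
    # no directory traversal, because there are no real directories.
    return sorted(key for key in bucket if key.startswith(prefix))

print(list_objects(bucket, "raw-data/csv/"))
```

Real object stores expose the same idea through their REST APIs; the key is always the full path-like string, never a nested folder.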

The following steps set up a Pentaho Lab environment and must be completed before starting the Workshops.
Ensure you have downloaded the Workshop Installation:
To install git:
Prerequisites
Ubuntu 24.04 LTS system (physical or virtual machine)
User account with sudo privileges
Internet connection
Basic familiarity with Linux command line
MinIO
Follow the instructions below to set up a MinIO Docker container.
Select your OS, add the sample data, and finally configure a VFS connection in Data Integration:
Installs and configures MinIO on Ubuntu 24.04 running in Docker.
Create a MinIO folder and copy the required files.
Create directory & copy

Ensure all the files have successfully been copied over.
Execute the docker-compose script to create the container.
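The actual docker-compose file ships with the workshop download; if you need to recreate it, a minimal sketch looks like the following (the image tag, ports, credentials, and volume path are assumptions, so adapt them to your copy):

```yaml
# Minimal MinIO compose sketch -- adjust to match the workshop's provided file.
services:
  minio:
    image: minio/minio:latest
    container_name: minio
    ports:
      - "9000:9000"   # S3 API
      - "9001:9001"   # web console
    environment:
      MINIO_ROOT_USER: minioadmin
      MINIO_ROOT_PASSWORD: minioadmin
    volumes:
      - ./data:/data
    command: server /data --console-address ":9001"
```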
MinIO Container

Following industry best practices, MinIO is installed as root in the /opt/minio directory. If you want the pentaho user to manage the service as well, you may need to add that user to the docker group.
Check the container is up and running in Docker.

Log into MinIO.
Username: minioadmin
Password: minioadmin
If you have completed the MinIO setup, you should have pre-populated buckets containing various data objects in different formats.

New Bucket
If you need to create a Bucket:
Click the 'Create Bucket' link.
Enter: sales-data and click 'Create Bucket'.

Click on the Upload button.

Upload your data - for example some sales data:
Windows - PowerShell
Linux
Workshops
Pentaho Data Integration: MinIO Object Storage Workshop Series
Modern organizations increasingly store their data in cloud-native object storage systems like MinIO and Amazon S3, moving away from traditional file servers and databases. This architectural shift enables scalable, cost-effective data lakes but introduces new challenges: data arrives in multiple formats (CSV, JSON, XML, Parquet), exists across distributed buckets, and requires sophisticated transformation pipelines to unlock its analytical value. Learning to efficiently extract, transform, and integrate data from object storage is now essential for any data integration professional working with contemporary data architectures.
In this comprehensive workshop series, you'll build progressively complex transformation pipelines that leverage MinIO object storage as both a source and destination for enterprise data integration scenarios. Starting with fundamental ETL patterns like denormalized fact table creation, you'll advance through intermediate challenges involving multi-format parsing and reconciliation, ultimately mastering advanced techniques like sessionization, anomaly detection, and schema normalization across heterogeneous data sources.
Each workshop introduces real-world business scenarios - from sales dashboards to customer analytics to operational monitoring - demonstrating PDI's versatility in solving diverse integration challenges while maintaining cloud-native architecture principles.
What You'll Accomplish:
Configure VFS (Virtual File System) connections to access S3-compatible MinIO object storage
Build multi-source ETL pipelines using Text File Input steps with S3 paths (s3a://)
Implement Stream Lookup and Merge Join patterns to enrich data from multiple CSV sources
Parse semi-structured formats including XML inventory feeds and JSONL event streams
Apply full outer joins to identify discrepancies between warehouse and catalog systems
Aggregate customer data across transactional, demographic, and behavioral dimensions
Perform sessionization and funnel analysis on clickstream data using Group By and pivoting
Extract structured data from unstructured logs using Regular Expression evaluation
Detect anomalies in time-series data through rolling averages and conditional logic
Normalize schemas across CSV, JSON, and XML sources into unified data lake structures
Implement data validation, deduplication, and quality controls for multi-format ingestion
Calculate derived metrics including customer lifetime value, engagement scores, and conversion rates
Route data dynamically using Switch/Case and Filter Rows for conditional processing
Output transformed data to staging and curated layers following data lake architecture patterns
By the end of this workshop series, you'll have mastered the complete spectrum of cloud-native data integration patterns using Pentaho Data Integration. You'll understand how to handle diverse source formats, implement sophisticated join and aggregation logic, perform advanced text parsing and time-series analysis, and build production-ready pipelines that leverage object storage for scalable, distributed data processing.
Instead of treating each data format as a unique challenge requiring custom scripts, you'll confidently design reusable, visual transformation workflows that automate complex integration scenarios - from operational reconciliation to customer intelligence to real-time anomaly detection.
Prerequisites: MinIO running with sample data populated; basic understanding of transformation concepts (steps, hops, preview); familiarity with joins and aggregations
Estimated Time: 4-6 hours total (individual workshops range from 20-60 minutes based on complexity)
Sales Dashboard
joins, lookups, aggregations
Inventory Reconciliation
XML parsing, outer joins, variance
Customer 360
multi-source, JSONL, calculations
Clickstream Funnel
sessionization, pivoting
Log Parsing
regex, time-series analysis
Data Lake Ingestion
schema normalization, validation
1. Verify that MinIO is running and populated.
Start Pentaho Data Integration.
Windows - PowerShell:
Linux:
Workshops
Sales Dashboard
The workshop demonstrates how Pentaho Data Integration enables organizations to rapidly create denormalized fact tables that power real-time business intelligence dashboards. By integrating data from multiple sources (customer data, product catalogs, and sales transactions), business users gain immediate access to actionable insights without waiting for IT to build complex data warehouses.
Scenario: A mid-sized e-commerce company needs to track daily sales performance across products, customer segments, and regions. Currently, sales managers wait 24-48 hours for IT to generate reports from disparate systems. With PDI, they can automate this process and refresh dashboards hourly.
Key Stakeholders:
Sales Directors: Need to identify top-performing products and regions
Marketing Teams: Require customer segmentation for targeted campaigns
Finance: Need accurate revenue reporting by product category
Operations: Must monitor inventory turnover rates

Follow the steps to create the transformation:
Text File Input
The Text File Input step is used to read data from a variety of different text-file types. The most commonly used formats include Comma Separated Values (CSV files) generated by spreadsheets and fixed width flat files.
The Text File Input step provides you with the ability to specify a list of files to read, or a list of directories with wildcards in the form of regular expressions. In addition, you can accept filenames from a previous step, making filename handling even more generic.

Drag & drop 3 Text File Input Steps onto the canvas.
Save transformation as:
sales_dashboard_etl.ktr in your workshop folder.
Sales (Order Management)
Double-click on the first TFI step, and configure with the following properties:
Step name
Sales
Filename
pvfs://MinIO/raw-data/csv/sales.csv
Delimiter
,
Header row present
✅
Format
mixed

Click: Get Fields to auto-detect columns.
Business Logic: Note that sale_amount may differ from price * quantity due to:
Volume discounts
Promotional pricing
Customer-specific pricing tiers
Currency conversion (for international sales)

Preview data.

Business Significance:
sale_amount: Actual revenue (may include discounts)
quantity: Volume metrics for demand planning
payment_method: Payment preference insights
status: Filter out cancelled/refunded orders
Products (ERP system)
Double-click on the second TFI step, and configure with the following properties:
Step name
Products
Filename
pvfs://MinIO/raw-data/csv/products.csv
Delimiter
,
Header row present
✅
Format
mixed

Click: Get Fields to auto-detect columns.

Preview the data.

Business Significance:
category: Enables product performance analysis by segment
price: Base pricing for margin calculations
stock_quantity: Inventory turnover insights
Customers (CRM System)
Double-click on the third TFI step, and configure with the following properties:
Step name
Customers
Filename
pvfs://MinIO/raw-data/csv/customers.csv
Delimiter
,
Header row present
✅
Format
mixed

Click: Get Fields to auto-detect columns.

Preview the data.

Business Significance:
customer_id: Primary key for joining to sales
country: Critical for geographic segmentation
status: Identifies churned vs. active customers
registration_date: Enables customer tenure analysis
Stream Lookup
A Stream lookup step enriches rows by looking up matching values from another stream.
In a transformation, you feed your main rows into one hop and a reference dataset into the other hop. The step then matches rows using key fields and returns the lookup fields on the output. It’s the in-memory alternative to a database lookup, but the reference stream must be available in the same transformation flow.
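Conceptually, the Stream lookup step builds an in-memory table from the reference stream and probes it for every main-stream row. A minimal Python sketch of that behaviour (field names follow this workshop's Sales and Products data; the values are invented):

```python
# Main stream (Sales rows) and reference stream (Products rows).
sales = [
    {"sale_id": 1, "product_id": "PROD-001", "quantity": 2},
    {"sale_id": 2, "product_id": "PROD-002", "quantity": 1},
]
products = [
    {"product_id": "PROD-001", "product_name": "Widget", "price": 9.99},
    {"product_id": "PROD-002", "product_name": "Gadget", "price": 24.5},
]

# Build the in-memory lookup table once, keyed on the join field.
lookup = {p["product_id"]: p for p in products}

# Enrich each main row with the retrieved fields (renamed as in the step).
enriched = []
for row in sales:
    match = lookup.get(row["product_id"], {})
    out = dict(row)
    out["product_name"] = match.get("product_name")
    out["unit_price"] = match.get("price")  # rename: price -> unit_price
    enriched.append(out)

print(enriched[0])
```

Unmatched keys simply yield empty lookup fields, which mirrors the step's default behaviour of returning null values rather than dropping rows.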

Drag & drop 2 'Stream lookup' steps onto the canvas.
Save transformation as:
sales_dashboard_etl.ktr in your workshop folder.
Product Lookup
Draw a Hop between the 'Sales' step & the 'Product Lookup' step.
Draw a Hop between the 'Products' step & the 'Product Lookup' step.
The Sales step acts as our fact table. It holds the transaction data for our Products & Customers.
Double-click on the 'Product Lookup' step, and configure with the following properties:
Step name
Product Lookup
Lookup step
Products
Key field (stream)
product_id
Key field (lookup)
product_id
In Values to retrieve, add:
product_name (keep as product_name)
category (rename to product_category)
price (rename to unit_price)

Customers Lookup
Draw a Hop between the 'Product Lookup' step & the 'Customers Lookup' step.
Draw a Hop between the 'Customers' step & the 'Customers Lookup' step.
Double-click on the 'Customers Lookup' step, and configure with the following properties:
Step name
Customers Lookup
Lookup step
Customers
Key field (stream)
customer_id
Key field (lookup)
customer_id
Values to retrieve:
first_name
last_name
country (rename to customer_country)
status (rename to customer_status)

Preview data
Save the transformation.
RUN & Preview the data.

Calculator
The Calculator step provides predefined functions that you can run on input field values. Use Calculator as a quick alternative to custom JavaScript for common calculations.
To use Calculator, specify the input fields and the calculation type, and then write results to new fields. You can also remove temporary fields from the output after all values are calculated.

Drag & drop a 'Calculator' step onto the canvas.
Draw a Hop from the 'Customers Lookup' step to the 'Calculator' step.
Double-click on the 'Calculator' step, and configure the following properties:
New field | Calculation | Field A | Field B | Value type
line_total | A * B | quantity | unit_price | Number
profit_margin | A - B | sale_amount | line_total | Number
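The two Calculator rows amount to simple per-row arithmetic. A sketch with invented sample values:

```python
# One stream row after the lookups (sample values are invented).
row = {"quantity": 3, "unit_price": 24.5, "sale_amount": 69.0}

# line_total = A * B, where A = quantity and B = unit_price
row["line_total"] = row["quantity"] * row["unit_price"]

# profit_margin = A - B, where A = sale_amount and B = line_total
row["profit_margin"] = row["sale_amount"] - row["line_total"]

print(row["line_total"], row["profit_margin"])
```

A negative result in the second field means the sale closed below the list-price total, i.e. the customer effectively received a discount.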

Preview data
Save the transformation.
RUN & Preview the data.

Business Insight Enabled:
Negative profit_margin: Customer received a discount (common)
Positive profit_margin: Customer paid MORE than catalog price (premium service, expedited shipping, etc.)
Zero profit_margin: Sold at list price
Formula
The Formula step can calculate Formula Expressions within a data stream. It can be used to create simple calculations like [A]+[B] or more complex business logic with a lot of nested if / then logic.

Drag & drop a 'Formula' step onto the canvas.
Draw a Hop from the 'Calculator' step to the 'Formula' step.
Double-click on the 'Formula' step, and configure the following properties:
customer_full_name
CONCATENATE([first_name];" ";[last_name])
is_high_value
IF([sale_amount]>500;"Yes";"No")
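The two formulas behave like simple string and conditional functions. A Python sketch of the same logic (the sample row is invented):

```python
def concatenate(*parts):
    # CONCATENATE([first_name];" ";[last_name])
    return "".join(parts)

def iif(condition, if_true, if_false):
    # IF([sale_amount]>500;"Yes";"No")
    return if_true if condition else if_false

row = {"first_name": "Ada", "last_name": "Lovelace", "sale_amount": 620.0}
row["customer_full_name"] = concatenate(row["first_name"], " ", row["last_name"])
row["is_high_value"] = iif(row["sale_amount"] > 500, "Yes", "No")
print(row["customer_full_name"], row["is_high_value"])
```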

Preview data
Save the transformation.
RUN & Preview the data.

Business Applications:
is_high_value: Trigger VIP customer service workflows
discount_percentage: Measure promotion effectiveness by channel
payment_risk: Flag transactions for fraud review
Add Constants
The Add constant values step is a simple and high performance way to add constant values to the stream.

Drag & drop 'Add constants' step onto the canvas.
Draw a Hop from the 'Formula' step to the 'Add constants' step.
Double-click on the 'Add constants' step, and configure the following properties:
data_source
String
minio_workshop

Get system info
This step retrieves system information from the Kettle environment. The step includes a table where you can designate a name and assign it to any available system info type you want to retrieve. This step generates a single row with the fields containing the requested information.
It can also accept any number of input streams, aggregate any fields defined by this step, and send the combined results to the output stream.

Drag & drop 'Get system info' step onto the canvas.
Draw a Hop from the 'Add constants' step to the 'Get system info' step.
Double-click on the 'Get system info' step, and configure the following properties:
etl_timestamp
system date (variable)

Select Values
The Select Values step can perform all the following actions on fields in the PDI stream:
Select fields - Use this tab to select, reorder, and rename the fields to keep in the stream.
Remove fields - Use this tab to remove fields from the input stream.
Meta-data - Use this tab to change the name, type, length, and precision (the metadata) of one or more fields.

Drag & drop a 'Select values' step onto the canvas.
Draw a Hop from the 'Get system info' step to the 'Select values' step.
Double-click on the 'Select values' step, and configure the following properties:
On Select & Alter tab, choose fields in order:
sale_id
sale_date
customer_id
customer_full_name
customer_country
customer_status
product_id
product_name
product_category
quantity
unit_price
sale_amount
line_total
profit_margin
is_high_value
payment_method
status (rename to sale_status)
etl_timestamp
data_source

Preview data
Save the transformation.
RUN & Preview the data.

Text file output
The Text File Output step exports rows to a text file.
This step is commonly used to generate delimited files (for example, CSV) that can be read by spreadsheet applications, and it can also generate fixed-length output.
You can’t run this step in parallel to write to the same file.
If you need to run multiple copies, select Include stepnr in filename and merge the resulting files afterward.

Drag & drop a 'Text file output' step onto the canvas.
Draw a Hop from the 'Select values' step to the 'Write to staging' step.
Double-click on the 'Write to staging' step, and configure with the following properties:
Step name
Write to Staging
Filename
pvfs://MinIO/staging/dashboard/sales_fact
Extension
csv
Include date/time in filename
✅
Separator
,
Add header
✅
Remember to: Get Fields
Business Benefit: Timestamped files enable:
Historical tracking: "What did the data look like last Tuesday?"
Incremental processing: Keep processing latest file without overwriting history
Rollback capability: "The 3pm run had bad data, revert to 2pm version"
MinIO
Save the transformation.
Log into MinIO:

Checklist
Inventory Reconciliation - XML + CSV Integration
This workshop demonstrates how Pentaho Data Integration eliminates costly inventory discrepancies by automatically reconciling data between warehouse management systems (XML feeds) and ERP product catalogs (CSV files). Organizations lose millions annually due to inventory inaccuracies, stockouts, and overstocking. PDI's ability to parse complex XML and perform full outer joins enables real-time discrepancy detection that would require hours of manual spreadsheet work.
Business Value Delivered:
Cost Reduction: Eliminate manual reconciliation labor ($75K-150K annually per analyst)
Inventory Optimization: Reduce excess inventory carrying costs by 15-25%
Stockout Prevention: Identify missing items before customers notice
Compliance: Audit trail for SOX, ISO 9001, and supply chain regulations
Real-Time Visibility: Know your actual inventory position within minutes, not days
Scenario: A manufacturing company operates 12 distribution warehouses. Each warehouse uses a legacy WMS (Warehouse Management System) that exports XML inventory files nightly. The corporate ERP system maintains a CSV product master catalog. Discrepancies cause:
Phantom stock: ERP shows item in stock, warehouse says it's not → Lost sales
Ghost inventory: Warehouse has items ERP doesn't recognize → Dead capital
Quantity variances: Mismatches of 10+ units trigger expensive physical counts
Key Stakeholders:
Supply Chain Directors: Need accurate inventory positions across all locations
Warehouse Managers: Require daily reconciliation reports to prioritize cycle counts
Finance Teams: Must report accurate inventory valuations for financial statements
Procurement: Need to identify slow-moving items and prevent overstocking

Follow the steps to create the transformation:
Drag & drop 'Get data from XML' onto the canvas.
Save transformation as:
inventory_reconciliation.ktr in your workshop folder.
Double-click on the 'Get data from XML' step, and configure with the following properties:
Step name
Read Warehouse XML
File or directory
pvfs://MinIO/raw-data/xml/inventory.xml
Loop XPath
/inventory/items/item
Encoding
UTF-8
Ignore comments
✅
Validate XML
No
Ignore empty file
✅
XPath Explanation:
/inventory = Start at the root element
/items = Navigate to the items container
/item = Loop over each item element
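The Loop XPath can be checked outside PDI with a few lines of Python. The stand-in XML below mirrors the element names used in this workshop, while the values are invented:

```python
import xml.etree.ElementTree as ET

# Miniature stand-in for inventory.xml (element names match the workshop).
xml_doc = """<inventory>
  <items>
    <item><sku>PROD-001</sku><name>Widget</name><quantity>40</quantity></item>
    <item><sku>PROD-002</sku><name>Gadget</name><quantity>12</quantity></item>
  </items>
</inventory>"""

root = ET.fromstring(xml_doc)
# Loop XPath /inventory/items/item: root is <inventory>, so iterate items/item
# and read each field with a path relative to the looped element.
rows = []
for item in root.findall("./items/item"):
    rows.append({
        "sku": item.findtext("sku"),
        "name": item.findtext("name"),
        "quantity": int(item.findtext("quantity")),
    })
print(rows)
```

Each `<item>` element becomes one output row, exactly as the Get data from XML step produces one row per Loop XPath match.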
Browse & Add the path to the inventory.xml
Click on the Content tab

Click on the Fields tab & Get Fields.
Remap the fields & Preview rows.
Business Field Naming:
Prefix with warehouse_ to distinguish from ERP fields later.
warehouse_quantity vs. stock_quantity makes joins clearer.
Keep original field names in a data dictionary for auditing.
New field | Source element
warehouse_item_name | name
warehouse_quantity | quantity
warehouse_location | location
last_physical_count | last_checked

Customer 360
Create unified customer profiles combining demographic data, purchase history, and behavioral events.
Skills: Multiple joins, JSONL parsing, aggregations, calculated metrics

Text file input
The Text File Input step is used to read data from a variety of different text-file types, most commonly Comma Separated Values (CSV) files generated by spreadsheets and fixed width flat files.
You can specify a list of files to read, or a list of directories with wildcards, and you can accept filenames from a previous step.
Drag & drop 'Text file input' steps onto the canvas.
Save transformation as:
customer_360.ktr in your workshop folder.
Sales (Order Management)
Double-click on the first TFI step, and configure with the following properties:
Step name
Sales
Filename
pvfs://MinIO/raw-data/csv/sales.csv
Delimiter
,
Header row present
✅
Format
mixed

Click: Get Fields to auto-detect columns.
Business Logic: Note that sale_amount may differ from price * quantity due to:
Volume discounts
Promotional pricing
Customer-specific pricing tiers
Currency conversion (for international sales)

Preview data.

Business Significance:
sale_amount: Actual revenue (may include discounts)
quantity: Volume metrics for demand planning
payment_method: Payment preference insights
status: Filter out cancelled/refunded orders
Transactions & Fraud Detection
Objective: Process credit card transactions, enrich with account and merchant data, calculate transaction metrics, and detect suspicious patterns using rule-based fraud detection.
Skills: Financial data processing, multi-table joins, running totals, rule-based fraud detection, transaction velocity analysis
Business Context: A payment processor needs to analyze transaction data in real-time to detect potentially fraudulent activity before authorizing transactions. The system must flag high-risk transactions based on amount thresholds, unusual merchant activity, account balance checks, and transaction velocity patterns.
Data Lake Ingestion
Modern data lakes often receive the same entities (products, customers, orders) from multiple sources in different formats. This workshop demonstrates how to ingest, normalize, validate, and deduplicate multi-format data into a unified schema - a common data engineering pattern.
Objective: Combine data from CSV, JSON, and XML into a unified product schema.
Skills: Multi-format parsing, schema normalization, data validation, deduplication
Define Target Schema
Objective: Design a unified schema that accommodates all source formats.
Why Important: Before ingesting data, you need a clear target schema. This ensures consistency across all sources and makes downstream analytics easier.
Field | Type | Length | Description | Source mapping
product_id | String | 50 | Unique product identifier | CSV: product_id, JSON: product_id, XML: sku
product_name | String | 200 | Product display name | CSV: product_name, JSON: product_name, XML: name
category | String | 100 | Product category | CSV: category, JSON: (derived from order type), XML: category
price | Number | 15,2 | Unit price in USD | CSV: price, JSON: unit_price, XML: null (not available)
quantity | Integer | 10 | Available stock quantity | CSV: stock_quantity, JSON: quantity, XML: quantity
source_system | String | 10 | Origin system identifier | Constant: 'csv', 'json', or 'xml'
ingestion_time | Timestamp | - | When record was ingested | System timestamp
Schema Discovery & Analysis
Objective: Understand each source structure before you design the target schema.
Why it matters: You can’t normalize what you haven’t inspected.
Inspect each source
Use real samples. Avoid guessing field names.
CSV findings
Has product_id, product_name, category, price, stock_quantity.
Completeness looks high.
Naming is consistent and explicit.
JSON findings
Has product_id and product_name.
Uses unit_price instead of price.
quantity is order quantity, not stock.
category is missing.
Path is $.data.orders[*].items[*].
XML findings
Uses sku for product_id.
Uses name for product_name.
Has category and warehouse quantity.
price is missing.
location is extra for a product master.
Build a field mapping matrix
This shows name differences and missing fields.
Field | CSV | JSON | XML | Notes
Identifier | product_id | product_id | sku | Same meaning. Different name in XML.
Name | product_name | product_name | name | Same meaning. Different name in XML.
Category | category | ❌ | category | Missing in JSON.
Price | price | unit_price | ❌ | Different name in JSON. Missing in XML.
Stock quantity | stock_quantity | quantity | quantity | JSON quantity is not stock.
What to watch
Missing data is normal in multi-source ingestion.
Same name can mean different things.
Make schema decisions
Write these down. You will forget them later.
Field names
Use CSV naming as the standard.
Map XML sku → product_id and name → product_name.
Map JSON unit_price → price.
Missing fields
Missing category in JSON: set a default like E-commerce.
Missing price in XML: leave NULL.
Data types
product_id: string. It contains the PROD- prefix.
product_name: string. Allow up to 200 chars.
category: string. Allow up to 100 chars.
price: decimal(15,2).
quantity: integer.
Metadata
Add source_system for lineage.
Add ingestion_time for auditability.
Define a deduplication rule
Same product_id can appear in multiple sources.
Recommended rule
Prefer CSV.
Then JSON.
Then XML.
Implement this with source_priority (CSV=1, JSON=2, XML=3).
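The priority rule reduces to "keep the row with the lowest source_priority per product_id". A sketch of the rule (the rows are invented):

```python
# Lower number wins, as recommended: CSV=1, JSON=2, XML=3.
PRIORITY = {"csv": 1, "json": 2, "xml": 3}

rows = [
    {"product_id": "PROD-001", "source_system": "xml", "product_name": "Widget (WH)"},
    {"product_id": "PROD-001", "source_system": "csv", "product_name": "Widget"},
    {"product_id": "PROD-002", "source_system": "json", "product_name": "Gadget"},
]

# Keep, for each product_id, the row from the highest-priority source.
best = {}
for row in rows:
    pid = row["product_id"]
    if pid not in best or PRIORITY[row["source_system"]] < PRIORITY[best[pid]["source_system"]]:
        best[pid] = row

deduped = list(best.values())
print(deduped)
```

In PDI the same effect is typically achieved by adding a source_priority field, sorting on product_id and source_priority, and taking the first row per key.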
Checklist
You inspected real records for each source.
You captured paths for nested formats.
You documented mappings and type choices.
You decided how to handle missing data.
You decided how to dedupe collisions.
Step 2: Ingest CSV Products
Objective: Read CSV products and map to target schema.
Why CSV First: CSV is the simplest format and requires minimal transformation - good for testing your target schema.
Configuration:
Add Text file input step
Name: "Read CSV Products"
File/directory:
pvfs://MinIO/raw-data/csv/products.csv
Separator: Comma (,)
Enclosure: " (double quote)
Header: ☑ Header row present
Fields tab (click "Get Fields"):
Name | Type | Format | Length | Precision
product_id | String | | 50 |
product_name | String | | 200 |
category | String | | 100 |
price | Number | #.## | 15 | 2
stock_quantity | Integer | | 10 |
Add Select values step
Name: "Map CSV to Target Schema"
Select & Alter tab - Rename fields to match target:
Fieldname (from previous) | Rename to | Type | Length | Precision
product_id | product_id | String | 50 |
product_name | product_name | String | 200 |
category | category | String | 100 |
price | price | Number | 15 | 2
stock_quantity | quantity | Integer | 10 |
Add Add constants step
Name: "Add CSV Metadata"
Fields tab:
Field name | Type | Value
source_system | String | csv
Add Get System Info step
Name: "Add Ingestion Timestamp"
Fields tab:
Name | Type
ingestion_time | system date (variable)
Expected Output (Preview):
Row Count: 12 products from CSV
Step 3: Ingest JSON Products
Objective: Extract product data from nested JSON structure.
Why More Complex: JSON often contains nested structures. You need to use JSONPath to navigate to the data you want.
JSON Structure Overview:
JSONPath: $.data.orders[*].items[*] extracts all items from all orders.
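Outside PDI you can verify the path with Python's stdlib json module: $.data.orders[*].items[*] corresponds to a double loop over orders and their items. The miniature document below only mirrors the structure, with invented values:

```python
import json

# Miniature stand-in for api_response.json (structure follows the workshop's path).
doc = json.loads("""{
  "data": {
    "orders": [
      {"order_id": 1, "items": [{"product_id": "PROD-001", "unit_price": 9.99, "quantity": 2}]},
      {"order_id": 2, "items": [{"product_id": "PROD-002", "unit_price": 24.5, "quantity": 1}]}
    ]
  }
}""")

# $.data.orders[*].items[*] as a nested comprehension:
items = [item for order in doc["data"]["orders"] for item in order["items"]]
print(items)
```

Each item dictionary becomes one stream row, which is why the field paths in the step are written relative to the looped item.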
Configuration:
Add JSON Input step
Name: "Read JSON Products"
File tab:
File or directory:
pvfs://MinIO/raw-data/json/api_response.json
Include subfolders: ☐ No
Content tab:
Source is defined in a field: ☐ No
Source is a URL: ☐ No
Ignore empty file: ☑ Yes
Do not raise error if no files: ☐ No
Limit: 0 (no limit)
Fields tab:
Click Select fields and set JSONPath:
Name | Path | Type | Length | Precision
product_id | $.product_id | String | 50 |
product_name | $.product_name | String | 200 |
unit_price | $.unit_price | Number | 15 | 2
quantity | $.quantity | Integer | 10 |
Important: The base path $.data.orders[*].items[*] is entered in the Content tab → Path field (if available in your PDI version); otherwise use full paths like $.data.orders[*].items[*].product_id
Add Select values step
Name: "Map JSON to Target Schema"
Select & Alter tab:
Fieldname | Rename to | Type | Length | Precision
product_id | product_id | String | 50 |
product_name | product_name | String | 200 |
unit_price | price | Number | 15 | 2
quantity | quantity | Integer | 10 |
Add Add constants step
Name: "Add JSON Metadata"
Fields tab:
Field name | Type | Value
source_system | String | json
category | String | E-commerce
Note: JSON doesn't have category, so we set a default value "E-commerce"
Add Get System Info step
Name: "Add JSON Ingestion Timestamp"
Add field:
ingestion_time → Type: system date (variable)
Expected Output (Preview):
Row Count: ~10-15 product entries (some may be duplicates from different orders)
Step 4: Ingest XML Products
Objective: Extract product data from XML using XPath.
Why Different: XML uses hierarchical structure with tags. XPath (like SQL for XML) lets you query specific elements.
XML Structure Overview:
XPath: /inventory/items/item selects all <item> nodes
Configuration:
Add Get data from XML step
Name: "Read XML Products"
File tab:
File or directory:
pvfs://MinIO/raw-data/xml/inventory.xml
Include subfolders: ☐ No
Content tab:
Loop XPath:
/inventory/items/item
Encoding: UTF-8
Namespace aware: ☐ No (unless XML has namespaces)
Ignore comments: ☑ Yes
Validate: ☐ No
Fields tab - Define fields to extract:
Name | XPath | Element type | Type | Length | Format
sku | sku | Element | String | 50 |
name | name | Element | String | 200 |
category | category | Element | String | 100 |
quantity | quantity | Element | Integer | 10 |
XPath Tips:
Use relative paths from the Loop XPath
skumeans "look for<sku>child element"For attributes, use
@attribute_nameFor nested elements, use
parent/child
Add Select values step
Name: "Map XML to Target Schema"
Select & Alter tab:
Fieldname | Rename to | Type | Length | Precision
sku | product_id | String | 50 |
name | product_name | String | 200 |
category | category | String | 100 |
quantity | quantity | Integer | 10 |
Add Add constants step
Name: "Add XML Metadata"
Fields tab:
Field name | Type | Value
source_system | String | xml
price | Number | null
Note: XML inventory doesn't have price data (warehouse doesn't track retail prices), so we explicitly set it to null
Add Get System Info step
Name: "Add XML Ingestion Timestamp"
Add field:
ingestion_time → Type: system date (variable)
Expected Output (Preview):
Row Count: ~8-10 warehouse inventory items
Common XML Troubleshooting:
If no rows returned: Check Loop XPath is correct
If empty values: Verify field XPaths match element names (case-sensitive)
If namespace errors: Try checking "Ignore namespace" option
Merge Streams
Objective: Merge all three data streams (CSV, JSON, XML) into one unified stream.
Why Append Streams: This step stacks all rows from different sources vertically - like a SQL UNION ALL.
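The UNION ALL behaviour, including the same-layout requirement, can be sketched in a few lines (the rows are invented):

```python
# Three branches with identical row layout, as Append streams requires.
csv_rows = [{"product_id": "PROD-001", "source_system": "csv"}]
json_rows = [{"product_id": "PROD-002", "source_system": "json"}]
xml_rows = [{"product_id": "PROD-001", "source_system": "xml"}]

streams = [csv_rows, json_rows, xml_rows]

# All streams must share the same field names before appending.
expected = set(csv_rows[0])
assert all(set(r) == expected for s in streams for r in s), "row layout mismatch"

# Stack the rows vertically, like SQL UNION ALL (duplicates are kept).
combined = [row for stream in streams for row in stream]
print(combined)
```

Note that PROD-001 appears twice in the result; the duplicates survive the append and are only resolved by the later deduplication step.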
Configuration:
Add Append streams step
Name: "Combine All Products"
Connect all three streams to this step:
"Add Ingestion Timestamp" (CSV branch) → Append streams
"Add JSON Ingestion Timestamp" (JSON branch) → Append streams
"Add XML Ingestion Timestamp" (XML branch) → Append streams
Important: All input streams MUST have the same fields with the same names and types:
product_id (String)
product_name (String)
category (String)
price (Number) - can be null
quantity (Integer)
source_system (String)
ingestion_time (Timestamp)
Expected Output:
Row count: ~30-35 rows (12 CSV + 10-15 JSON + 8-10 XML)
All products from all sources combined
Some products will appear multiple times (duplicates to be handled in Step 7)
Preview Check:
Data Validation
Objective: Validate data quality and route bad records to error handling.
Why Important: Multi-source data often has quality issues. Better to catch and handle them explicitly than have them cause downstream failures.
Configuration:
Add Data Validator step
Name: "Validate Product Data"
Validations tab - Add validation rules:
Fieldname | Validation Type | Configuration | Error Message
product_id | NOT NULL | | Product ID is required
product_id | NOT EMPTY STRING | | Product ID cannot be empty
product_name | NOT NULL | | Product name is required
product_name | NOT EMPTY STRING | | Product name cannot be empty
price | NUMERIC RANGE | Min: 0, Max: 999999 | Price must be >= 0 (if present)
quantity | NUMERIC RANGE | Min: 0, Max: 999999 | Quantity must be >= 0
Options tab:
☑ Concatenate errors: Shows all validation errors for a row
Separator: ", " (comma-space)
☑ Output all errors as one field: validation_errors
Add Filter rows step after Data Validator
Name: "Route Valid vs Invalid"
Condition:
True (valid records) → Continue to deduplication
False (invalid records) → Error output
Add Text file output for errors (connect from False branch):
Name: "Write Error Records"
Filename:
pvfs://MinIO/curated/products/errors/validation_errors_${Internal.Job.Start.Date.yyyyMMdd}.csv
Include date in filename: Helps track when errors occurred
Fields to output: All fields + validation_errors
Expected Output:
Valid records: ~95-100% should pass (25-35 rows)
Invalid records: 0-5% to error file (0-2 rows)
Common Validation Failures:
Empty product_id or product_name
Negative price or quantity values
Non-numeric values in numeric fields
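The validation-and-routing logic above can be sketched in plain Python. The rules mirror the table in this step, and the two sample rows are invented:

```python
def validate(row):
    # Returns a list of error messages, matching the Data Validator rules.
    errors = []
    if not row.get("product_id"):
        errors.append("Product ID is required")
    if not row.get("product_name"):
        errors.append("Product name is required")
    price = row.get("price")
    if price is not None and not (0 <= price <= 999999):
        errors.append("Price must be >= 0")
    if not (0 <= row.get("quantity", 0) <= 999999):
        errors.append("Quantity must be >= 0")
    return errors

rows = [
    {"product_id": "PROD-001", "product_name": "Widget", "price": 9.99, "quantity": 40},
    {"product_id": "", "product_name": "Gadget", "price": -5.0, "quantity": 12},
]

# Route rows: valid continue downstream, invalid carry a concatenated
# validation_errors field, like the Filter rows split in the transformation.
valid, invalid = [], []
for row in rows:
    errs = validate(row)
    if errs:
        row["validation_errors"] = ", ".join(errs)
        invalid.append(row)
    else:
        valid.append(row)

print(len(valid), invalid)
```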
Installs and configures MinIO on Windows 11 running in Docker Desktop.
Create a MinIO folder and copy the required files.
Create directory & copy
Check the Directory has been created and the files copied over.
Execute the docker-compose script to create the container.
MinIO Container

Check the container is up and running in Desktop Docker.

Access MinIO UI:
Username: minioadmin
Password: minioadmin

The MinIO port has been changed to prevent conflicts.
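The docker-compose file driving this step is not reproduced on this page. As a hedged sketch only, a minimal MinIO service could look like the fragment below; the image tag, volume path, and console port are assumptions, and you should keep whatever host-port mapping your workshop file uses to avoid the conflict mentioned above.

```yaml
services:
  minio:
    image: minio/minio
    container_name: minio
    command: server /data --console-address ":9001"
    ports:
      - "9000:9000"   # S3 API - shift the host port if 9000 clashes
      - "9001:9001"   # Web console
    environment:
      MINIO_ROOT_USER: minioadmin
      MINIO_ROOT_PASSWORD: minioadmin
    volumes:
      - ./data:/data
```

Bring the service up with `docker compose up -d` and confirm the container state with `docker ps`.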
The population scripts create realistic datasets in multiple formats commonly used in data integration workflows:
Data Sources Created
CSV Files - Structured tabular data
customers.csv - Customer records (12 entries)
products.csv - Product catalog (12 entries)
sales.csv - Sales transactions (15 entries)
JSON Files - API responses and configuration
api_response.json - Nested API response with orders
user_events.json - Event stream (JSONL format)
config.json - Application configuration
XML Files - Legacy system exports
inventory.xml - Warehouse inventory data
employees.xml - HR employee records
Log Files - Application and system logs
application.log - Structured application logs
access.log - Web server access logs
error.log - Error and warning logs
Parquet Files (Optional - requires Python)
transactions.parquet - Big data format for analytics
Buckets Created
raw-data - Landing zone for raw source files
staging - Intermediate processing area
curated - Clean, processed data ready for consumption
logs - Application and process logs
archive - Historical data archives
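The bucket-creation step can be sketched with the MinIO Client. The dry run below only writes the commands to a helper script so you can review them first; "local" is an assumed mc alias name, configured earlier with `mc alias set`.

```shell
# Dry run: write the mc commands that create each workshop bucket to a
# helper script. Review it, then execute with: sh make_buckets.sh
for bucket in raw-data staging curated logs archive; do
  echo mc mb "local/${bucket}"
done > make_buckets.sh
cat make_buckets.sh
```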
Ensure MinIO API is accessible.

Install the prerequisite packages to generate data and run script:
Checks Dependencies - Verifies MinIO Client (mc) is installed
Tests Connectivity - Ensures MinIO is running and accessible
Configures Client - Sets up MinIO Client alias
Creates Buckets - Creates organizational buckets
Generates CSV Files - Creates customer, product, and sales data
Generates JSON Files - Creates API responses and configuration
Generates XML Files - Creates inventory and employee data
Generates Log Files - Creates realistic application logs
Uploads to MinIO - Copies all files to appropriate buckets
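The generation commands themselves are not shown on this page; as a hedged sketch of the CSV step, a heredoc like the one below produces the same kind of file. The real script generates 12 customer records, and its actual column names may differ from these illustrative ones.

```shell
# Hedged sketch of the population script's CSV output (illustrative
# schema, three sample rows instead of the script's twelve).
cat > customers.csv <<'EOF'
customer_id,name,email,country
C001,Ada Lovelace,ada@example.com,UK
C002,Grace Hopper,grace@example.com,US
C003,Alan Turing,alan@example.com,UK
EOF
echo "rows (excluding header): $(($(wc -l < customers.csv) - 1))"
```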
Check python3 is installed.
Install python3-venv package.
Install required packages.
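These checks can be scripted. The sketch below verifies python3 and creates a virtual environment for the generator scripts; the package names in the comment are assumptions about what the optional Parquet step needs, not a confirmed requirements list.

```shell
# Verify python3 is available, then create and activate an isolated
# virtual environment for the data-generation scripts.
python3 --version
python3 -m venv minio-venv
. minio-venv/bin/activate
python -c 'import sys; print(sys.prefix)'
# Likely needed for the optional Parquet step (assumption):
# pip install pandas pyarrow
```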

Install MinIO Client (mc).
Verify it's installed.
Execute install script.

Virtual File Systems
PDI allows you to establish connections to most Virtual File Systems (VFS) through VFS connections. These connections store the necessary properties to access specific file systems, eliminating the need to repeatedly enter configuration details.
Once you've added a VFS connection in PDI, you can reference it whenever you need to work with files or folders on that Virtual File System. This streamlines your workflow by allowing you to reuse connection information across multiple steps.
For instance, if you're working with Hitachi Content Platform (HCP), you can create a single VFS connection and then use it throughout all HCP transformation steps. This approach saves time and ensures consistency by removing the need to re-enter credentials or access information for each data operation.
Start Pentaho Data Integration.
Windows - PowerShell
Linux
Create a VFS connection to the MinIO buckets
Click: 'View' Tab.
Right mouse click on VFS Connections > New.

Enter the following details:

| Field | Value |
| --- | --- |
| Connection Name | MinIO |
| Connection Type | Minio/HCP |
| Description | Connection to sales-data bucket |
| S3 Connection Type | Minio/HCP |
| Access Key | minioadmin |
| Secret Key | minioadmin |
| Endpoint | http://localhost:9000 (MinIO API endpoint) |
| Signature Version | AWSS3V4SignerType |
| PathStyle Access | Enabled |
| Root Folder Path | / |
Test the connection.
In this hands-on workshop, you'll learn to deploy and configure an SMB (Server Message Block) server, both natively on Linux and as a container in Docker Desktop.

Follow the instructions outlined below to deploy an SMB server on Ubuntu 24.04.
Ensure all installed Packages are up-to-date.
Install Samba server.
Make a copy of the existing configuration file, then create a new /etc/samba/smb.conf configuration file.
Any user on the Samba user list must also exist in the /etc/passwd file.
Add the home directory share.
Copy & paste the following to the bottom of the file - private home & public access.
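The share definitions themselves are not reproduced on this page; a minimal sketch matching the description (private home shares plus an anonymous public share on /var/samba) could look like this:

```ini
[homes]
   comment = Home Directories
   browseable = no
   read only = no
   valid users = %S

[public]
   comment = Public Share
   path = /var/samba
   browseable = yes
   read only = no
   guest ok = yes
```

After saving, register a Samba password for the pentaho user with `sudo smbpasswd -a pentaho` (the account must already exist in /etc/passwd).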
Save.
Create a directory that mounts public share and change its access permission.
Restart your samba server.
SMB Server
We're going to set up the Samba server with a shareable public directory - /var/samba/ - that can be accessed anonymously.
Next, we'll enable access to the pentaho user's /pentaho/home directory. Obviously you'll need to be a registered user with a password to access it.
Let’s create some test files.
Public
A 'public' directory that can be accessed from any machine ..
In File Explorer, select: + Other Locations.
Enter the following connection details:

Connect as: Anonymous.

You should see the public-share file.

Registered
Only registered users can access the /pentaho/home directory ..
In File Explorer, select: + Other Locations.
Enter the following connection details:

Connect as: Registered User.
Username: pentaho
Domain: WORKGROUP
Password: password

You should see the public-share file somewhere in the /home directory.
Follow the instructions below to deploy:
Follow the instructions below to deploy an SMB Docker container. We're going to deploy a very simple SMB server with 2 users:
Alice -
Bob -
Windows already runs a default SMB server on port 445, so the container's port is mapped to 1445 instead.
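The workshop's compose file is not shown here. As a sketch only, the widely used dperson/samba image can provide the two users and the shifted port; the passwords and share paths below are placeholders, not the workshop's actual values.

```yaml
services:
  samba:
    image: dperson/samba
    container_name: smb
    ports:
      - "1445:445"   # shifted: Windows already owns port 445
    command: >
      -u "alice;password"
      -u "bob;password"
      -s "alice;/share/alice;yes;no;no;alice"
      -s "bob;/share/bob;yes;no;no;bob"
    volumes:
      - ./alice:/share/alice
      - ./bob:/share/bob
```

Here `-u "user;password"` defines a user and `-s` defines a share as "name;path;browseable;readonly;guest;users", so each user gets a private, writable share.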
Pentaho Data Integration
(Optional) Download the latest jcifs driver.
(Optional) Copy the JCIFS JAR file into Pentaho Data Integration "lib" folder.
Download CIFS driver
Pentaho Data Integration ships with jcifs-1.3.3.jar
If you wish to replace the current driver, rename the existing jcifs-1.3.3.jar to jcifs-1.3.3.jar.bak.
The jcifs 2.1.40.jar driver has been downloaded to the Workshop--Data-Integration/Drivers folder.
Create SMB Share Directories
Create an SMB folder and copy the required files. We'll also add some sample data.
Create directory & copy - PowerShell
Check the Directory has been created and the files copied over.

Creating Local Windows Users
Let's add our SMB users to the system:
Bob -
Alice -
Right-click on the Start button.
Select Computer Management from the context menu.
Alternatively, press Win + X and select Computer Management.
In Computer Management, expand System Tools.
Click on: Local Users and Groups.
Select Users folder.
Right-click in the Users pane (right side).
Select New User...
Fill in the following details:
Username: bob
Full Name: Bob Smith
Password: password

Click Create
Click Close
Repeat the workflow to create: alice

Configure Folder Permissions
Open File Explorer and navigate to C:\SMB\Bob.
Right-click on the Bob folder.
Select Properties.
Click the Security tab.
Click Edit...
Click Add...
Type bob in the text box and click: Check Names.
Click OK.
Select: bob in the permissions list.

Check the following permissions:
Click OK twice

Repeat the workflow for alice's folder.

SMB Share
Right-click on the C:\SMB\Bob folder.
Select Properties.
Click the Sharing tab.
Click Advanced Sharing...

☑ Check "Share this folder"
Share name: Bob (default is fine).

Click Permissions.
Remove "Everyone" if present (select and click Remove).
Click Add...

Add the following user:
Type bob, click Check Names, then click OK.

Set permissions for Bob Smith:
Select Bob Smith: ☑ Full Control

Click OK three times.
In the Sharing tab, note the Network Path: \\[Computer Name]\Bob
From another computer: \\[your-computer-ip]\Bob

Obviously, if you want to play around with the SMB shares for Alice & 'Shared', you will have to repeat the workflow.
Test
Time to test ..
Let's see if we can access C:\SMB\Bob from across the Network.
In File Explorer or the Run command, enter the following UNC path to access C:\SMB\Bob.
You should see the following popup message displayed:
