Storage

Setup Object Stores & SMB ..

Object Stores

Object storage systems like Amazon S3 and MinIO provide a way to store and retrieve large amounts of unstructured data such as files, images, videos, and backups through a simple web-based API. Unlike traditional file systems that organize data in hierarchical folders, object stores use a flat namespace where each piece of data (called an object) is stored in containers called buckets and accessed via unique keys or URLs.

Amazon S3 is AWS's flagship object storage service that offers virtually unlimited scalability, multiple storage classes for different use cases, and integration with other AWS services.

MinIO is an open-source alternative that provides S3-compatible APIs and can be deployed on-premises or in private clouds, making it popular for organizations that want object storage capabilities without vendor lock-in.

Both systems are designed for high durability, availability, and can handle massive scale while providing simple REST API access for applications to store and retrieve data programmatically.

The following steps are intended for setting up a Pentaho Lab environment and need to be completed in order to complete the Workshops.

Ensure you have downloaded the Workshop--Installation:

cd
git clone https://github.com/jporeilly/Workshop--Data-Integration

To install git:

sudo apt install git

Prerequisites

Ubuntu 24.04 LTS system (physical or virtual machine)
User account with sudo privileges
Internet connection
Basic familiarity with Linux command line

MinIO

Follow the instructions below to setup a MinIO Docker Container.

Select your OS & add the Sample Data, finally configure a VFS connection in Data Integration:

Installs and configures MinIO on Ubuntu 24.04 running in Docker.

Create a MinIO folder and copy the required files.

Create directory & copy

cd
cd ~/Workshop--Data-Integration/Setup/MinIO/linux
chmod +x ./copy-minio.sh && sudo ./copy-minio.sh

Ensure all the files have successfully been copied over.

cd 
cd /opt/minio
ls -al

Execute the docker-compose script to create the container.

MinIO Container

cd
cd /opt/minio
sudo ./run-docker-minio.sh

Following best industry practices, MinIO is installed as root in the /opt/minio directory. If you wish the pentaho user to also manage the service then you may need to add the user to the Docker group.

# Install to /opt/minio (system-wide)
sudo ./copy-minio.sh

# Add pentaho user to docker group (no sudo needed for Docker commands)
sudo usermod -aG docker pentaho

# Pentaho user can now manage MinIO without sudo
docker compose -f /opt/minio/docker-compose.yml ps
docker compose -f /opt/minio/docker-compose.yml logs

Check the container is up and running in Docker.

docker ps --filter "name=minio"

Log into MinIO.

http://localhost:9002/loginlocalhost

Username: minioadmin

Password: minioadmin

If you have completed the setup: MinIO then you should have pre-populated buckets with various data objects in different formats.

New Bucket

If you need to create a Bucket:

Click the 'Create Bucket' link.
Enter: sales-data & 'Create Bucket'.

Click on the Upload button.

Upload your data - for example some sales data:

Windows - PowerShell

Set-Location C:\Pentaho\design-tools\data-integration\samples\transformations\files

Linux

cd
cd ~/Pentaho/design-tools/data-integration/samples/transformations/files

Workshops

Pentaho Data Integration: MinIO Object Storage Workshop Series

Modern organizations increasingly store their data in cloud-native object storage systems like MinIO and Amazon S3, moving away from traditional file servers and databases. This architectural shift enables scalable, cost-effective data lakes but introduces new challenges: data arrives in multiple formats (CSV, JSON, XML, Parquet), exists across distributed buckets, and requires sophisticated transformation pipelines to unlock its analytical value. Learning to efficiently extract, transform, and integrate data from object storage is now essential for any data integration professional working with contemporary data architectures.

In this comprehensive workshop series, you'll build progressively complex transformation pipelines that leverage MinIO object storage as both a source and destination for enterprise data integration scenarios. Starting with fundamental ETL patterns like denormalized fact table creation, you'll advance through intermediate challenges involving multi-format parsing and reconciliation, ultimately mastering advanced techniques like sessionization, anomaly detection, and schema normalization across heterogeneous data sources.

Each workshop introduces real-world business scenarios - from sales dashboards to customer analytics to operational monitoring - demonstrating PDI's versatility in solving diverse integration challenges while maintaining cloud-native architecture principles.

What You'll Accomplish:

Configure VFS (Virtual File System) connections to access S3-compatible MinIO object storage
Build multi-source ETL pipelines using Text File Input steps with S3 paths (s3a://)
Implement Stream Lookup and Merge Join patterns to enrich data from multiple CSV sources
Parse semi-structured formats including XML inventory feeds and JSONL event streams
Apply full outer joins to identify discrepancies between warehouse and catalog systems
Aggregate customer data across transactional, demographic, and behavioral dimensions
Perform sessionization and funnel analysis on clickstream data using Group By and pivoting
Extract structured data from unstructured logs using Regular Expression evaluation
Detect anomalies in time-series data through rolling averages and conditional logic
Normalize schemas across CSV, JSON, and XML sources into unified data lake structures
Implement data validation, deduplication, and quality controls for multi-format ingestion
Calculate derived metrics including customer lifetime value, engagement scores, and conversion rates
Route data dynamically using Switch/Case and Filter Rows for conditional processing
Output transformed data to staging and curated layers following data lake architecture patterns

By the end of this workshop series, you'll have mastered the complete spectrum of cloud-native data integration patterns using Pentaho Data Integration. You'll understand how to handle diverse source formats, implement sophisticated join and aggregation logic, perform advanced text parsing and time-series analysis, and build production-ready pipelines that leverage object storage for scalable, distributed data processing.

Instead of treating each data format as a unique challenge requiring custom scripts, you'll confidently design reusable, visual transformation workflows that automate complex integration scenarios - from operational reconciliation to customer intelligence to real-time anomaly detection.

Prerequisites: MinIO running with sample data populated; basic understanding of transformation concepts (steps, hops, preview); familiarity with joins and aggregations

Estimated Time: 4-6 hours total (individual workshops range from 20-60 minutes based on complexity)

Workshop

Key Skills

Sales Dashboard

joins, lookups, aggregations

Inventory Reconciliation

XML parsing, outer joins, variance

Customer 360

multi-source, JSONL, calculations

Clickstream Funnel

sessionization, pivoting

Log Parsing

regex, time-series analysis

Data Lake Ingestion

schema normalization, validation

1. Verify that MinIO is running and populated.

# Check MinIO is running
curl -sf http://localhost:9000/minio/health/live && echo "MinIO OK" || echo "MinIO not running"

# Verify data exists (using mc client)
mc ls minio-local/raw-data --recursive

Start Pentaho Data Integration.

Windows - PowerShell:

Set-Location C:\Pentaho\design-tools\data-integration
.\spoon.bat

Linux:

cd
cd ~/Pentaho/design-tools/data-integration
./spoon.sh

Workshops

Sales Dashboard

The workshop demonstrates how Pentaho Data Integration enables organizations to rapidly create denormalized fact tables that power real-time business intelligence dashboards. By integrating data from multiple sources (customer data, product catalogs, and sales transactions), business users gain immediate access to actionable insights without waiting for IT to build complex data warehouses.

Scenario: A mid-sized e-commerce company needs to track daily sales performance across products, customer segments, and regions. Currently, sales managers wait 24-48 hours for IT to generate reports from disparate systems. With PDI, they can automate this process and refresh dashboards hourly.

Key Stakeholders:

Sales Directors: Need to identify top-performing products and regions
Marketing Teams: Require customer segmentation for targeted campaigns
Finance: Need accurate revenue reporting by product category
Operations: Must monitor inventory turnover rates

Follow the steps to create the transformation:

Text File Input

The Text File Input step is used to read data from a variety of different text-file types. The most commonly used formats include Comma Separated Values (CSV files) generated by spreadsheets and fixed width flat files.

The Text File Input step provides you with the ability to specify a list of files to read, or a list of directories with wild cards in the form of regular expressions. In addition, you can accept filenames from a previous step making filename handling more even more generic.

Drag & drop 3 Text File Input Steps onto the canvas.
Save transformation as: sales_dashboard_etl.ktr in your workshop folder.

Sales (Order Management)

Double-click on the first TFI step, and configure with the following properties:

Setting

Value

Step name

Sales

Filename

pvfs://Minio/raw-data/csv/sales.csv

Delimiter

Head row present

✅

Format

mixed

Click: Get Fields to auto-detect columns.

Business Logic: Note that sale_amount may differ from price * quantity due to:

Volume discounts
Promotional pricing
Customer-specific pricing tiers
Currency conversion (for international sales)

Preview data.

Business Significance:

sale_amount: Actual revenue (may include discounts)
quantity: Volume metrics for demand planning
payment_method: Payment preference insights
status: Filter out cancelled/refunded orders

Products (ERP system)

Double-click on the second TFI step, and configure with the following properties:

Setting

Value

Step name

Products

Filename

pvfs://Minio/raw-data/csv/products.csv

Delimiter

Head row present

✅

Format

mixed

Click: Get Fields to auto-detect columns.

Preview the data.

Business Significance:

category: Enables product performance analysis by segment
price: Base pricing for margin calculations
stock_quantity: Inventory turnover insights

Customers (CRM System)

Double-click on the third TFI step, and configure with the following properties:

Setting

Value

Step name

Customers

Filename

pvfs://MinIO/raw-data/csv/customers.csv

Delimiter

Header row present

✅

Format

mixed

Click: Get Fields to auto-detect columns.

Preview the data.

Business Significance:

customer_id: Primary key for joining to sales
country: Critical for geographic segmentation
status: Identifies churned vs. active customers
registration_date: Enables customer tenure analysis

Inventory Reconciliation - XML + CSV Integration

This workshop demonstrates how Pentaho Data Integration eliminates costly inventory discrepancies by automatically reconciling data between warehouse management systems (XML feeds) and ERP product catalogs (CSV files). Organizations lose millions annually due to inventory inaccuracies, stockouts, and overstocking. PDI's ability to parse complex XML and perform full outer joins enables real-time discrepancy detection that would require hours of manual spreadsheet work.

Business Value Delivered:

Cost Reduction: Eliminate manual reconciliation labor ($75K-150K annually per analyst)
Inventory Optimization: Reduce excess inventory carrying costs by 15-25%
Stockout Prevention: Identify missing items before customers notice
Compliance: Audit trail for SOX, ISO 9001, and supply chain regulations
Real-Time Visibility: Know your actual inventory position within minutes, not days

Scenario: A manufacturing company operates 12 distribution warehouses. Each warehouse uses a legacy WMS (Warehouse Management System) that exports XML inventory files nightly. The corporate ERP system maintains a CSV product master catalog. Discrepancies cause:

Phantom stock: ERP shows item in stock, warehouse says it's not → Lost sales
Ghost inventory: Warehouse has items ERP doesn't recognize → Dead capital
Quantity variances: Mismatches of 10+ units trigger expensive physical counts

Key Stakeholders:

Supply Chain Directors: Need accurate inventory positions across all locations
Warehouse Managers: Require daily reconciliation reports to prioritize cycle counts
Finance Teams: Must report accurate inventory valuations for financial statements
Procurement: Need to identify slow-moving items and prevent overstocking

Follow the steps to create the transformation:

Get data form XML

Drag & drop 'Get data from XML' onto the canvas.
Save transformation as: sales_dashboard_etl.ktr in your workshop folder.
Double-click on the 'Get data from XML' step, and configure with the following properties:

Setting

Value

Step name

Read Warehouse XML

File or directory

pvfs://MinIO/raw-data/xml/inventory.xml

Loop XPath

/inventory/items/item

Encoding

UTF-8

Ignore comments

✅

Validate XML

Ignore empty file

✅

XPath Explanation:

/inventory = Start at root element
/items = Navigate to items container
/item = Loop over each item element

Browse & Add the path to the inventory.xml

Click on the Content tab

Click on the Fields tab & Get Fields.
Remap the fields & Preview rows.

Business Field Naming:

Prefix with warehouse_ to distinguish from ERP fields later
warehouse_quantity vs. stock_quantity makes joins clearer
Keep original field names in a data dictionary for auditing

Name

XPath

warehouse_item_name

name

warehouse_quantity

quantity

warehouse_location

location

last_physical_count

last_checked

Data Lake Ingestion

Modern data lakes often receive the same entities (products, customers, orders) from multiple sources in different formats. This workshop demonstrates how to ingest, normalize, validate, and deduplicate multi-format data into a unified schema - a common data engineering pattern.

Objective: Combine data from CSV, JSON, and XML into a unified product schema.

Skills: Multi-format parsing, schema normalization, data validation, deduplication

Define Target Schema

Objective: Design a unified schema that accommodates all source formats.

Why Important: Before ingesting data, you need a clear target schema. This ensures consistency across all sources and makes downstream analytics easier.

Field

Type

Length

Description

Source Mapping

product_id

String

Unique product identifier

CSV: product_id JSON: product_id XML: sku

product_name

String

200

Product display name

CSV: product_name JSON: product_name XML: name

Schema Discovery & Analysis

Objective: Understand each source structure before you design the target schema.

Why it matters: You can’t normalize what you haven’t inspected.

Inspect each source

Use real samples. Avoid guessing field names.

Inspect the file

mc cat minio-local/raw-data/csv/products.csv | head -5

Sample output

product_id,product_name,category,price,stock_quantity
PROD-001,Laptop Pro 15,Electronics,999.99,50
PROD-002,Office Chair,Furniture,299.99,100
PROD-003,Coffee Maker,Appliances,79.99,200

Findings

Has product_id, product_name, category, price, stock_quantity.
Completeness looks high.
Naming is consistent and explicit.

Inspect one nested item

mc cat minio-local/raw-data/json/api_response.json | jq '.data.orders[0].items[0]'

Sample output

{
  "product_id": "PROD-001",
  "product_name": "Laptop Pro 15",
  "unit_price": 999.99,
  "quantity": 2
}

Findings

Has product_id and product_name.
Uses unit_price instead of price.
quantity is order quantity, not stock.
category is missing.
Path is $.data.orders[*].items[*].

Inspect one item node

mc cat minio-local/raw-data/xml/inventory.xml | grep -A 6 "<item>" | head -10

Sample output

<item>
  <sku>PROD-001</sku>
  <name>Laptop Pro 15</name>
  <category>Electronics</category>
  <quantity>50</quantity>
  <location>A-15</location>
</item>

Findings

Uses sku for product_id.
Uses name for product_name.
Has category and warehouse quantity.
price is missing.
location is extra for a product master.

Build a field mapping matrix

This shows name differences and missing fields.

Unified field

CSV

JSON

XML

Notes

Identifier

product_id

sku

Same meaning. Different name in XML.

Name

product_name

name

Same meaning. Different name in XML.

Step 2: Ingest CSV Products

Objective: Read CSV products and map to target schema.

Why CSV First: CSV is the simplest format and requires minimal transformation - good for testing your target schema.

Configuration:

Add Text file input step
- Name: "Read CSV Products"
- File/directory: pvfs://MinIO/raw-data/csv/products.csv
- Separator: Comma (,)
- Enclosure: " (double quote)
- Header: ☑ Header row present
Fields tab (click "Get Fields"):
Name
Type
Format
Length
Precision
product_id
String
50
product_name
String
200
category
String
100
price
Number
#.##
15
2
stock_quantity
Integer
10
Add Select values step
- Name: "Map CSV to Target Schema"
Select & Alter tab - Rename fields to match target:
Fieldname (from previous)
Rename to
Type
Length
Precision
product_id
product_id
String
50
product_name
product_name
String
200
category
category
String
100
price
price
Number
15
2
stock_quantity
quantity
Integer
10
Add Add constants step
- Name: "Add CSV Metadata"
Fields tab:
Field name
Type
Value
source_system
String
csv
Add Get System Info step
- Name: "Add Ingestion Timestamp"
Fields tab:
Name
Type
ingestion_time
system date (variable)

Expected Output (Preview):

product_id   product_name      category      price   quantity  source_system  ingestion_time
PROD-001     Laptop Pro 15     Electronics   999.99  50        csv            2024-01-23 10:30:45
PROD-002     Office Chair      Furniture     299.99  100       csv            2024-01-23 10:30:45
PROD-003     Coffee Maker      Appliances    79.99   200       csv            2024-01-23 10:30:45

Row Count: 12 products from CSV

Step 3: Ingest JSON Products

Objective: Extract product data from nested JSON structure.

Why More Complex: JSON often contains nested structures. You need to use JSONPath to navigate to the data you want.

JSON Structure Overview:

{
  "data": {
    "orders": [
      {
        "order_id": "ORD-001",
        "items": [
          {
            "product_id": "PROD-001",
            "product_name": "Laptop Pro 15",
            "unit_price": 999.99,
            "quantity": 2
          }
        ]
      }
    ]
  }
}

JSONPath: $.data.orders[*].items[*] extracts all items from all orders.

Configuration:

Add JSON Input step
- Name: "Read JSON Products"
- File tab:
  - File or directory: pvfs://MinIO/raw-data/json/api_response.json
  - Include subfolders: ☐ No
Content tab:
- Source is defined in a field: ☐ No
- Source is a URL: ☐ No
- Ignore empty file: ☑ Yes
- Do not raise error if no files: ☐ No
- Limit: 0 (no limit)
Fields tab:
Click Select fields and set JSONPath:
Name
Path
Type
Length
Precision
product_id
$.product_id
String
50
product_name
$.product_name
String
200
unit_price
$.unit_price
Number
15
2
quantity
$.quantity
Integer
10
Important: The base path $.data.orders[*].items[*] is entered in the Content tab → Path field (if available in your PDI version), or you use full paths like $.data.orders[*].items[*].product_id
Add Select values step
- Name: "Map JSON to Target Schema"
Select & Alter tab:
Fieldname
Rename to
Type
Length
Precision
product_id
product_id
String
50
product_name
product_name
String
200
unit_price
price
Number
15
2
quantity
quantity
Integer
10
Add Add constants step
- Name: "Add JSON Metadata"
Fields tab:
Field name
Type
Value
source_system
String
json
category
String
E-commerce
Note: JSON doesn't have category, so we set a default value "E-commerce"
Add Get System Info step
- Name: "Add JSON Ingestion Timestamp"
- Add field: ingestion_time → Type: system date (variable)

Expected Output (Preview):

product_id   product_name      category      price   quantity  source_system  ingestion_time
PROD-001     Laptop Pro 15     E-commerce    999.99  2         json           2024-01-23 10:30:46
PROD-005     Desk Lamp         E-commerce    45.00   3         json           2024-01-23 10:30:46
PROD-007     USB-C Cable       E-commerce    19.99   5         json           2024-01-23 10:30:46

Row Count: ~10-15 product entries (some may be duplicates from different orders)

Step 4: Ingest XML Products

Objective: Extract product data from XML using XPath.

Why Different: XML uses hierarchical structure with tags. XPath (like SQL for XML) lets you query specific elements.

XML Structure Overview:

<?xml version="1.0" encoding="UTF-8"?>
<inventory>
  <warehouse>Warehouse A</warehouse>
  <items>
    <item>
      <sku>PROD-001</sku>
      <name>Laptop Pro 15</name>
      <category>Electronics</category>
      <quantity>50</quantity>
      <location>A-15</location>
    </item>
    <item>
      <sku>PROD-002</sku>
      <name>Office Chair</name>
      <category>Furniture</category>
      <quantity>100</quantity>
      <location>B-23</location>
    </item>
  </items>
</inventory>

XPath: /inventory/items/item selects all <item> nodes

Configuration:

Add Get data from XML step
- Name: "Read XML Products"
File tab:
- File or directory: pvfs://MinIO/raw-data/xml/inventory.xml
- Include subfolders: ☐ No
Content tab:
- Loop XPath: /inventory/items/item
- Encoding: UTF-8
- Namespace aware: ☐ No (unless XML has namespaces)
- Ignore comments: ☑ Yes
- Validate: ☐ No
Fields tab - Define fields to extract:
Name
XPath
Element type
Type
Length
Format
sku
sku
Element
String
50
name
name
Element
String
200
category
category
Element
String
100
quantity
quantity
Element
Integer
10
XPath Tips:
- Use relative paths from the Loop XPath
- sku means "look for <sku> child element"
- For attributes, use @attribute_name
- For nested elements, use parent/child
Add Select values step
- Name: "Map XML to Target Schema"
Select & Alter tab:
Fieldname
Rename to
Type
Length
Precision
sku
product_id
String
50
name
product_name
String
200
category
category
String
100
quantity
quantity
Integer
10
Add Add constants step
- Name: "Add XML Metadata"
Fields tab:
Field name
Type
Value
source_system
String
xml
price
Number
null
Note: XML inventory doesn't have price data (warehouse doesn't track retail prices), so we explicitly set it to null
Add Get System Info step
- Name: "Add XML Ingestion Timestamp"
- Add field: ingestion_time → Type: system date (variable)

Expected Output (Preview):

product_id   product_name      category      price   quantity  source_system  ingestion_time
PROD-001     Laptop Pro 15     Electronics   null    50        xml            2024-01-23 10:30:47
PROD-002     Office Chair      Furniture     null    100       xml            2024-01-23 10:30:47
PROD-004     Monitor 27"       Electronics   null    75        xml            2024-01-23 10:30:47

Row Count: ~8-10 warehouse inventory items

Common XML Troubleshooting:

If no rows returned: Check Loop XPath is correct
If empty values: Verify field XPaths match element names (case-sensitive)
If namespace errors: Try checking "Ignore namespace" option

Merge Streams

Objective: Merge all three data streams (CSV, JSON, XML) into one unified stream.

Why Append Streams: This step stacks all rows from different sources vertically - like a SQL UNION ALL.

Configuration:

Add Append streams step
- Name: "Combine All Products"
Connect all three streams to this step:
- "Add Ingestion Timestamp" (CSV branch) → Append streams
- "Add JSON Ingestion Timestamp" (JSON branch) → Append streams
- "Add XML Ingestion Timestamp" (XML branch) → Append streams
Important: All input streams MUST have the same fields with the same names and types:
- product_id (String)
- product_name (String)
- category (String)
- price (Number) - can be null
- quantity (Integer)
- source_system (String)
- ingestion_time (Timestamp)

Expected Output:

Row count: ~30-35 rows (12 CSV + 10-15 JSON + 8-10 XML)
All products from all sources combined
Some products will appear multiple times (duplicates to be handled in Step 7)

Preview Check:

product_id   product_name      source_system  price
PROD-001     Laptop Pro 15     csv            999.99
PROD-002     Office Chair      csv            299.99
...
PROD-001     Laptop Pro 15     json           999.99   ← Duplicate!
PROD-005     Desk Lamp         json           45.00
...
PROD-001     Laptop Pro 15     xml            null     ← Duplicate, no price
PROD-002     Office Chair      xml            null

Data Validation

Objective: Validate data quality and route bad records to error handling.

Why Important: Multi-source data often has quality issues. Better to catch and handle them explicitly than have them cause downstream failures.

Configuration:

Add Data Validator step
- Name: "Validate Product Data"
Validations tab - Add validation rules:
Fieldname
Validation Type
Configuration
Error Message
product_id
NOT NULL
Product ID is required
product_id
NOT EMPTY STRING
Product ID cannot be empty
product_name
NOT NULL
Product name is required
product_name
NOT EMPTY STRING
Product name cannot be empty
price
NUMERIC RANGE
Min: 0, Max: 999999
Price must be >= 0 (if present)
quantity
NUMERIC RANGE
Min: 0, Max: 999999
Quantity must be >= 0
Options tab:
- ☑ Concatenate errors: Shows all validation errors for a row
- Separator: , (comma-space)
- ☑ Output all errors as one field: validation_errors
Add Filter rows step after Data Validator
- Name: "Route Valid vs Invalid"
Condition:
```
validation_errors IS NULL
```
- True (valid records) → Continue to deduplication
- False (invalid records) → Error output
Add Text file output for errors (connect from False branch):
- Name: "Write Error Records"
- Filename: pvfs://MinIO/curated/products/errors/validation_errors_${Internal.Job.Start.Date.yyyyMMdd}.csv
- Include date in filename: Helps track when errors occurred
- Fields to output: All fields + validation_errors

Expected Output:

Valid records: ~95-100% should pass (25-35 rows)
Invalid records: 0-5% to error file (0-2 rows)

Common Validation Failures:

Empty product_id or product_name
Negative price or quantity values
Non-numeric values in numeric fields

Installs and configures MinIO on Windows 11 running in Docker Desktop.

Create a MinIO folder and copy the required files.

Create directory & copy

cd \
cd C:\Workshop--Data-Integration\Setup\MinIO\windows
.\copy-minio.ps1

Check the Directory has been created and the files copied over.
Execute the docker-compose script to create the container.

MinIO Container

cd \
cd C:\MinIO
.\run-docker-minio.ps1

Check the container is up and running in Desktop Docker.

Access MinIO UI:

Username: minioadmin

Password: minioadmin

http://localhost:9000localhost

The MinIO port has been changed to prevent conflicts.

The population scripts create realistic datasets in multiple formats commonly used in data integration workflows:

Data Sources Created

CSV Files - Structured tabular data
- customers.csv - Customer records (12 entries)
- products.csv - Product catalog (12 entries)
- sales.csv - Sales transactions (15 entries)
JSON Files - API responses and configuration
- api_response.json - Nested API response with orders
- user_events.json - Event stream (JSONL format)
- config.json - Application configuration
XML Files - Legacy system exports
- inventory.xml - Warehouse inventory data
- employees.xml - HR employee records
Log Files - Application and system logs
- application.log - Structured application logs
- access.log - Web server access logs
- error.log - Error and warning logs
Parquet Files (Optional - requires Python)
- transactions.parquet - Big data format for analytics

Buckets Created

raw-data - Landing zone for raw source files
staging - Intermediate processing area
curated - Clean, processed data ready for consumption
logs - Application and process logs
archive - Historical data archives

Ensure MinIO API is accessible.

curl -sf http://localhost:9000/minio/health/live && echo "MinIO is running" || echo "MinIO is not responding"

Install the prerequisite packages to generate data and run script:

Checks Dependencies - Verifies MinIO Client (mc) is installed
Tests Connectivity - Ensures MinIO is running and accessible
Configures Client - Sets up MinIO Client alias
Creates Buckets - Creates organizational buckets
Generates CSV Files - Creates customer, product, and sales data
Generates JSON Files - Creates API responses and configuration
Generates XML Files - Creates inventory and employee data
Generates Log Files - Creates realistic application logs
Uploads to MinIO - Copies all files to appropriate buckets

Check python3 is installed.

python3 -V

Install python3-venv package.

sudo apt install python3.12-venv

Install required packages.

# Create venv
python3 -m venv ~/venv

# Activate it
source ~/venv/bin/activate

# Install packages
pip install pandas pyarrow

# When done, deactivate
deactivate

# Optional: Install utilities
sudo apt install curl jq

Install MinIO Client (mc).

wget https://dl.min.io/client/mc/release/linux-amd64/mc
chmod +x mc
sudo mv mc /usr/local/bin/

Verified its installed.

mc --version

Execute install script.

cd
cd /opt/minio

# Run the script
sudo ./populate-minio.sh

Virtual File Systems

PDI allows you to establish connections to most Virtual File Systems (VFS) through VFS connections. These connections store the necessary properties to access specific file systems, eliminating the need to repeatedly enter configuration details.

Once you've added a VFS connection in PDI, you can reference it whenever you need to work with files or folders on that Virtual File System. This streamlines your workflow by allowing you to reuse connection information across multiple steps.

For instance, if you're working with Hitachi Content Platform (HCP), you can create a single VFS connection and then use it throughout all HCP transformation steps. This approach saves time and ensures consistency by removing the need to re-enter credentials or access information for each data operation.

Start Pentaho Data Integration.

Windows - PowerShell

Set-Location C:\Pentaho\design-tools\data-integration
.\spoon.bat

Linux

cd
cd ~/Pentaho/design-tools/data-integration
./spoon.sh

Create a VFS connection to the MinIO buckets

Click: 'View' Tab.
Right mouse click on VFS Connections > New.

Enter the following details:

Setting

Value

Connection Name

MinIO

Connection Type

Minio/HCP

Description

Connection to sales-data bucket

S3 Connection Type

Minio/HCP

Access Key

minioadmin

Secret Key

minioadmin

Endpoint

http://localhost:9000 [MinIO API endpoint]

Signature Version

AWSS3V4SignerType

PathStyle Access

enable

Root Folder Path

Test the connection.

In this hands-on workshop, you'll learn to deploy and configure an SMB (Server Message Block) server using Docker Desktop and on Linux.

Follow the instructions outlined below to deploy an SMB server on Ubuntu 24.04.

Ensure all installed Packages are up-to-date.

sudo apt update && sudo apt upgrade -y

Install Samba server.

sudo apt install tasksel
sudo tasksel install samba-server

Make a copy of the existing configuration file and create a new /etc/samba/smb.conf configuration file

sudo cp /etc/samba/smb.conf /etc/samba/smb.conf_backup
sudo bash -c 'grep -v -E "^#|^;" /etc/samba/smb.conf_backup | grep . > /etc/samba/smb.conf'

Any user existing on the samba user list must also exist within the /etc/passwd file.

sudo smbpasswd -a pentaho

New SMB password: password
Retype new SMB password: password
Added user pentaho.

Add the home directory share.

sudo nano /etc/samba/smb.conf

Copy & paste the following to the bottom of the file - private home & public access.

[homes]
   comment = Home Directories
   browseable = yes
   read only = no
   create mask = 0700
   directory mask = 0700
   valid users = %S
[public]
  comment = public anonymous access
  path = /var/samba/
  browsable =yes
  create mask = 0660
  directory mask = 0771
  writable = yes
  guest ok = yes

Save.

CTRL + O
Enter
CTRL + X

Create a directory that mounts public share and change its access permission.

sudo mkdir /var/samba
sudo chmod 777 /var/samba/

Restart your samba server.

sudo systemctl restart smbd

SMB Server

We're going to setup the Samba server with access to shareable, public directory - /var/samba/ - that can be accessed anonymously.

Next .. access to the 'pentaho user' - /pentaho/home directory. Obviously you'll need to be a registered user with a password to access the directory.

Let’s create some test files.

touch /var/samba/public-share 
touch /home/pentaho/home-share

Public

A 'public' directory that be accessed from any machine ..

In File Explorer, select: + Other Locations.
Enter the following connection details:

smb://pentaho.local/public/

Connect as: Anonymous.

You should see the public-share file.

Registered

Only registered users can access the /pentaho/home directory ..

In File Explorer, select: + Other Locations.
Enter the following connection details:

smb://pentaho.local/homes/

Connect as: Registered User.

Username: pentaho

Domain: WORKGROUP

Password: password

You should see the public-share file somewhere in the /home directory.
Follow the instructions below to deploy:

Follow the instructions below to deploy an SMB Docker container. We're going to deploy a very simple SMB server with 2 users:

Alice -

Bob -

As Windows already has a default SMB server running on port:445 so this is changed to 1445.

Pentaho Data Integration

(Optional) Download the latest jcifs driver.

(Optional) Copy the JCIFS JAR file into Pentaho Data Integration "lib" folder.

Download CIFS driver

Pentaho Data Integration ships with jcifs-1.3.3.jar

If you wish to replace the current driver, rename to: jcifs-1.3.3.jar -> jcifs-1.3.3.jar.bak

jcifs 2.1.40.jar driver has been downloaded to the Workshop--Data-Integration/Drivers

Create SMB Share Directories

Create a SMB folder and copy the required files. Will also add some sample data.

Create directory & copy - PowerShell

cd \
& "C:\Workshop--Data-Integration\Setup\SMB\copy-smb.ps1"

Check the Directory has been created and the files copied over.

Creating Local Windows Users

Let's add our SMB users to the system:

Bob -

Alice -

Right-click on the Start button.
Select Computer Management from the context menu.
Alternatively, press Win + X and select Computer Management.
In Computer Management, expand System Tools.
Click on: Local Users and Groups.
Select Users folder.
Right-click in the Users pane (right side).
Select New User...
Fill in the following details:

Username: bob

Full Name: Bob Smith

Password: password

Click Create
Click Close
Repeat the workflow to create: alice

SMB Share

Right-click on the C:\SMB\Bob folder.
Select Properties.
Click the Sharing tab.
Click Advanced Sharing...

☑ Check "Share this folder"
Share name: Bob (default is fine).

Click Permissions.
Remove "Everyone" if present (select and click Remove).
Click Add...

Add the following user:
- Type bob, click Check Names, click OK

Set permissions for Bob Smith:
- Select Bob Smith: ☑ Full Control

Click OK three times.

In the Sharing tab, note the Network Path: \\[Computer Name]\Bob

From another computer: \\[your-computer-ip]\Bob

Obviously if you want to play around with the SMB shares for Alice & 'Shared' then you will have to repeat the workflow.

Test

Time to test ..

Let's see if we can access C:\SMB\Bob from across the Network.

In the File Explorer or Run command , enter the following UNC path to access C:\SMB\Bob

\\[Computer Name]\Bob or \\localhost\Bob

You should see the following popup message displayed:

PreviousDatabases NextBig Data

Last updated 13 days ago

Was this helpful?

Good morning

Storage

Object Stores

Workshops

Sales Dashboard

Text File Input

Stream Lookup

Calculator

Formula

Add Constants

Get system info

Select Values

Text file output

Inventory Reconciliation - XML + CSV Integration

Get data form XML

Customer 360

Text file input

Sort

Group by

Text file input

Log Parsing and Anomaly Detection

Transactions & Fraud Detection

Financial Data

Data Lake Ingestion

Define Target Schema

Schema Discovery & Analysis

Step 2: Ingest CSV Products

Step 3: Ingest JSON Products

Step 4: Ingest XML Products

Merge Streams

Data Validation

Good morning

hashtagObject Stores

hashtagWorkshops

hashtagSales Dashboard

hashtagText File Input

hashtagStream Lookup

hashtagCalculator

hashtagFormula

hashtagAdd Constants

hashtagGet system info

hashtagSelect Values

hashtagText file output

hashtagInventory Reconciliation - XML + CSV Integration

hashtagGet data form XML

hashtagCustomer 360

hashtagText file input

hashtagSort

hashtagGroup by

hashtagText file input

hashtagLog Parsing and Anomaly Detection

hashtagTransactions & Fraud Detection

hashtagFinancial Data

hashtagData Lake Ingestion

hashtagDefine Target Schema

hashtagSchema Discovery & Analysis

hashtagStep 2: Ingest CSV Products

hashtagStep 3: Ingest JSON Products

hashtagStep 4: Ingest XML Products

hashtagMerge Streams

hashtagData Validation

Object Stores

Workshops

Sales Dashboard

Text File Input

Stream Lookup

Calculator

Formula

Add Constants

Get system info

Select Values

Text file output

Inventory Reconciliation - XML + CSV Integration

Get data form XML

Customer 360

Text file input

Sort

Group by

Text file input

Log Parsing and Anomaly Detection

Transactions & Fraud Detection

Financial Data

Data Lake Ingestion

Define Target Schema

Schema Discovery & Analysis

Step 2: Ingest CSV Products

Step 3: Ingest JSON Products

Step 4: Ingest XML Products

Merge Streams

Data Validation