Storage

Set up Object Stores & SMB ..

Object Stores

Object storage systems like Amazon S3 and MinIO provide a way to store and retrieve large amounts of unstructured data such as files, images, videos, and backups through a simple web-based API. Unlike traditional file systems that organize data in hierarchical folders, object stores use a flat namespace where each piece of data (called an object) is stored in containers called buckets and accessed via unique keys or URLs.

Amazon S3 is AWS's flagship object storage service that offers virtually unlimited scalability, multiple storage classes for different use cases, and integration with other AWS services.

MinIO is an open-source alternative that provides S3-compatible APIs and can be deployed on-premises or in private clouds, making it popular for organizations that want object storage capabilities without vendor lock-in.

Both systems are designed for high durability and availability, and can handle massive scale while providing simple REST API access so that applications can store and retrieve data programmatically.
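The flat namespace described above can be illustrated with a toy in-memory model. This is only a sketch of the concept, not how S3 or MinIO store data; the keys and contents are hypothetical.

```python
# Toy in-memory model of a bucket: a flat mapping from key to bytes.
# This only illustrates the concept; it is not how S3 or MinIO store data.
bucket: dict[str, bytes] = {}


def put_object(key: str, data: bytes) -> None:
    """Store an object under its full key (slashes are just characters)."""
    bucket[key] = data


def list_objects(prefix: str = "") -> list[str]:
    """Return keys starting with the prefix, which is how 'folders' are emulated."""
    return sorted(k for k in bucket if k.startswith(prefix))


# Hypothetical keys and contents.
put_object("csv/sales.csv", b"sample sales bytes")
put_object("csv/products.csv", b"sample product bytes")
put_object("logs/app.log", b"sample log bytes")

print(list_objects("csv/"))  # ['csv/products.csv', 'csv/sales.csv']
```

There is no real directory called csv here: listing by prefix simply filters keys, which is why object stores can present a folder-like view on top of a flat namespace.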


Prerequisites

  • Ubuntu 24.04 LTS system (physical or virtual machine)

  • User account with sudo privileges

  • Internet connection

  • Basic familiarity with Linux command line

MinIO

Follow the instructions below to set up a MinIO Docker container.

Select your OS, add the sample data, and then configure a VFS connection in Data Integration.

The steps below install and configure MinIO in Docker on Ubuntu 24.04.

  1. Create a MinIO folder and copy the required files.

Create directory & copy

copy-minio.sh
  2. Ensure all the files have been copied over successfully.

  3. Execute the docker-compose script to create the container.

MinIO Container

run-docker-minio.sh
  4. Check that the container is up and running in Docker.

Check Docker minio container

  5. Log in to MinIO.

Username: minioadmin

Password: minioadmin
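Besides checking the container in Docker, a running MinIO server can also be probed over HTTP. The sketch below calls MinIO's liveness endpoint; the base URL http://localhost:9000 is an assumption (the default API port), so adjust it to whatever your docker-compose file maps.

```python
import urllib.error
import urllib.request


def is_minio_live(base_url: str = "http://localhost:9000", timeout: float = 3.0) -> bool:
    """Return True if the MinIO liveness endpoint answers with HTTP 200."""
    url = base_url.rstrip("/") + "/minio/health/live"
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.status == 200
    except (urllib.error.URLError, OSError):
        # Connection refused, timeout, DNS failure, or a non-2xx response.
        return False


if __name__ == "__main__":
    # Assumed address; change it if your container publishes a different port.
    print("MinIO is live" if is_minio_live() else "MinIO is not reachable")
```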

If you have completed the MinIO setup, you should have pre-populated buckets containing data objects in a variety of formats.


New Bucket

If you need to create a Bucket:

  1. Click the 'Create Bucket' link.

  2. Enter: sales-data and click 'Create Bucket'.

Create Bucket.
  3. Click on the Upload button.

Upload sales_data.csv
  4. Upload your data, for example some sales data:

Windows - PowerShell

Linux
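Bucket names such as sales-data have to follow the S3 naming rules, which S3-compatible stores generally enforce as well. The check below is a minimal sketch covering only a simplified subset of those rules (length, allowed characters, first and last character); the function name is hypothetical.

```python
import re

# Simplified subset of the S3 bucket naming rules: 3-63 characters,
# lowercase letters, digits, and hyphens, starting and ending with a
# letter or digit. Dots and the other special cases are left out here.
BUCKET_NAME = re.compile(r"[a-z0-9][a-z0-9-]{1,61}[a-z0-9]")


def is_valid_bucket_name(name: str) -> bool:
    """Return True if the name passes the simplified checks above."""
    return BUCKET_NAME.fullmatch(name) is not None


for candidate in ["sales-data", "Sales_Data", "ab"]:
    print(candidate, is_valid_bucket_name(candidate))
```

Here sales-data passes, while Sales_Data fails because of the uppercase letters and the underscore, and ab fails because it is too short.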

Workshops

Workshop                    Key Skills
Sales Dashboard             joins, lookups, aggregations
Inventory Reconciliation    XML parsing, outer joins, variance
Customer 360                multi-source, JSONL, calculations
Clickstream Funnel          sessionization, pivoting
Log Parsing                 regex, time-series analysis
Data Lake Ingestion         schema normalization, validation

  1. Verify that MinIO is running and populated.

  2. Start Pentaho Data Integration.

Windows - PowerShell:

Linux:


Workshops

Sales Dashboard


Follow the steps to create the transformation:

Text File Input

The Text File Input step is used to read data from a variety of different text-file types. The most commonly used formats include Comma Separated Values (CSV files) generated by spreadsheets and fixed width flat files.

The Text File Input step lets you specify a list of files to read, or a list of directories with wildcards in the form of regular expressions. In addition, you can accept filenames from a previous step, making filename handling even more generic.
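Because the wildcard is a regular expression rather than a shell glob, a pattern like .*\.csv is needed where a glob would use *.csv. The sketch below shows the idea with hypothetical file names, assuming the whole file name must match the pattern.

```python
import re

# Hypothetical file names; the pattern is a regular expression, not a shell glob.
filenames = ["sales.csv", "products.csv", "customers.csv", "readme.txt", "sales.csv.bak"]
pattern = re.compile(r".*\.csv")

# fullmatch requires the whole name to match, so "sales.csv.bak" is excluded.
selected = [name for name in filenames if pattern.fullmatch(name)]
print(selected)  # ['sales.csv', 'products.csv', 'customers.csv']
```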

Text File Inputs
  1. Drag & drop 3 Text File Input Steps onto the canvas.

  2. Save transformation as: sales_dashboard_etl.ktr in your workshop folder.


Sales (Order Management)

  1. Double-click on the first TFI step, and configure with the following properties:

Setting               Value
Step name             Sales
Filename              pvfs://Minio/raw-data/csv/sales.csv
Delimiter             ,
Header row present
Format                mixed

Select - sales.csv from VFS connections
  2. Click: Get Fields to auto-detect columns.

Business Logic: Note that sale_amount may differ from price * quantity due to:

  • Volume discounts

  • Promotional pricing

  • Customer-specific pricing tiers

  • Currency conversion (for international sales)

Get Fields - Sales
  3. Preview data.

Preview data - Sales
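The Business Logic note above can be made concrete with a small calculation. This is a sketch with hypothetical numbers: it computes how far the invoiced sale_amount falls below the list value price * quantity, using Decimal to avoid floating-point rounding on currency values.

```python
from decimal import Decimal


def discount_amount(price: Decimal, quantity: int, sale_amount: Decimal) -> Decimal:
    """Difference between list value (price * quantity) and what was actually charged."""
    return price * quantity - sale_amount


# Hypothetical order: 3 units at a list price of 19.99, invoiced at 53.97.
list_price = Decimal("19.99")
gap = discount_amount(list_price, 3, Decimal("53.97"))
print(gap)  # 6.00
```

A non-zero gap does not indicate bad data by itself; it may simply reflect one of the pricing reasons listed above.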

Business Significance:

  • sale_amount: Actual revenue (may include discounts)

  • quantity: Volume metrics for demand planning

  • payment_method: Payment preference insights

  • status: Filter out cancelled/refunded orders
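As a rough illustration of how these fields are used, the sketch below filters out cancelled and refunded orders before totalling revenue and units. The rows and the exact status values are hypothetical; only the column names come from the list above.

```python
# Hypothetical rows; the status values are assumptions, only the column
# names come from the text.
sales = [
    {"sale_amount": 120.0, "quantity": 2, "payment_method": "card", "status": "completed"},
    {"sale_amount": 75.5, "quantity": 1, "payment_method": "paypal", "status": "cancelled"},
    {"sale_amount": 40.0, "quantity": 4, "payment_method": "card", "status": "refunded"},
    {"sale_amount": 60.0, "quantity": 3, "payment_method": "cash", "status": "completed"},
]

EXCLUDED = {"cancelled", "refunded"}

# Keep only orders that should count towards revenue.
valid = [row for row in sales if row["status"].lower() not in EXCLUDED]

revenue = sum(row["sale_amount"] for row in valid)
units = sum(row["quantity"] for row in valid)
print(revenue, units)  # 180.0 5
```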


Products (ERP system)

  1. Double-click on the second TFI step, and configure with the following properties:

Setting               Value
Step name             Products
Filename              pvfs://Minio/raw-data/csv/products.csv
Delimiter             ,
Header row present
Format                mixed

Select - products.csv from VFS connections
  2. Click: Get Fields to auto-detect columns.

Get Fields - Products
  3. Preview the data.

Preview data - Products
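The Filename values used in these steps, such as pvfs://Minio/raw-data/csv/products.csv, can be read as a VFS connection name followed by a path. The sketch below splits such a URL; treating the first path segment as the bucket and the rest as the object key is an assumption about how the connection maps onto MinIO, and the function name is hypothetical.

```python
from urllib.parse import urlsplit


def split_pvfs_path(url: str) -> tuple[str, str, str]:
    """Split a pvfs:// URL into (connection, bucket, object key)."""
    parts = urlsplit(url)
    if parts.scheme != "pvfs":
        raise ValueError(f"not a pvfs URL: {url}")
    # Assumption: the first path segment is the bucket, the rest is the key.
    bucket, _, key = parts.path.lstrip("/").partition("/")
    return parts.netloc, bucket, key


print(split_pvfs_path("pvfs://Minio/raw-data/csv/products.csv"))
# ('Minio', 'raw-data', 'csv/products.csv')
```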

Business Significance:

  • category: Enables product performance analysis by segment

  • price: Base pricing for margin calculations

  • stock_quantity: Inventory turnover insights
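To show why category matters, here is a minimal sketch of the lookup and aggregation pattern: each sale is matched to its product, and revenue is summed per category. The data is hypothetical, and product_id is an assumed join key that the text does not confirm.

```python
from collections import defaultdict

# Hypothetical data; product_id is an assumed join key, not confirmed by the text.
products = {
    "P1": {"category": "Electronics", "price": 200.0},
    "P2": {"category": "Books", "price": 15.0},
}
sales = [
    {"product_id": "P1", "sale_amount": 380.0},
    {"product_id": "P2", "sale_amount": 30.0},
    {"product_id": "P1", "sale_amount": 200.0},
]

revenue_by_category: dict[str, float] = defaultdict(float)
for sale in sales:
    category = products[sale["product_id"]]["category"]  # lookup
    revenue_by_category[category] += sale["sale_amount"]  # aggregation

print(dict(revenue_by_category))  # {'Electronics': 580.0, 'Books': 30.0}
```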


Customers (CRM System)

  1. Double-click on the third TFI step, and configure with the following properties:

Setting               Value
Step name             Customers
Filename              pvfs://MinIO/raw-data/csv/customers.csv
Delimiter             ,
Header row present
Format                mixed

Select - customers.csv from VFS connections
  2. Click: Get Fields to auto-detect columns.

Get Fields - Customers
  3. Preview the data.

Preview data - Customers

Business Significance:

  • customer_id: Primary key for joining to sales

  • country: Critical for geographic segmentation

  • status: Identifies churned vs. active customers

  • registration_date: Enables customer tenure analysis
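The role of customer_id and registration_date can be sketched as a simple keyed lookup plus a date difference. All records below are hypothetical, as are the status values and the as-of date; sales without a matching customer are reported rather than dropped silently.

```python
from datetime import date

# Hypothetical records; the status values and dates are assumptions.
customers = {
    "C1": {"country": "DE", "status": "active", "registration_date": date(2022, 1, 15)},
    "C2": {"country": "US", "status": "churned", "registration_date": date(2023, 6, 1)},
}
sales = [
    {"customer_id": "C1", "sale_amount": 120.0},
    {"customer_id": "C3", "sale_amount": 45.0},  # no matching customer
]


def tenure_days(registered: date, as_of: date) -> int:
    """Number of days between registration and the reporting date."""
    return (as_of - registered).days


as_of = date(2024, 1, 15)  # assumed reporting date
for sale in sales:
    customer = customers.get(sale["customer_id"])  # lookup on the primary key
    if customer is None:
        print(sale["customer_id"], "unmatched")
        continue
    print(sale["customer_id"], customer["country"],
          tenure_days(customer["registration_date"], as_of))
```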
