Storage

Set up object stores and SMB.


Object Stores

Object storage systems like Amazon S3 and MinIO provide a way to store and retrieve large amounts of unstructured data such as files, images, videos, and backups through a simple web-based API. Unlike traditional file systems that organize data in hierarchical folders, object stores use a flat namespace where each piece of data (called an object) is stored in containers called buckets and accessed via unique keys or URLs.
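To make the flat namespace concrete, here is a minimal sketch that models a bucket as a plain mapping from keys to bytes. It is a toy in-memory model, not a client for S3 or MinIO; the bucket name and keys are illustrative, and `list_keys` is a hypothetical helper.

```python
# Toy in-memory model of an object store: one flat mapping per bucket.
# The bucket name and keys are illustrative, not a real MinIO listing.
store = {
    "raw-data": {
        "csv/sales.csv": b"...",
        "csv/products.csv": b"...",
        "json/events.jsonl": b"...",
    }
}


def list_keys(bucket: str, prefix: str = "") -> list[str]:
    """Return the keys in a bucket that start with the given prefix, sorted."""
    return sorted(key for key in store[bucket] if key.startswith(prefix))


# "csv/" is not a directory; it is only the leading part of each key.
print(list_keys("raw-data", prefix="csv/"))
```

What looks like a folder in a bucket browser is a shared key prefix, which is why listing "by folder" is a prefix filter rather than a directory traversal.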

Amazon S3 is AWS's flagship object storage service that offers virtually unlimited scalability, multiple storage classes for different use cases, and integration with other AWS services.

MinIO is an open-source alternative that provides S3-compatible APIs and can be deployed on-premises or in private clouds, making it popular for organizations that want object storage capabilities without vendor lock-in.

Both systems are designed for high durability and availability, handle massive scale, and expose a simple REST API that applications use to store and retrieve data programmatically.
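As a rough illustration of the REST model, the sketch below builds a path-style object URL of the form endpoint/bucket/key. The endpoint `http://localhost:9000` is an assumption (a local MinIO server on its default API port), and `object_url` is a hypothetical helper. Real requests also need signed authentication headers, which this sketch does not cover.

```python
from urllib.parse import quote

# Assumed endpoint: a local MinIO server on its default API port.
ENDPOINT = "http://localhost:9000"


def object_url(bucket: str, key: str, endpoint: str = ENDPOINT) -> str:
    """Build a path-style URL: <endpoint>/<bucket>/<key>."""
    # quote() leaves "/" unescaped by default, so key prefixes stay readable.
    return f"{endpoint}/{quote(bucket)}/{quote(key)}"


# The same URL is combined with different HTTP verbs:
#   PUT    stores an object
#   GET    retrieves it
#   DELETE removes it
print(object_url("raw-data", "csv/sales.csv"))
```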


Prerequisites

  • Ubuntu 24.04 LTS system (physical or virtual machine)

  • User account with sudo privileges

  • Internet connection

  • Basic familiarity with Linux command line
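The operating system prerequisite can be checked from a script. The sketch below parses text in the `os-release` format and tests for Ubuntu 24.04. The helper names are hypothetical and the sample text is illustrative; on a real machine you would pass in the contents of `/etc/os-release`.

```python
def parse_os_release(text: str) -> dict[str, str]:
    """Parse KEY=value lines in the os-release format into a dict."""
    info = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        info[key] = value.strip().strip('"')
    return info


def is_ubuntu_2404(info: dict[str, str]) -> bool:
    """True if the parsed fields describe Ubuntu 24.04."""
    return info.get("ID") == "ubuntu" and info.get("VERSION_ID") == "24.04"


# Illustrative sample; read /etc/os-release on a real system instead.
sample = 'NAME="Ubuntu"\nID=ubuntu\nVERSION_ID="24.04"\n'
print(is_ubuntu_2404(parse_os_release(sample)))
```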


MinIO

Follow the instructions below to set up a MinIO Docker container.

Select your OS, add the sample data, and then configure a VFS connection in Data Integration:

The following steps install and configure MinIO in Docker on Ubuntu 24.04.

  1. Create a MinIO folder and copy the required files.

Create directory & copy

copy-minio.sh

  2. Ensure that all the files have been copied over successfully.

  3. Execute the docker-compose script to create the container.

MinIO Container

run-docker-minio.sh

  4. Check that the container is up and running in Docker.

Check Docker minio container

  5. Log in to MinIO.

Username: minioadmin

Password: minioadmin
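If the login page does not load, it can help to confirm that the server itself is answering. The sketch below polls MinIO's liveness endpoint, `/minio/health/live`. The host `localhost` and port `9000` are assumptions about how the container publishes its API port, so adjust them to match your compose file. The function names are hypothetical.

```python
import urllib.error
import urllib.request


def health_url(host: str = "localhost", port: int = 9000) -> str:
    """URL of MinIO's liveness probe (host and port are assumptions)."""
    return f"http://{host}:{port}/minio/health/live"


def minio_is_live(url: str, timeout: float = 3.0) -> bool:
    """Return True if the endpoint answers with HTTP 200, else False."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.status == 200
    except (urllib.error.URLError, OSError):
        # Connection refused, timeout, or a non-2xx response.
        return False


if __name__ == "__main__":
    url = health_url()
    print(url, "->", "live" if minio_is_live(url) else "not reachable")
```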

If you have completed the MinIO setup, you should have pre-populated buckets containing data objects in various formats.


New Bucket

If you need to create a Bucket:

  1. Click the 'Create Bucket' link.

  2. Enter sales-data as the bucket name and click 'Create Bucket'.

Create Bucket.

  3. Click the Upload button.

Upload sales_data.csv

  4. Upload your data, for example some sales data:


Windows - PowerShell


Linux
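Independent of the operating system, the bucket name chosen earlier (sales-data) has to follow S3-style naming rules. The sketch below checks a simplified subset of those rules and is not the full specification; `is_valid_bucket_name` is a hypothetical helper.

```python
import re

# Simplified subset of the S3 bucket naming rules (not the full specification):
# 3-63 characters; lowercase letters, digits, hyphens, and dots only;
# must start and end with a letter or digit; no consecutive dots.
_BUCKET_RE = re.compile(r"[a-z0-9][a-z0-9.-]{1,61}[a-z0-9]")


def is_valid_bucket_name(name: str) -> bool:
    """Check a candidate bucket name against the simplified rules above."""
    return bool(_BUCKET_RE.fullmatch(name)) and ".." not in name


for candidate in ["sales-data", "Sales_Data", "ab"]:
    print(candidate, is_valid_bucket_name(candidate))
```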


Workshops

| Workshop | Key Skills |
| --- | --- |
| Sales Dashboard | joins, lookups, aggregations |
| Inventory Reconciliation | XML parsing, outer joins, variance |
| Customer 360 | multi-source, JSONL, calculations |
| Clickstream Funnel | sessionization, pivoting |
| Log Parsing | regex, time-series analysis |
| Data Lake Ingestion | schema normalization, validation |

  1. Verify that MinIO is running and populated.

  2. Start Pentaho Data Integration.

Windows - PowerShell:

Linux:


Workshops


Sales Dashboard

Follow the steps to create the transformation:


Text File Input

The Text File Input step reads data from a variety of text-file types. The most commonly used formats are comma-separated values (CSV) files generated by spreadsheets and fixed-width flat files.

The Text File Input step lets you specify a list of files to read, or a list of directories with wildcards in the form of regular expressions. In addition, you can accept filenames from a previous step, making filename handling even more generic.
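Because the wildcard is a regular expression rather than a shell glob, a pattern such as `*.csv` does not behave as expected. The sketch below uses Python's `re` module to show the idea, assuming the pattern must match the whole file name. The file names are hypothetical, and Pentaho evaluates patterns with Java's regex engine, which agrees with Python for simple patterns like this one.

```python
import re

# Hypothetical file names in one directory.
names = ["sales.csv", "sales_2024.csv", "products.csv", "notes.txt", "sales.csv.bak"]

# In a regular expression "." matches any character, so a literal dot is
# written "\."; ".*" plays the role that "*" plays in a shell glob.
pattern = re.compile(r"sales.*\.csv")

# fullmatch() requires the whole name to match, not just a part of it.
matches = [name for name in names if pattern.fullmatch(name)]
print(matches)
```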

Text File Inputs
  1. Drag and drop three Text File Input steps onto the canvas.

  2. Save the transformation as sales_dashboard_etl.ktr in your workshop folder.


Sales (Order Management)

  1. Double-click the first Text File Input (TFI) step and configure it with the following properties:

| Setting | Value |
| --- | --- |
| Step name | Sales |
| Filename | pvfs://Minio/raw-data/csv/sales.csv |
| Delimiter | , |
| Header row present | checked |
| Format | mixed |

Select - sales.csv from VFS connections
  2. Click Get Fields to auto-detect the columns.

Business Logic: Note that sale_amount may differ from price * quantity due to:

  • Volume discounts

  • Promotional pricing

  • Customer-specific pricing tiers

  • Currency conversion (for international sales)
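The gap described above can be quantified per order. The sketch below uses made-up rows; `order_id` and the sample values are hypothetical, and `price` may have to come from the product lookup rather than from the sales file itself.

```python
from decimal import Decimal

# Hypothetical rows; order_id and all values are made up for illustration.
rows = [
    {"order_id": 1, "price": Decimal("20.00"), "quantity": 3, "sale_amount": Decimal("60.00")},
    {"order_id": 2, "price": Decimal("20.00"), "quantity": 5, "sale_amount": Decimal("90.00")},
]


def pricing_gap(row: dict) -> Decimal:
    """List value (price * quantity) minus the recorded sale_amount."""
    return row["price"] * row["quantity"] - row["sale_amount"]


for row in rows:
    print(row["order_id"], pricing_gap(row))
```

A positive gap suggests a discount or promotion was applied. `Decimal` is used here to avoid floating-point rounding in currency arithmetic.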

Get Fields - Sales
  3. Preview the data.

Preview data - Sales
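The Filename value configured for this step follows the pattern pvfs://connection/path. The sketch below splits such a URL into its parts, assuming that for an object-store connection the first path segment is the bucket and the rest is the object key. `PvfsPath` and `parse_pvfs` are hypothetical names, not part of Pentaho.

```python
from typing import NamedTuple


class PvfsPath(NamedTuple):
    connection: str  # VFS connection name as defined in Data Integration
    bucket: str      # first path segment (assumed to be the bucket)
    key: str         # remainder of the path (the object key)


def parse_pvfs(url: str) -> PvfsPath:
    """Split pvfs://<connection>/<bucket>/<key> into its three parts."""
    prefix = "pvfs://"
    if not url.startswith(prefix):
        raise ValueError(f"not a pvfs URL: {url!r}")
    parts = url[len(prefix):].split("/", 2)
    if len(parts) != 3 or not all(parts):
        raise ValueError(f"expected pvfs://<connection>/<bucket>/<key>: {url!r}")
    return PvfsPath(*parts)


print(parse_pvfs("pvfs://Minio/raw-data/csv/sales.csv"))
```

The connection part must match the name of the VFS connection exactly as it was created, so it is worth checking its spelling and capitalization if a file cannot be found.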

Business Significance:

  • sale_amount: Actual revenue (may include discounts)

  • quantity: Volume metrics for demand planning

  • payment_method: Payment preference insights

  • status: Filter out cancelled/refunded orders
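As a small illustration of how these fields work together, the sketch below drops cancelled and refunded orders and then sums `sale_amount` per `payment_method`. The rows are made up; the status value "completed" and the payment method values are assumptions about the data.

```python
from collections import defaultdict
from decimal import Decimal

EXCLUDED = {"cancelled", "refunded"}

# Hypothetical rows; "completed" and the payment_method values are assumptions.
sales = [
    {"sale_amount": Decimal("60.00"), "quantity": 3, "payment_method": "card", "status": "completed"},
    {"sale_amount": Decimal("90.00"), "quantity": 5, "payment_method": "paypal", "status": "refunded"},
    {"sale_amount": Decimal("15.50"), "quantity": 1, "payment_method": "card", "status": "completed"},
]

revenue_by_method: dict[str, Decimal] = defaultdict(Decimal)
for sale in sales:
    if sale["status"].lower() in EXCLUDED:
        continue  # cancelled/refunded orders do not count as revenue
    revenue_by_method[sale["payment_method"]] += sale["sale_amount"]

print(dict(revenue_by_method))
```

In a transformation, the same logic would typically be a filter step followed by a group-by step.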


Products (ERP system)

  1. Double-click the second Text File Input (TFI) step and configure it with the following properties:

| Setting | Value |
| --- | --- |
| Step name | Products |
| Filename | pvfs://Minio/raw-data/csv/products.csv |
| Delimiter | , |
| Header row present | checked |
| Format | mixed |

Select - products.csv from VFS connections
  2. Click Get Fields to auto-detect the columns.

Get Fields - Products
  3. Preview the data.

Preview data - Products

Business Significance:

  • category: Enables product performance analysis by segment

  • price: Base pricing for margin calculations

  • stock_quantity: Inventory turnover insights
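To show how `category` supports analysis by segment, the sketch below looks up each sale's product and totals the units sold per category. All rows are made up, and `product_id` as the join key is an assumption about the files.

```python
from collections import Counter

# Hypothetical data; product_id as the join key is an assumption.
products = {
    "P1": {"category": "Electronics", "price": 20.0, "stock_quantity": 40},
    "P2": {"category": "Home", "price": 8.0, "stock_quantity": 15},
}
sales = [
    {"product_id": "P1", "quantity": 3},
    {"product_id": "P2", "quantity": 2},
    {"product_id": "P1", "quantity": 1},
    {"product_id": "P9", "quantity": 4},  # no matching product
]

units_by_category = Counter()
for sale in sales:
    product = products.get(sale["product_id"])
    # Unmatched rows are grouped under "unknown" rather than dropped silently.
    category = product["category"] if product else "unknown"
    units_by_category[category] += sale["quantity"]

print(dict(units_by_category))
```

Keeping unmatched rows visible makes data-quality problems, such as a sale that references a missing product, easier to spot than an inner join would.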


Customers (CRM System)

  1. Double-click the third Text File Input (TFI) step and configure it with the following properties:

| Setting | Value |
| --- | --- |
| Step name | Customers |
| Filename | pvfs://MinIO/raw-data/csv/customers.csv |
| Delimiter | , |
| Header row present | checked |
| Format | mixed |

Select - customers.csv from VFS connections
  2. Click Get Fields to auto-detect the columns.

Get Fields - Customers
  3. Preview the data.

Preview data - Customers

Business Significance:

  • customer_id: Primary key for joining to sales

  • country: Critical for geographic segmentation

  • status: Identifies churned vs. active customers

  • registration_date: Enables customer tenure analysis
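Customer tenure can be derived from `registration_date` with simple date arithmetic. The sketch below assumes ISO-formatted dates (YYYY-MM-DD); the rows and status values are made up, and `tenure_days` is a hypothetical helper.

```python
from datetime import date

# Hypothetical rows; ISO dates (YYYY-MM-DD) and the status values are assumptions.
customers = [
    {"customer_id": "C1", "country": "DE", "status": "active", "registration_date": "2022-01-15"},
    {"customer_id": "C2", "country": "US", "status": "churned", "registration_date": "2023-06-30"},
]


def tenure_days(registration_date: str, as_of: date) -> int:
    """Days between the registration date and the reference date."""
    return (as_of - date.fromisoformat(registration_date)).days


as_of = date(2024, 1, 15)  # fixed reference date so the result is reproducible
for customer in customers:
    print(customer["customer_id"], tenure_days(customer["registration_date"], as_of))
```

If the file uses a different date format, parse it with `datetime.strptime` and an explicit format string instead of `date.fromisoformat`.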
