PDI to Jupyter Notebook

Workshop - PDI to Jupyter Notebook

This workshop demonstrates how to create a Pentaho Data Integration (PDI) pipeline that processes sales data and automatically triggers analysis in Jupyter Notebook when the output file is saved.

The topics we're going to cover:

Creating a Jupyter Notebook
Installing required Python packages: jupyter, watchdog, xlsxwriter
Create a PDI pipeline: sales_data.csv file
Create a File Watcher script

Quick overview of the pipeline:

Execute a PDI pipeline with the sample sales_data.csv from the datasets folder
The file written to the pdi-output folder triggers the Jupyter Notebook
The notebook loads the CSV files from pdi-output, then analyzes and visualizes the results
The results are exported to the reports folder

Create a new Transformation

Either of these actions opens a new Transformation tab where you can begin designing your transformation:

Click File > New > Transformation
Press the CTRL-N hot key

Quick Setup

To check that the various scripts and the volume mappings are working, let's analyze a sample dataset:

Install some Python packages
Load a sample dataset - test_sales_data.csv
Run the sales_analysis.ipynb - check container paths
Check output

Please ensure you have completed the following setup: Jupyter Notebook.

Remember: the Jupyter Notebook is running in a Docker container!

To list or install Python packages:

cd \
docker exec -it jupyter-datascience /bin/bash

Once inside the container:

pip list - will list the installed packages

Install required Python packages:

cd \
docker exec -it jupyter-datascience bash
pip install jupyter watchdog xlsxwriter
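
As an alternative to reading through the pip list output, the check can be scripted from inside the container. The following is a minimal sketch, assuming each package's import name matches its pip name (this holds for watchdog and xlsxwriter; for jupyter it is an assumption). The function name missing_packages is illustrative and not part of the workshop files.

```python
import importlib.util

# Packages the workshop installs with pip.
# Assumption: the import name is the same as the pip name.
REQUIRED = ["jupyter", "watchdog", "xlsxwriter"]


def missing_packages(names):
    """Return the names that cannot be located as importable modules."""
    return [name for name in names if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    missing = missing_packages(REQUIRED)
    if missing:
        print("Missing packages:", ", ".join(missing))
    else:
        print("All required packages are installed.")
```

If anything is reported as missing, rerun the pip install command above inside the container.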

Check for the test_sales_data.csv & sales_analysis.ipynb (still in container):

cd
cd /home/jovyan/datasets
ls

cd
cd /home/jovyan/notebooks
ls

Open the sales_analysis.ipynb notebook and RUN each section:

Check for reports: C:\Jupyter-Notebook\reports\sales_analysis_timestamp.xlsx

Check you have 2 sheets: Summary & Detailed Data.
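
The sheet check can also be done without opening Excel. The sketch below relies only on the fact that an .xlsx file is a zip archive whose xl/workbook.xml lists the sheet names; it uses the Python standard library, so no extra packages are needed. The function names sheet_names and has_expected_sheets are hypothetical, and the path you pass in is whatever report file was actually generated.

```python
import xml.etree.ElementTree as ET
import zipfile

# Namespace used by the workbook part of an .xlsx file.
NS = {"m": "http://schemas.openxmlformats.org/spreadsheetml/2006/main"}

# Sheet names the workshop expects in the report.
EXPECTED = ["Summary", "Detailed Data"]


def sheet_names(xlsx_path):
    """Return the sheet names of a workbook, in workbook order."""
    with zipfile.ZipFile(xlsx_path) as zf:
        root = ET.fromstring(zf.read("xl/workbook.xml"))
    return [sheet.get("name") for sheet in root.findall("m:sheets/m:sheet", NS)]


def has_expected_sheets(xlsx_path, expected=EXPECTED):
    """True if every expected sheet name is present in the workbook."""
    return set(expected) <= set(sheet_names(xlsx_path))
```

Calling has_expected_sheets with the path of the generated report should return True when both sheets are present.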

Data Pipeline

The data scientists have deployed the sales_analysis.ipynb notebook. The notebook is triggered by a File Watcher that polls the C:\Jupyter-Notebook\pdi-output folder for files matching:

sales_detailed_*.csv
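
To make the polling idea concrete, here is a minimal standard-library sketch of a loop that looks for new files matching that pattern. It is not the workshop's File Watcher script (the workshop installs watchdog for that); the names find_new_files, watch, on_new, interval and max_polls are all hypothetical.

```python
import fnmatch
import time
from pathlib import Path

PATTERN = "sales_detailed_*.csv"


def find_new_files(folder, seen, pattern=PATTERN):
    """Return matching files not reported before, and remember them in `seen`."""
    new_files = []
    for path in sorted(Path(folder).iterdir()):
        if path.is_file() and fnmatch.fnmatch(path.name, pattern) and path.name not in seen:
            seen.add(path.name)
            new_files.append(path)
    return new_files


def watch(folder, on_new, interval=5.0, max_polls=None):
    """Poll `folder` and call `on_new(path)` once for each new matching file.

    `max_polls=None` polls forever; a number stops after that many polls.
    """
    seen = set()
    polls = 0
    while max_polls is None or polls < max_polls:
        for path in find_new_files(folder, seen):
            on_new(path)
        polls += 1
        time.sleep(interval)
```

For example, watch(folder, print) would print each new sales_detailed_*.csv file as it appears. A polling loop like this trades a short detection delay (up to interval seconds) for simplicity.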

So in this part of the workshop, we're going to create a simple pipeline that:

Loads the sales_data.csv
Cleans and performs some calculations and aggregations
Outputs to the C:\Jupyter-Notebook\pdi-output folder.

Start Pentaho Data Integration.

Windows - PowerShell

Set-Location C:\Pentaho\design-tools\data-integration
.\spoon.bat

Linux

cd
cd ~/Pentaho/design-tools/data-integration
./spoon.sh

Create a New Transformation:

CSV File input

The CSV File Input transform extracts data from delimited files using either a predefined schema or manually configured field layouts. Despite its name, this transform supports any delimiter—pipes, tabs, semicolons, or custom separators—not just commas.

Built for speed through optimized internal processing, this transform offers a focused subset of Text File Input capabilities with three key performance advantages:

Native I/O (NIO) uses direct system calls for faster file reading, though it's currently limited to local files without VFS support.

Parallel Processing enables distributed file reading when running multiple transform copies or in clustered mode. Each copy processes a separate file block, allowing workload distribution across multiple threads or slave nodes.

Lazy Conversion optimizes performance for pass-through data scenarios. When fields flow unchanged from input to output (like file-to-database transfers), this feature prevents unnecessary data type conversions, avoiding the overhead of converting raw data into strings, dates, or numbers.

While this transform has fewer configuration options than the general Text File Input transform, these performance optimizations make it ideal for high-throughput data processing workflows.
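
The block-based reading described under Parallel Processing can be illustrated with a short sketch. This is not PDI's implementation; it only shows one common way to divide a delimited file into byte ranges so that every line is read by exactly one reader. The function names are hypothetical, the blocks are read one after another here rather than in separate threads, and the sketch assumes no field contains an embedded newline.

```python
import os


def block_ranges(file_size, copies):
    """Split [0, file_size) into `copies` contiguous byte ranges."""
    step = -(-file_size // copies)  # ceiling division
    return [
        (min(i * step, file_size), min((i + 1) * step, file_size))
        for i in range(copies)
    ]


def read_block(path, start, end):
    """Return the lines whose first byte lies in [start, end)."""
    lines = []
    with open(path, "rb") as f:
        if start > 0:
            # Step back one byte and discard up to the next newline, so that
            # reading resumes at the first line beginning at or after `start`.
            f.seek(start - 1)
            f.readline()
        # A line that starts before `end` is read in full, even if it runs
        # past `end`; the next block skips it via the seek/readline above.
        while f.tell() < end:
            line = f.readline()
            if not line:
                break
            lines.append(line.rstrip(b"\r\n"))
    return lines


def read_in_blocks(path, copies):
    """Read the file as `copies` blocks; each line lands in exactly one block."""
    size = os.path.getsize(path)
    return [read_block(path, start, end) for start, end in block_ranges(size, copies)]
```

Concatenating the blocks in order gives back the original lines, which is why each block can be handed to a separate transform copy without coordination between them.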

Drag & drop a CSV File input step onto the canvas.
Double-click on the step, and configure the following properties:
