PDI to Jupyter Notebook
Workshop - PDI to Jupyter Notebook
This workshop demonstrates how to create a Pentaho Data Integration (PDI) pipeline that processes sales data and automatically triggers analysis in Jupyter Notebook when the output file is saved.
The topics we're going to cover:
Creating a Jupyter Notebook
Installing required Python packages: jupyter, watchdog, xlsxwriter
Creating a PDI pipeline: sales_data.csv file
Creating a File Watcher script

Quick overview of the pipeline:
Execute a PDI pipeline with the sample sales_data.csv from the datasets folder
The file output to the pdi-output folder triggers the Jupyter Notebook to:
Load the data (CSV files from pdi-output), then analyze and visualize the results
Export the results to the reports folder


Create a new Transformation
Any one of the following actions opens a new Transformation tab for you to begin designing your transformation:
By clicking File > New > Transformation
By using the CTRL-N hot key
Quick Setup
To check that the various scripts and volume mappings are working, let's analyze a sample sales_data.csv:
Install some Python packages
Load a sample dataset - test_sales_data.csv
Run the sales_analysis.ipynb - check container paths
Check the output
Please ensure you have completed the following setup: Jupyter Notebook.
Remember, the Jupyter Notebook is running in a Docker container!
To list / install Python packages:
Once inside the container:
Install the required Python packages (jupyter, watchdog, xlsxwriter):
Check for the test_sales_data.csv & sales_analysis.ipynb (still in container):
Open the sales_analysis.ipynb notebook and RUN each section:

Check for reports: C:\Jupyter-Notebook\reports\sales_analysis_timestamp.xlsx

Check you have 2 sheets: Summary & Detailed Data.
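
For orientation, here is a minimal sketch of the kind of load / aggregate / export step the notebook performs. The container paths, column names, and the exact report filename are assumptions, not the workshop's actual notebook code:

```python
import glob
from datetime import datetime

import pandas as pd

# Load the CSV files produced by PDI (container path is an assumption).
files = glob.glob("/home/jovyan/pdi-output/*.csv")
df = pd.concat((pd.read_csv(f) for f in files), ignore_index=True)

# A simple aggregation for the Summary sheet (column names are assumptions).
summary = df.groupby("product_category", as_index=False)["sales"].sum()

# Export a timestamped workbook with the two sheets checked above.
timestamp = datetime.now().strftime("%Y%m%d_%H%M%S")
out_path = f"/home/jovyan/reports/sales_analysis_{timestamp}.xlsx"
with pd.ExcelWriter(out_path, engine="xlsxwriter") as writer:
    summary.to_excel(writer, sheet_name="Summary", index=False)
    df.to_excel(writer, sheet_name="Detailed Data", index=False)
```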
File Watcher
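A minimal sketch of what the File Watcher script could look like, built on the watchdog package. The watched directory, notebook path, and the nbconvert-based trigger are assumptions; adapt them to your container's volume mappings:

```python
import subprocess
import time

from watchdog.events import PatternMatchingEventHandler
from watchdog.observers import Observer

WATCH_DIR = "/home/jovyan/pdi-output"            # assumed container path
NOTEBOOK = "/home/jovyan/sales_analysis.ipynb"   # assumed notebook location


class SalesFileHandler(PatternMatchingEventHandler):
    def __init__(self):
        # Only react to the files the PDI pipeline writes.
        super().__init__(patterns=["sales_detailed_*.csv"], ignore_directories=True)

    def on_created(self, event):
        print(f"New file detected: {event.src_path}")
        # One way to trigger the analysis: execute the notebook in place.
        subprocess.run(
            ["jupyter", "nbconvert", "--to", "notebook", "--execute",
             "--inplace", NOTEBOOK],
            check=True,
        )


if __name__ == "__main__":
    observer = Observer()
    observer.schedule(SalesFileHandler(), WATCH_DIR, recursive=False)
    observer.start()
    try:
        while True:
            time.sleep(1)  # keep the watcher running
    except KeyboardInterrupt:
        observer.stop()
    observer.join()
```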
Data Pipeline
The data scientists have deployed the sales_analysis.ipynb notebook. The notebook will be triggered by a File Watcher that's polling the C:\Jupyter-Notebook\pdi-output folder for:
sales_detailed_*.csv
So in this part of the workshop, we're going to create a simple pipeline that:
Loads the sales_data.csv
Cleans the data and performs some calculations and aggregations
Outputs to the C:\Jupyter-Notebook\pdi-output folder.
Start Pentaho Data Integration.
Windows - PowerShell: run Spoon.bat from the PDI installation directory
Linux: run spoon.sh from the PDI installation directory
Create a New Transformation:
CSV File input
The CSV File Input transform extracts data from delimited files using either a predefined schema or manually configured field layouts. Despite its name, this transform supports any delimiter—pipes, tabs, semicolons, or custom separators—not just commas.
Built for speed through optimized internal processing, this transform offers a focused subset of Text File Input capabilities with three key performance advantages:
Native I/O (NIO) uses direct system calls for faster file reading, though it's currently limited to local files without VFS support.
Parallel Processing enables distributed file reading when running multiple transform copies or in clustered mode. Each copy processes a separate file block, allowing workload distribution across multiple threads or slave nodes.
Lazy Conversion optimizes performance for pass-through data scenarios. When fields flow unchanged from input to output (like file-to-database transfers), this feature prevents unnecessary data type conversions, avoiding the overhead of converting raw data into strings, dates, or numbers.
While this transform has fewer configuration options than the general Text File Input transform, these performance optimizations make it ideal for high-throughput data processing workflows.
Drag & drop a CSV File input step onto the canvas.
Double-click on the step, and configure the following properties:
Data Validator
The Data Validator transform enables you to define validation rules that check input data across different fields in each row. When the validator encounters a row that violates one or more validation rules, it generates an error or exception.
You can capture all validation errors by configuring an error handling hop from this transform, which will provide you with a comprehensive list of any validation failures that occur during processing.
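
As a conceptual analogy (plain Python, not the PDI transform itself), the pattern looks like this: check each row against a set of rules and collect the failures instead of stopping at the first one, much like routing bad rows down an error-handling hop. The field names and rules are invented for illustration:

```python
# Hypothetical rows and rules, for illustration only.
rows = [
    {"product": "Widget", "quantity": 5, "price": 9.99},
    {"product": "", "quantity": -2, "price": 4.50},
]

def validate(row):
    errors = []
    if not row["product"]:
        errors.append("product must not be empty")
    if row["quantity"] <= 0:
        errors.append("quantity must be positive")
    return errors

valid_rows, rejected_rows = [], []
for row in rows:
    problems = validate(row)
    if problems:
        rejected_rows.append((row, problems))  # like rows sent down the error hop
    else:
        valid_rows.append(row)

print(f"{len(valid_rows)} valid, {len(rejected_rows)} rejected")
```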
Replace in String
The Replace In String transform performs simple search and replace operations on String fields.
It also supports regular expressions and group references. Group references are picked up in the replacement string as $n, where n is the number of the group.
You can also use hexadecimal replacements for special characters.
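
As a quick illustration of group references in general (Python's re module here, not PDI syntax, which uses $n in the replacement string):

```python
import re

# Conceptual illustration only: PDI references captured groups as $1, $2, ...
# Python's re.sub uses \1 or \g<1> for the same group-reference idea.
phone = "Tel: 555-1234"

# Capture the two number groups and rewrite them in a new layout.
result = re.sub(r"(\d{3})-(\d{4})", r"(\1) \2", phone)
print(result)  # Tel: (555) 1234
```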
Calculator
The Calculator transform offers built-in functions that operate on your input field values.
Beyond the function arguments (Field A, Field B, and Field C), you need to define the return type for each function.
After calculations are complete, you can exclude specific fields from the output using the Remove option. This feature helps eliminate temporary values that aren't needed in your final pipeline.
The Calculator delivers significantly faster execution compared to custom JavaScript scripts.
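
A rough analogy in pandas (not the Calculator transform itself, and the field names and tax rate are invented): derive new values from existing fields, then drop the temporary ones, which is what the Remove option does:

```python
import pandas as pd

# Hypothetical input fields, for illustration only.
df = pd.DataFrame({"quantity": [3, 5], "unit_price": [9.99, 4.50]})

df["gross"] = df["quantity"] * df["unit_price"]  # Field A * Field B
df["tax"] = df["gross"] * 0.2                    # temporary value (assumed rate)
df["total"] = df["gross"] + df["tax"]
df = df.drop(columns=["tax"])                    # like the Remove option
print(df)
```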
Group by
The Group By transform organizes rows from a data source according to one or more specified fields, creating a single row for each distinct group. Additionally, it can compute aggregate values like sums, averages, or counts for each group.
Typical applications include determining average sales figures by product category or tallying inventory quantities for each item type.
This step requires sorted input data to function properly. When working with unsorted data, only identical consecutive rows will be grouped together correctly. Furthermore, if data is sorted externally before entering the Transformation, differences in case sensitivity within the grouping fields may lead to unexpected results.
For scenarios involving unsorted input data, consider using the Memory Group By transform instead, which can handle data regardless of its initial order.
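
Python's itertools.groupby makes the sorted-input requirement easy to see, because it behaves the same way: only consecutive identical keys end up in the same group (an analogy, not the PDI transform):

```python
from itertools import groupby

# Unsorted input: identical keys that are not adjacent land in separate groups.
unsorted_rows = ["A", "B", "A", "B"]
print([(key, len(list(rows))) for key, rows in groupby(unsorted_rows)])
# [('A', 1), ('B', 1), ('A', 1), ('B', 1)]

# Sorting first gives one group per distinct key, which is usually what you want.
sorted_rows = sorted(unsorted_rows)
print([(key, len(list(rows))) for key, rows in groupby(sorted_rows)])
# [('A', 2), ('B', 2)]
```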
Text file output
The Text File Output transform exports data to text file formats, most commonly generating CSV files that can be opened in spreadsheet applications like Excel.
This transform also supports creating fixed-width files by specifying field lengths in the fields configuration tab. You have two options for defining the output structure: use an existing Schema Definition or manually configure the field layout. When working with a Schema Definition, pair this transform with the Schema Mapping transform to align your incoming data stream with the chosen schema structure.
