Workshop - Read DB Table
Database tables serve as the foundational data source for most enterprise data integration workflows—from transactional systems and operational databases to data warehouses and analytical platforms. Organizations rely on efficient data extraction from relational databases to power their reporting, analytics, and downstream data processing pipelines. Understanding how to read data from database tables using SQL queries is fundamental for building data integration solutions that transform raw database records into actionable business insights.
In this hands-on workshop, you'll learn to use PDI's "Table Input" step to extract data from relational database tables using SQL queries. The warehouse manager at Steel Wheels requires a report highlighting the status of shipped orders, and you'll build a complete transformation that retrieves order data, performs calculations, categorizes results into ranges, and sorts the output for analysis. You'll discover how to write and modify SQL statements, leverage automatic SQL generation, preview query results, and chain multiple transformation steps together to create meaningful business reports from raw database records.
What You'll Accomplish:
Configure the Table Input step to connect to database tables
Write SQL queries to filter and retrieve specific data subsets
Use the "Get SQL select statement" feature to automatically generate SQL
Preview query results before executing the full transformation
Modify SQL statements with WHERE clauses to filter data by status
Integrate Calculator steps to perform computations on database fields
Apply Number Range steps to categorize numeric values into descriptive ranges
Implement Sort Rows steps to organize data for reporting
Build a multi-step transformation that processes database records end-to-end
Understand SQL query optimization and parameterization techniques
By the end of this workshop, you'll have practical experience extracting data from database tables and processing it through a complete transformation pipeline. You'll understand how the Table Input step serves as the entry point for database-driven transformations and how to combine it with other steps to create sophisticated data processing workflows. Rather than exporting data manually or writing standalone SQL scripts, you'll build native PDI solutions that seamlessly integrate database queries with transformation logic, enabling repeatable, automated data processing that delivers consistent business value.
Prerequisites: Understanding of basic transformation concepts and database connection configuration; familiarity with SQL SELECT statements and WHERE clauses; Pentaho Data Integration installed and configured, with the required database connections established
Estimated Time: 20 minutes

Create a new Transformation
Either of these actions opens a new Transformation tab for you to begin designing your transformation.
By clicking File > New > Transformation
By using the Ctrl+N hotkey
Table input
This step is used to read information from a database, using a connection and SQL. Basic SQL statements can be generated automatically by clicking Get SQL select statement.
In this transformation, it connects to the ORDERS table and extracts the rows where the order status is ‘Shipped’.
Start Pentaho Data Integration.
Windows - PowerShell
Linux
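If Spoon is not already running, it can be launched from the data-integration folder of the PDI installation; the exact path is an assumption here and depends on where PDI was unpacked:

    # Windows - PowerShell, from the data-integration folder
    .\Spoon.bat

    # Linux, from the data-integration folder
    ./spoon.sh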
Drag the Table Input step onto the canvas.
Open the Table Input properties dialog box. Ensure the following details are configured, as outlined below:
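A query consistent with this step is sketched below, assuming the Steel Wheels sample database and its standard ORDERS columns:

    -- Columns assume the Steel Wheels sample ORDERS table
    SELECT
        ORDERNUMBER,
        ORDERDATE,
        REQUIREDDATE,
        SHIPPEDDATE,
        STATUS
    FROM ORDERS
    WHERE STATUS = 'Shipped'

Clicking Get SQL select statement generates the basic SELECT for the chosen table; the WHERE clause is then added by hand. If the Replace variables in script? option is enabled, the literal 'Shipped' could instead come from a variable such as ${ORDER_STATUS} (a hypothetical variable name), making the query reusable across environments.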

Preview the results and click OK.
Calculator
The Calculator step provides predefined functions that can be executed on input field values.
The Calculator executes far faster than equivalent logic written in custom scripts (such as the JavaScript step).
Besides the arguments (Field A, Field B and Field C) you must also specify the return type of the function. You can also choose to remove the field from the result (output) after all values are calculated; this is useful for removing temporary values.
Drag the Calculator step onto the canvas.
Open the Calculator properties dialog box.
Ensure the following details are configured, as outlined below:
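A configuration consistent with the calculation described below is sketched here; the field order and value type are assumptions:

    New field   : order_time
    Calculation : Date A - Date B (in days)
    Field A     : SHIPPEDDATE
    Field B     : REQUIREDDATE
    Value type  : Integer

With this setup, a positive order_time means the order shipped after it was required (late), while a negative value means it shipped early.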

Click OK.
Calculates order_time, the number of days between when the order was required (REQUIREDDATE) and when it was shipped (SHIPPEDDATE).
Number range
The Number Range step in Pentaho Data Integration (PDI) maps numeric values to descriptive categories or ranges. This step takes numbers from an input field and assigns text descriptions based on which predefined range they fall into.
To use this step, you add it to your transformation, specify the input numeric field and output text field, then define your ranges with lower and upper bounds along with the corresponding description for each range. For example, you might categorize ages into groups like "Under 18," "18-24," "25-34," and so on.
This functionality is particularly valuable for data preparation, reporting, and visualization as it converts continuous numerical data into discrete, meaningful categories that are easier to analyze and understand. Common applications include creating age brackets, income ranges, temperature classifications, or performance tiers.
Drag the Number range step onto the canvas.
Open the Number ranges properties dialog box.
Ensure the following details are configured, as outlined below:

Sets the value of the output field ‘order_status’ based on ‘order_time’:
On Time: if the order shipped 2 or more days before the REQUIREDDATE.
10%: if the order shipped more than 3 days after the REQUIREDDATE.
20%: if the order shipped more than 4 days after the REQUIREDDATE.
Action: if the order shipped more than 5 days after the REQUIREDDATE.
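Taken together, these rules suggest a Number ranges grid along the following lines. This is a sketch; the exact bounds, and how an order_time between -2 and 3 days is treated, are assumptions:

    Input field  : order_time
    Output field : order_status

    Lower bound   Upper bound   Value
                  -2            On Time
    3             4             10%
    4             5             20%
    5                           Action

Rows whose order_time falls outside every range receive the step's default value.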
Sort rows
The Sort rows step sorts rows based on the fields you specify and on whether they should be sorted in ascending or descending order.
Kettle must sort rows using temporary files when the number of rows exceeds the specified sort size (one million rows by default). If you get an out of memory exception (OOME), lower the sort size or increase the available memory.
When you use multiple copies of the step in parallel (on the local JVM with "Change number of copies to start", or in a clustered environment using Carte), the sorted blocks need to be merged together to preserve the proper sort sequence. This can be done by adding a Sorted Merge step afterwards (on the local JVM without multiple copies to start, or on the cluster's master).
Drag the Sort Rows step onto the canvas.
Open the Sort Rows properties dialog box.
Ensure the following details are configured, as outlined below:
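The sort fields here are an assumption; one plausible choice for this report is to sort by the computed delay so that the latest-shipping orders group together:

    Fieldname : order_time    Ascending : N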

Click OK.
Before you perform any stream operations (such as merges or lookups), it's best practice to sort the rows.
Select values
The Select Values step is useful for selecting, removing, renaming, changing data types and configuring the length and precision of the fields on the stream. These operations are organized into different categories:
Select and Alter
Specify the exact order and name in which the fields should be placed in the output rows.
Remove
Specify the fields that should be removed from the output rows.
Meta-data
Change the name, type, length and precision (the metadata) of one or more fields.
Drag the Select values step onto the canvas.
Open the Select values properties dialog box. Ensure the following details are configured, as outlined below:
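On the Meta-data tab, a configuration consistent with the note below would look like this; the date mask is an assumption:

    Fieldname : REQUIREDDATE    Type : Date    Format : yyyy-MM-dd
    Fieldname : SHIPPEDDATE     Type : Date    Format : yyyy-MM-dd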

Click OK.
Formats the REQUIREDDATE and SHIPPEDDATE fields.
Run
This transformation introduces several new steps that can help manipulate the data with predefined functions.
Click the Run button in the Canvas Toolbar.
Click the Preview data tab for the Select values step.
