Run Configurations

Execute jobs and transformations on specific nodes or in a Pentaho cluster.

Pentaho Data Integration provides advanced clustering and partitioning capabilities that allow organizations to scale out their data integration deployments.

In this guided demonstration, you will:

• Configure Master & Slave Nodes

• Execute run configurations

Let's start scaling out by adding some servers (nodes).

These can be defined as either:

• Master: the node responsible for distributing work among the worker nodes and for ensuring high availability and scalability of the system.

• Slave (Worker): an instance that executes Pentaho work items, such as PDI jobs and transformations, with parallel processing, dynamic scalability, load balancing, and dependency management in a clustered environment.

Master Node

You can start the Carte instances individually, or execute the following commands to start all 3 at the same time.

cd
cd ~/Scripts
./start_carte.sh
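
The contents of start_carte.sh are not shown in this guide. As a rough idea of what such a script might do, here is a minimal sketch that starts the 3 Carte instances in the background. It is not the actual script: the PDI_HOME variable, the function names and the /tmp log file locations are assumptions made for illustration, while the install path and ports are the ones used in the manual steps.

```shell
#!/bin/sh
# Hypothetical stand-in for a script such as start_carte.sh.
# The real script's contents are not shown in this guide.

# Assumed install location, taken from the manual steps in this guide.
PDI_HOME="${PDI_HOME:-$HOME/Pentaho/design-tools/data-integration}"

# Ports used in this guide: master, Slave A, Slave B.
CARTE_PORTS="12000 12100 12200"

# Print the commands that would be run, without starting anything.
list_cmds() {
    for port in $CARTE_PORTS; do
        echo "sh carte.sh localhost $port"
    done
}

# Start every instance in the background, one log file per port.
start_all() {
    cd "$PDI_HOME" || return 1
    for port in $CARTE_PORTS; do
        nohup sh carte.sh localhost "$port" > "/tmp/carte_$port.log" 2>&1 &
        echo "started Carte on port $port (pid $!)"
    done
}

# Only launch when asked to, so the file can also be sourced safely.
if [ "${1:-}" = "start" ]; then
    start_all
fi
```

Saved under a name of your choice and run with the argument start, this launches all 3 nodes. Note that with this approach the Carte output goes to the log files rather than to 3 separate terminals.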

To start only the master node, execute the following commands in a terminal.

cd
cd ~/Pentaho/design-tools/data-integration
sh carte.sh localhost 12000

Slave Nodes

In a new terminal, execute the following commands (Slave A).

cd
cd ~/Pentaho/design-tools/data-integration
sh carte.sh localhost 12100

In a new terminal, execute the following commands (Slave B).

cd
cd ~/Pentaho/design-tools/data-integration
sh carte.sh localhost 12200

You should now have 3 terminals, each running a Carte instance.
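
If you want to confirm that each node is answering before moving on, you can query the status page that Carte serves under /kettle/status/. The sketch below assumes the instances still use Carte's default credentials (cluster / cluster) and that curl is installed; the function names are made up for this example.

```shell
#!/bin/sh
# Assumption: Carte still uses its default credentials (cluster/cluster).
CARTE_USER="${CARTE_USER:-cluster}"
CARTE_PASS="${CARTE_PASS:-cluster}"

# Build the status page URL for one Carte instance (host, port).
carte_status_url() {
    printf 'http://%s:%s/kettle/status/?xml=Y\n' "$1" "$2"
}

# Report whether each of the 3 nodes answers on its status page.
check_nodes() {
    for port in 12000 12100 12200; do
        if curl -s -f -m 5 -u "$CARTE_USER:$CARTE_PASS" \
                "$(carte_status_url localhost "$port")" > /dev/null; then
            echo "port $port: up"
        else
            echo "port $port: not responding"
        fi
    done
}

# Only contact the servers when asked to.
if [ "${1:-}" = "check" ]; then
    check_nodes
fi
```

Run it with the argument check to test all 3 ports. You can also open the same URL in a browser and log in with the same credentials.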

Please don't close these terminals while you work through the rest of the demonstration.

Now that the 3 nodes are up and running, let's create some run configurations to execute our transformations on specific nodes.

Some ETL activities are lightweight, such as loading in a small text file to write out to a database or filtering a few rows to trim down your results. For these activities, you can run your transformation locally using the default Pentaho engine.

Some ETL activities are more demanding, containing many steps calling other steps or a network of transformation modules. For these activities, you can set up a separate Pentaho Server dedicated to running transformations using the Pentaho engine.

Other ETL activities involve large amounts of data on network clusters requiring greater scalability and reduced execution times. For these activities, you can run your transformation using the Spark engine in a Hadoop cluster.

Pentaho local is the default run configuration. It runs transformations with the Pentaho engine on your local machine. You cannot edit this default configuration.

Before continuing, ensure the nodes are configured and running.

To create a new run configuration, right-click the 'Run configurations' folder and select New.

Enter the following configuration details, ensuring that you select the Pentaho (KETTLE) engine.

When you run the transformation, select the Master node run configuration.

As you can see from the results:

• The transformation is executed on the Master node.

• The Monitor tab displays the step metrics.

Take a look at the master node's terminal.

Give it a go with other run configurations, for example ones that target only Slave A or Slave B.

A cluster schema is essentially a collection of slave servers. In each schema, you need to pick at least one slave server that we will call the Master slave server or master.

The master is also just a Carte instance, but it takes care of all sorts of management tasks across the cluster schema. In the Spoon GUI, you can enter this metadata once you have started a couple of slave servers.

The workflow in a clustered Pentaho transformation is as follows:

• The job entry or the transformation connects to the cluster master node, which is responsible for coordinating the execution of the transformation steps on the cluster slave nodes.

• The master node sends the transformation metadata and the cluster schema to the slave nodes, and assigns each step to one or more nodes based on the cluster schema.

• The slave nodes execute the assigned steps and exchange data with each other using sockets or shared files, depending on the partitioning method and the clustering plugin used.

• The master node monitors the progress and status of the slave nodes, and collects logging information and performance metrics from them.

• The master node reports the outcome of the transformation execution to the job entry or the transformation that initiated it.

Cluster Schema

To create a new cluster schema, right-click the 'Kettle cluster schemas' folder and select New.

Enter the following configuration details.

Options and their descriptions:

• Schema name: The name of the clustering schema.

• Port: Specify the port from which to start numbering ports for the slave servers. Each additional clustered step executing on a slave server will consume an additional port. Note: To avoid networking problems, make sure no other networking protocols are in the same range.

• Sockets buffer size: The internal buffer size to use.

• Sockets flush interval rows: The number of rows after which the internal buffer is sent completely over the network and emptied.

• Sockets data compressed?: When enabled, all data is compressed using the Gzip compression algorithm to minimize network traffic.

• Dynamic cluster: If checked, a master Carte server will perform failover operations, and you must define the master as a slave server in the field below. If unchecked, Spoon will act as the master server, and you must define the available Carte slaves in the field below.

• Slave Servers: A list of the servers to be used in the cluster. You must have one master server and any number of slave servers. To add servers to the cluster, click Select slave servers to select from the list of available slave servers.
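
For a dynamic cluster, the slaves need to know which master to report to. Carte can take this from an XML configuration file instead of a host and port on the command line. The sketch below prints such files for the nodes used in this guide. The element names follow the Carte configuration file format as I understand it, so check them against the documentation for your PDI version. The node names, the function names and the default cluster / cluster credentials are assumptions for illustration.

```shell
#!/bin/sh
# Print one <slaveserver> element: name, hostname, port, master flag (Y/N).
# Assumption: default cluster/cluster credentials.
slaveserver_xml() {
    cat <<EOF
<slaveserver>
  <name>$1</name>
  <hostname>$2</hostname>
  <port>$3</port>
  <username>cluster</username>
  <password>cluster</password>
  <master>$4</master>
</slaveserver>
EOF
}

# Configuration for the master node on port 12000.
master_config() {
    echo '<slave_config>'
    slaveserver_xml master localhost 12000 Y
    echo '</slave_config>'
}

# Configuration for a slave that registers itself with the master.
# Arguments: slave name, slave port.
slave_config() {
    echo '<slave_config>'
    echo '<masters>'
    slaveserver_xml master localhost 12000 Y
    echo '</masters>'
    echo '<report_to_masters>Y</report_to_masters>'
    slaveserver_xml "$1" localhost "$2" N
    echo '</slave_config>'
}
```

For example, slave_config slave-a 12100 > carte-slave-a.xml writes a file (the file name is arbitrary) that you can then pass to Carte as its only argument: sh carte.sh carte-slave-a.xml.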

To create a run configuration that uses the cluster, right-click the 'Run configurations' folder and select New.

Enter the following configuration details, ensuring that you select the Pentaho (KETTLE) engine.