Kettle EE plugins
Installation of EE plugins
Pentaho Data Integration Plugin Manager
Pentaho Data Integration (PDI) can be extended with plugins that add new steps, job entries, and other functionality. The best way to manage these plugins is through the Plugin Manager, which you'll find in both the PDI client and Pentaho User Console (PUC).
The Plugin Manager handles all your plugin needs: installing new ones, updating existing ones to their latest versions, and removing plugins you no longer use.
While you can install plugins manually, this approach isn't recommended. Manually installed plugins won't show up in the Plugin Manager, which means you'll have to handle all future updates and removals yourself.
To open the Plugin Manager, select Tools > Plugin Manager in the top toolbar.

Installing a Plugin: Find the plugin you want to install by searching or browsing the available options.
For the latest version: Simply click Install.

For an earlier version: Click on the plugin's table row to open the Plugin name dialog box. Select your desired version from the dropdown list and click Install. Confirm the installation if prompted.
Restart to activate: After installation, restart both the Pentaho Server and the PDI client. This step is essential: newly installed plugins won't work until you restart.
Verify the installation: Log into the PDI client and navigate to Tools > Plugin Manager. Search for or browse to your newly installed plugin. Check the Installed Version column to confirm the correct version is listed.
Kafka
Reconfiguring multiple Docker Compose files for different scenarios quickly becomes tedious. To solve this, you can use a Python script that generates the docker-compose file for you: Kafka-Docker-Composer.
Deploy Kafka
The application kafka_docker_composer.py takes a list of arguments and determines how the template should be populated. It creates the dependencies between the different components, ensures that names and ports are unique, sets up advertised listeners correctly, and makes sure that dependent services like Schema Registry or Kafka Connect point to the corresponding Confluent Server brokers.
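To make the bookkeeping concrete, here is a minimal Python sketch of the kind of work such a generator does: assigning unique service names and host ports for N brokers and controllers, and wiring each broker to the controller quorum. The service names, images, port scheme, and environment variables below are illustrative, not the actual output of kafka_docker_composer.py.

```python
# Illustrative sketch of a compose generator's bookkeeping: unique names,
# unique host ports, and controller quorum wiring. All names/ports here are
# made up for demonstration; they are not kafka_docker_composer.py's output.

def generate_services(brokers: int, controllers: int) -> dict:
    """Build a docker-compose 'services' mapping with unique names and ports."""
    services = {}

    # KRaft controllers get sequential names and unique host ports.
    for i in range(1, controllers + 1):
        services[f"controller-{i}"] = {
            "image": "confluentinc/cp-server:latest",  # illustrative image
            "ports": [f"{9070 + i}:9071"],
        }

    # Every broker points at the same controller quorum string.
    controller_quorum = ",".join(
        f"{i}@controller-{i}:9071" for i in range(1, controllers + 1)
    )

    # Brokers advertise a listener on their own unique host port.
    for i in range(1, brokers + 1):
        host_port = 9091 + i  # 9092, 9093, ...
        services[f"broker-{i}"] = {
            "image": "confluentinc/cp-server:latest",
            "ports": [f"{host_port}:{host_port}"],
            "environment": {
                "KAFKA_CONTROLLER_QUORUM_VOTERS": controller_quorum,
                "KAFKA_ADVERTISED_LISTENERS": f"PLAINTEXT://localhost:{host_port}",
            },
            "depends_on": [f"controller-{j}" for j in range(1, controllers + 1)],
        }
    return services
```

The resulting mapping could then be dumped under a `services:` key to produce the docker-compose YAML.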
Simple Cluster
ZooKeeper is deprecated, so modern versions of Kafka use KRaft for metadata management instead.
1 Broker
1 KRaft Controller
Generate the docker compose yaml.
Run the docker compose.
Check the containers.
To bring down the containers & remove volumes.
Realistic Cluster
3 Brokers
3 KRaft Controllers
Prometheus (port 9090)
Grafana (port 3000, admin/adminpass)
Schema Registry
2 Kafka Connect instances
Confluent Control Center (version 7.9.5)
Generate the docker compose yaml.
Run the docker compose.
Check the containers.
Verify the deployment.

To bring down the containers & remove volumes.
Quick Commands
Control Center
Grafana
The user and password default to admin/adminpass; you can change them in volumes/config.ini.
Prometheus
There are separate dashboards for ZooKeeper and KRaft controllers as indicated by their names.
The exporter configuration files, dashboards, and the exporter jar are in the volumes directory, so you do not have to download anything separately.
It takes a few minutes for JMX exporters to start up. Check the Status/Targets page in Prometheus to see if your metrics scrapes succeeded.
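For orientation, a Prometheus scrape job for a JMX exporter typically looks like the fragment below. The job name, target host, and exporter port are illustrative; the actual configuration shipped in the volumes directory may differ.

```yaml
# Illustrative scrape job for a broker's JMX exporter; names and ports are examples.
scrape_configs:
  - job_name: "kafka-broker"
    static_configs:
      - targets: ["broker-1:8091"]   # hypothetical exporter port
```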
Schema Registry API
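The `/subjects` and `/subjects/<name>/versions/latest` endpoints used below are part of Schema Registry's public REST API; the host and port are an assumption for this deployment (the default is 8081). A small Python sketch:

```python
# Minimal helpers for querying Confluent Schema Registry's REST API.
# The endpoint paths are part of the public API; the base address is an
# assumption for this deployment.
import json
import urllib.request

BASE = "http://localhost:8081"  # assumed Schema Registry address

def subjects_url(base: str = BASE) -> str:
    """URL that lists all registered subjects."""
    return f"{base}/subjects"

def latest_version_url(subject: str, base: str = BASE) -> str:
    """URL for the latest schema version of a subject."""
    return f"{base}/subjects/{subject}/versions/latest"

def get_json(url: str):
    """GET a Schema Registry endpoint and decode the JSON response."""
    with urllib.request.urlopen(url) as resp:
        return json.load(resp)

# With a running registry: get_json(subjects_url()) returns the list of
# subjects, and get_json(latest_version_url("pageviews-value")) returns the
# latest schema registered under that (hypothetical) subject name.
```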
Datagen Connector
The Confluent Datagen Source Connector (kafka-connect-datagen) is a Kafka Connect plugin designed to generate mock data for development and testing purposes. Note upfront that this connector is not intended for production use; it exists purely to help developers and testers simulate real data flowing through a Kafka pipeline.
The connector uses Avro Random Generator under the hood to define the "rules" for the mock data it produces. You specify an Avro schema via a quickstart template or a custom schema file, and the connector continuously produces records to a Kafka topic based on that schema. Confluent ships several pre-built quickstart schemas out of the box, including pageviews, users, orders, and stock trades, making it easy to get started without writing your own schema right away.
It supports multiple output formats including Avro, JSON Schema, Protobuf, and schemaless JSON, giving you flexibility depending on whether you're working with Confluent Schema Registry or not.
Once running, the connector continuously streams generated records into a Kafka topic, which you can then consume with any Kafka client (a standard Kafka consumer, a Kafka Streams application, or ksqlDB), making it a great tool for building out and validating your downstream Kafka pipelines before real data is available.
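As a concrete example, a Datagen connector instance is configured with a small JSON payload submitted to the Kafka Connect REST API. The `connector.class`, `quickstart`, `kafka.topic`, `max.interval`, and `iterations` properties are the connector's documented settings; the connector name, topic, converters, and registry URL below are illustrative for this deployment.

```json
{
  "name": "datagen-pageviews",
  "config": {
    "connector.class": "io.confluent.kafka.connect.datagen.DatagenConnector",
    "kafka.topic": "pageviews",
    "quickstart": "pageviews",
    "key.converter": "org.apache.kafka.connect.storage.StringConverter",
    "value.converter": "io.confluent.connect.avro.AvroConverter",
    "value.converter.schema.registry.url": "http://schema-registry:8081",
    "max.interval": 1000,
    "iterations": -1,
    "tasks.max": "1"
  }
}
```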
Deploy the datagen connectors.

Verify data is flowing to the Kafka cluster (read 5 messages).
Log into the Control Center.


You now have data streaming to various topics at different rates via the Datagen connector.

Deploy the MySQL database using make (uses docker-compose-mysql.yml).

What happens:
MySQL 8.0 container starts
Database kafka_warehouse created automatically
User kafka_user created with password kafka_password
All tables, views, and stored procedures created automatically
Data persisted in a Docker volume
Verify deployment.

Connect to MySQL.
Databricks
The Bulk load into Databricks entry loads large volumes of data from cloud storage files directly into Databricks tables. How it works: behind the scenes, it uses Databricks' COPY INTO command.
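For reference, a COPY INTO statement of the kind the entry issues looks like the sketch below. The catalog, table, path, and options are placeholders, not what the step actually generates.

```sql
-- Illustrative COPY INTO; table name, source path, and options are placeholders.
COPY INTO my_catalog.my_schema.sales
FROM 's3://my-bucket/landing/sales/'
FILEFORMAT = PARQUET
COPY_OPTIONS ('mergeSchema' = 'true');
```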
Salesforce Bulk Operation
The Salesforce bulk operation step performs large-scale data operations (insert, update, upsert, and delete) on Salesforce objects using the Salesforce Bulk API 2.0.
How it works: The step reads data from an input stream, creates a CSV file of the changes, and executes the bulk job against Salesforce. After the job completes, you can optionally route three types of results to separate output streams: successful records, unprocessed records, and failed records.
Requirements: You must have a Salesforce Client ID and Client Secret to use this step.
Google Analytics v4
The Google Analytics v4 step retrieves data from your Google Analytics account for reporting or data warehousing purposes.
How it works: The step queries Google Analytics properties through the Google Analytics API v4 and sends the resulting dimension and metric values to the output stream.
Hierarchical Data Type (HDT)
Pentaho supports a hierarchical data type (HDT) through the Pentaho EE Marketplace plugin. This plugin adds the HDT data type, which can store and query data organized in a tree-like fashion (such as organizational charts, file systems, or taxonomies), and includes five specialized steps for working with it.
What it does: These steps simplify working with complex, nested data structures. They can convert between HDT fields and formatted strings, and let you directly access or modify nested array indices and keys.
Performance benefits: The steps significantly improve performance compared to handling hierarchical data as plain strings.
Data structure: HDT can store nested or complex data built from objects and arrays, as well as single elements. It's compatible with any PDI step that processes hierarchical data.
