Kafka
Use Case - Logistics revisited
Pentaho Data Integration
kafka-docker-compose is a tool that allows you to easily configure and set up Apache Kafka along with its components, such as Kafka brokers, ZooKeeper, Kafka Connect, and more, in a Docker environment. Using docker-compose, you can define and run multi-container Docker applications where each service (like a Kafka broker or ZooKeeper) is defined in a docker-compose.yml file.
This approach simplifies the complexities of network configurations between these services and ensures that you have a reproducible and isolated environment for development, testing, and potentially production scenarios. It allows for easy scaling of Kafka brokers and other services within your cluster.
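The generated file will be more elaborate, but as a hand-written sketch, a single-broker ZooKeeper-mode stack in docker-compose.yml can look like this (the image tags, ports, and listener settings below are illustrative assumptions, not the tool's actual output):

```yaml
version: "3"
services:
  zookeeper:
    image: confluentinc/cp-zookeeper:7.5.0   # assumed image/tag
    environment:
      ZOOKEEPER_CLIENT_PORT: 2181
  kafka:
    image: confluentinc/cp-kafka:7.5.0       # assumed image/tag
    depends_on:
      - zookeeper
    ports:
      - "9092:9092"
    environment:
      KAFKA_BROKER_ID: 1
      KAFKA_ZOOKEEPER_CONNECT: zookeeper:2181
      KAFKA_ADVERTISED_LISTENERS: PLAINTEXT://localhost:9092
      KAFKA_OFFSETS_TOPIC_REPLICATION_FACTOR: 1
```

Each service runs as its own container; docker-compose wires them into a shared network so the broker can reach ZooKeeper by its service name.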
The kafka-docker-compose tool requires Python 3 and Jinja2 to be installed.
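kafka-docker-compose uses Jinja2 to render docker-compose.yml from templates. Purely to illustrate the templating idea with the standard library alone, here is a sketch using string.Template; Jinja2 adds loops, conditionals, and filters on top of this kind of substitution, and the template fragment and variable names below are hypothetical, not the tool's real templates:

```python
from string import Template

# Hypothetical fragment of a compose template; the real
# kafka-docker-compose templates are Jinja2 and far more complete.
compose_template = Template("""\
services:
  kafka:
    image: $image
    ports:
      - "$host_port:9092"
""")

rendered = compose_template.substitute(
    image="confluentinc/cp-kafka:7.5.0",
    host_port="9092",
)
print(rendered)
```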
Create a Kafka directory.
cd
mkdir -p ~/Kafka
Follow the instructions in the link below to create the cluster.
Let's start with a simple cluster that consists of 1 broker and 1 controller (ZooKeeper mode).
Execute the following command (adds JMX agents for Prometheus & Grafana).
Change the following values in the generated docker-compose.yml file:
Execute the generated docker-compose.yml file (e.g. with docker compose up -d).

KaDeck
KaDeck is a specialized tool designed for working with Apache Kafka, offering a user-friendly interface for Kafka monitoring, management, and data exploration. It serves as a comprehensive client that allows developers, data engineers, and operations teams to interact with their Kafka clusters more efficiently.
The tool provides real-time visibility into Kafka topics, consumer groups, and messages, allowing users to browse and search through data streams with advanced filtering capabilities. This makes troubleshooting and debugging significantly easier compared to command-line alternatives. KaDeck also offers features for monitoring cluster performance, analyzing consumer lag, and visualizing message flow throughout the system.
One of KaDeck's key strengths is its ability to decode various message formats automatically (including Avro, JSON, and Protocol Buffers), presenting the data in a structured, readable format. The tool supports both cloud-based and on-premises Kafka deployments, making it versatile for different enterprise environments. For teams working extensively with event streaming platforms, KaDeck helps bridge the gap between technical Kafka operations and business-relevant data insights.
Try KaDeck for free!
Run the following command, changing the ports to prevent conflicts.
Open the Lenses HQ at: http://localhost:8070

Log in with:
Username: admin
Password: admin
We'll generate IoT sensor data using PDI.
Start Pentaho Data Integration:
Kafka Producer
The Kafka Producer step allows you to publish messages in near real time to a Kafka broker. Within a transformation, the Kafka Producer step publishes a stream of records to a Kafka topic.
Open the following transformation:
~/Workshop--Data-Integration/Labs/Module 3 - Data Sources/Streaming Data/04 Kafka/tr_kafka_producer.ktr
Just for one vehicle_id (111), every 5 seconds
Timestamp added
Remove some fields
JavaScript to generate the sensor data
Dummy step to collect the data streams
Concatenate the fields into a 'message' payload
Kafka Producer: connect to the broker and publish the message/payload
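The steps above can be sketched in plain Python. The field names, value ranges, and payload format here are illustrative assumptions, not necessarily what the .ktr actually emits:

```python
import json
import random
import time

def generate_sensor_reading(vehicle_id: int = 111) -> str:
    """Mimic the transformation: one vehicle, a timestamp, generated
    sensor values, all concatenated into a single message payload."""
    reading = {
        "vehicle_id": vehicle_id,                              # fixed vehicle 111
        "timestamp": int(time.time() * 1000),                  # timestamp added
        "speed_kmh": round(random.uniform(0.0, 120.0), 1),     # assumed field
        "engine_temp_c": round(random.uniform(70.0, 110.0), 1) # assumed field
    }
    # Concatenate the fields into the 'message' payload.
    return json.dumps(reading)

payload = generate_sensor_reading()
```

In the .ktr, the loop and the 5-second cadence come from the generator step; here a scheduler or a simple time.sleep(5) loop would play that role.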
Double-click the Kafka Producer step and configure it with the following settings.
Setup

Connection
Select a connection type:
Direct: Specify the Bootstrap servers from which you want to receive the Kafka streaming data.
Cluster: Specify the Hadoop cluster configuration from which you want to retrieve the Kafka streaming data. In a Hadoop cluster configuration, you can specify information like host names and ports for HDFS, Job Tracker, security, and other big data cluster components. Multiple servers can be specified if these are part of the same cluster.
Client ID
The unique Client identifier, used to identify and set up a durable connection path to the server to make requests and to distinguish between different clients.
Topic
The category to which records are published.
Key Field
In Kafka, all messages can be keyed, allowing for messages to be distributed to partitions based on their keys in a default routing scheme. If no key is present, messages are randomly distributed to partitions.
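The effect of keying can be sketched as follows. Kafka's default partitioner actually hashes the key with murmur2; the MD5-based hash below is a simplified stand-in, so the point is only "same key, same partition", not the exact partition numbers Kafka would pick:

```python
import hashlib

def partition_for(key: bytes, num_partitions: int) -> int:
    # Simplified stand-in for Kafka's default partitioner
    # (which uses murmur2, not MD5).
    digest = hashlib.md5(key).digest()
    return int.from_bytes(digest[:4], "big") % num_partitions

# Every message keyed by the same vehicle lands in the same partition,
# which preserves per-vehicle ordering.
p1 = partition_for(b"vehicle-111", 6)
p2 = partition_for(b"vehicle-111", 6)
```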
Message Field
The individual record contained in a topic.
Options
The Options tab enables you to secure the connection to the broker.

Kafka Consumer
The Kafka Consumer step pulls streaming data from Kafka into a transformation. Within the Kafka Consumer step, you enter the path of a child transformation, which executes according to message batch size or duration in near real time. The child transformation must start with the Get records from stream step.
Additionally, from the Kafka Consumer step, you can select a step in the child transformation to stream records back to the parent transformation. This allows records processed by a Kafka Consumer step in a parent transformation to be passed downstream to any other steps included within the same parent transformation.
Open the following transformation:
~/Workshop--Data-Integration/Labs/Module 3 - Data Sources/Streaming Data/04 Kafka/tr_kafka_consumer.ktr
Get Records
The Get records from stream step returns the records that were previously generated, in this case, by the Kafka Consumer step.
Open the following transformation:
~/Workshop--Data-Integration/Labs/Module 3 - Data Sources/Streaming Data/04 Kafka/tr_process_sensor_data.ktr