LLM Integration
Integrate LLMs into your pipelines.
Overview
The workshops establish a repeatable architectural pattern: PDI acts as the orchestration layer, while a locally hosted LLM (served via Ollama) handles the "intelligent" processing that rule-based ETL cannot do well. The six use cases covered - sentiment analysis, data quality, data enrichment, named entity recognition, text summarization, and multi-stage processing - all share the same fundamental skeleton. Once you understand it, you can adapt it to virtually any AI-enrichment task.
Everything hinges on making a clean POST request to Ollama's /api/generate endpoint at http://localhost:11434. This is a local REST API, so there's no external dependency, no API key, and no network latency.
The request payload structure is:
```json
{
  "model": "llama3.2:3b",
  "prompt": "Your instruction here...",
  "stream": false,
  "format": "json",
  "keep_alive": "30m",
  "options": {
    "temperature": 0.1,
    "num_predict": 300,
    "num_ctx": 2048,
    "num_thread": 0
  }
}
```

Parameters:
model — Which Ollama model to invoke. The workshops use llama3.2:3b as the default: compact enough to run on CPU-only hardware but capable enough for structured tasks. You can scale up to llama3.1:8b for harder problems at the cost of speed.
stream: false — This is critical for PDI. Streaming sends tokens incrementally, which is great for chat UIs but useless in an ETL row-by-row context. Setting this to false tells Ollama to generate the entire response, then return it as one complete JSON object — which is what PDI needs to write into a field.
format: "json" — Forces the model's output to be valid JSON. Without this, the model might add conversational preamble, markdown fences, or explanations around its JSON answer, all of which break downstream parsing.
keep_alive — This is the single biggest performance optimization for batch processing. By default Ollama unloads the model from memory after each request. Reloading takes 10–30 seconds depending on model size and disk speed. Setting keep_alive: "30m" keeps the model resident in RAM across all rows in a batch run, turning a multi-hour job into a fraction of that time.
temperature: 0.1 — Controls randomness. Values near 0 produce near-deterministic output, meaning the same review will produce the same sentiment classification on repeated runs. This is essential for reproducible ETL — you don't want results changing every time the pipeline runs.
num_predict: 300 — Caps the maximum tokens generated. This prevents runaway generation, protects processing time, and avoids unexpectedly large response payloads overwhelming downstream steps.
When PDI's REST Client step receives the Ollama response, it comes back as a single JSON object with two categories of fields: the content fields that contain the actual result, and the performance/telemetry fields that tell you how the inference went.
The response field is a string containing JSON, not a nested object - which is why the pipeline requires two separate JSON Input steps: one to extract $.response from the Ollama wrapper, and a second to parse the actual model output into structured fields.
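A minimal sketch of that double parse in plain JavaScript (the sample wrapper string is illustrative):

```javascript
// Stage 1: the REST Client result is Ollama's wrapper object.
var llm_response = '{"model":"llama3.2:3b","response":"{\\"sentiment\\":\\"positive\\",\\"score\\":0.9}","done":true}';

var wrapper = JSON.parse(llm_response);   // first JSON Input: the Ollama envelope
var inner = wrapper.response;             // $.response is a STRING containing JSON

// Stage 2: parse the model's own output into structured fields.
var result = JSON.parse(inner);           // second JSON Input: the model's answer
var sentiment = result.sentiment;
var score = result.score;
```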
The response payload structure is:
Content Fields
model — Confirms which model actually processed the request. This is useful as a sanity check, especially in pipelines where the model name is injected via a PDI variable. If you're getting unexpected results, checking this field confirms whether the right model was used.
created_at — The UTC timestamp of when the response was generated. Useful for logging and auditing, particularly if you're persisting results to a database and need to track when the enrichment was applied.
response — This is the only field you actually care about for downstream processing. It contains the LLM's generated output as a string. Importantly, even when you pass "format": "json" in the request, Ollama still wraps the model's JSON output inside this string field rather than embedding it as a native JSON object. That's why the pipeline needs a JSON Input step to extract $.response first, then a second parsing pass to read the actual model output.
done — A boolean that signals whether generation completed. When true, the response is complete. When false, the model was interrupted or is still streaming (which shouldn't happen in PDI since you set stream: false, but worth checking during error handling).
done_reason — Explains why generation stopped. The value "stop" means the model reached a natural conclusion — it finished what it was saying. Other possible values include "length" (generation was cut off because it hit the num_predict token limit) and "load" (model was freshly loaded). If you see "length" in production, it means your responses are being truncated and you need to increase num_predict.
Performance / Telemetry Fields
All duration values are expressed in nanoseconds, so divide by 1,000,000,000 to get seconds.
total_duration — The wall-clock time for the entire request from when Ollama received it to when it returned the response. This is what you'd use to track end-to-end latency per record. In the sentiment analysis example, this was ~3.1 seconds for a short review.
load_duration — How long it took to load the model into memory before inference began. When keep_alive is working correctly and the model is already resident in RAM, this value drops to near zero for subsequent requests. If you see load_duration remaining high across all rows, it means the model is being reloaded on every request, which is a sign your keep_alive setting isn't being applied.
prompt_eval_count — The number of tokens in your input prompt. This is the tokenized length of everything you sent to the model. Monitoring this field is important for prompt optimization — the workshops show how cutting prompt length by 50% cuts processing time proportionally. If prompt_eval_count is unexpectedly high, your prompt likely has verbose instructions that can be trimmed.
prompt_eval_duration — How long the model spent processing (encoding) your prompt before it started generating the response. This is the "reading the question" phase. It's typically much shorter than eval_duration.
eval_count — The number of tokens in the generated response. This is your output token count. If this number is hitting your num_predict ceiling (e.g., exactly 300 when you set num_predict: 300), that's a strong signal that your responses are being truncated and you need to raise the limit or shorten your expected output schema.
eval_duration — How long the model spent actually generating the response tokens. This is almost always the dominant cost in total_duration. On CPU-only hardware, token generation speed is the bottleneck — typically the workshops see about 2.1 seconds of generation time for an 18-token response, which gives you a rough tokens-per-second figure for capacity planning.
context — An array of integers representing the internal token IDs that make up the conversation state. This is the model's "memory" of the exchange encoded as token references. For single-turn ETL tasks you can completely ignore this field. It only becomes relevant if you're building multi-turn conversation pipelines where you want to pass context back to the model in subsequent requests to maintain continuity.
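The telemetry fields above can be turned into human-readable numbers with a few divisions; this sketch uses the sample values quoted in this section:

```javascript
// Telemetry from a sample Ollama response (all durations in nanoseconds).
var telemetry = {
  total_duration: 3131161470,
  load_duration: 220384685,
  prompt_eval_count: 64,
  eval_count: 18,
  eval_duration: 2088187708
};

var NS_PER_SEC = 1e9;
var totalSec = telemetry.total_duration / NS_PER_SEC;  // ~3.1 s end to end
var genSec = telemetry.eval_duration / NS_PER_SEC;     // ~2.1 s generating
var tokensPerSec = telemetry.eval_count / genSec;      // rough capacity figure

// A high load_duration on every row suggests keep_alive is not being applied.
var modelWasResident = telemetry.load_duration / NS_PER_SEC < 1;
```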
Anatomy of a Good Prompt
The prompt is the only interface between your ETL data and the LLM. PDI handles reading, routing, and writing data - but the quality of what the model returns is entirely determined by how well you wrote the prompt. A vague prompt produces unpredictable output that breaks your JSON parser. A well-engineered prompt produces the same structured response every time, making the downstream pipeline reliable.
Using the sentiment analysis workshop as a concrete example, the Modified JavaScript Value step constructs the prompt for each row in PDI JavaScript.
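The workshop's exact script isn't reproduced here, but a sketch consistent with the schema-first style described below could look like this (review_text is the dataset's field name; the precise wording of the prompt is an assumption):

```javascript
// Hypothetical reconstruction of the prompt-building step; the exact
// workshop prompt may differ. It shows the schema instead of describing it.
var review_text = "Best purchase ever!"; // row field in PDI

var prompt_text =
  'Analyze sentiment of this review: "' + review_text + '"\n' +
  'Return JSON: {"sentiment": "positive|negative|neutral", ' +
  '"score": -1.0 to 1.0, "confidence": 0-100, ' +
  '"key_phrases": ["..."], "summary": "..."}';
```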
Every element of this prompt is doing specific work.
**The instruction is a verb, not a question.** "Analyze sentiment" is a direct command. Phrasing it as "Can you tell me the sentiment of this review?" wastes tokens and invites a conversational response rather than structured output.
**The data is clearly delimited.** Wrapping the review text in quotes separates it visually and semantically from the instruction. Without this, the model can conflate the instruction with the data, especially for reviews that contain imperative language.
**The schema is shown, not described.** Rather than writing "return a JSON object with a field called sentiment that can be positive, negative, or neutral", the prompt shows the exact JSON structure. The model pattern-matches against the example and fills in the values, which is far more reliable than parsing a verbal description of what you want.
**Value constraints are embedded inline.** Specifying -1.0 to 1.0 for score and 0-100 for confidence directly in the schema means you don't need a separate instruction section. The model sees the range at the exact point where it needs to apply it.
The Cost of Verbosity
The Data Quality workshop makes this concrete by contrasting a verbose data-cleaning prompt with an optimized rewrite that asks for exactly the same output.
The verbose version uses roughly 120 tokens. The optimized version uses around 60. On a batch of 1,000 records running on CPU-only hardware at 23 seconds per record, that difference compounds into real time savings. The model doesn't need to be spoken to politely - it needs to be spoken to precisely.
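For illustration only (these are not the workshop's actual prompts, and whitespace splitting is only a crude stand-in for real tokenization):

```javascript
// Hypothetical verbose vs. compact prompts; word count is used as a rough
// proxy for token count, so the numbers are only indicative.
var verbose =
  "Please clean and standardize this customer record. The name field " +
  "should be converted to Title Case. The phone number should be " +
  "reformatted to the pattern +1-XXX-XXX-XXXX. Please return your " +
  "answer as a JSON object with the fields name and phone.";

var compact =
  'Clean: name="john smith", phone="555.123.4567"\n' +
  'Return JSON: {"name": "Title Case", "phone": "+1-XXX-XXX-XXXX"}';

function roughTokens(s) { return s.split(/\s+/).length; }
var saved = 1 - roughTokens(compact) / roughTokens(verbose); // well over half
```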
Gotcha 1: ${VARIABLE} syntax doesn't work inside JavaScript strings.
This is the most common failure mode in the workshops, and it produces a completely silent error: when you write "${MODEL_NAME}" inside a JavaScript string, PDI does not substitute the variable. The string ${MODEL_NAME} is sent literally to Ollama, which either returns a 400 Bad Request or, worse, tries to find a model named ${MODEL_NAME} and fails silently. The correct approach is always getVariable().
The ${PARAM} syntax only works in XML-based step configuration fields, not in JavaScript code blocks.
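The two approaches side by side, as a runnable sketch (the fallback branch is only there so the snippet also runs outside PDI):

```javascript
// WRONG: inside a Modified JavaScript Value step this stays literal --
// Ollama receives the text "${MODEL_NAME}", not the parameter's value.
var wrongModel = "${MODEL_NAME}";

// RIGHT: getVariable() resolves PDI parameters in JavaScript code.
// Outside PDI (e.g. when testing this sketch), fall back to the default.
var model = (typeof getVariable === "function")
  ? getVariable("MODEL_NAME", "llama3.2:3b")
  : "llama3.2:3b";
```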
Gotcha 2: The model adds text around your JSON even with format: "json".
Even when you set "format": "json" in the Ollama request, some models - particularly on certain prompts - will wrap their JSON in markdown code fences or add a sentence of preamble before it. If your JSON Input step tries to parse the full response field directly, it will fail because the surrounding text makes it invalid JSON. The defensive fix used in the workshops is to scan for the first { and last } in the response string and extract only that substring.
This makes your parser robust to model verbosity regardless of which model or version you're running.
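The extraction can be written as a small helper:

```javascript
// Extract the JSON object even when the model wraps it in prose or fences.
function extractJson(text) {
  var start = text.indexOf("{");
  var end = text.lastIndexOf("}");
  if (start < 0 || end <= start) return null; // no JSON object present
  return text.substring(start, end + 1);
}

var raw = 'Sure! Here is the JSON you asked for: {"sentiment": "positive"} Hope that helps.';
var parsed = JSON.parse(extractJson(raw)); // succeeds despite the preamble
```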
Gotcha 3: Silent truncation when num_predict is too low.
If the model's response hits the num_predict token ceiling mid-generation, Ollama stops and returns whatever was generated so far. The done_reason field will say "length" instead of "stop", but if you're not logging that field, you'll never know.
The symptom is malformed JSON arriving at your parser - the object opens but never closes - which your error handler catches and routes to the failure path. The fix is to monitor eval_count in your output and raise num_predict if it consistently hits the ceiling.
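A minimal truncation check along these lines can be added to the parsing step (the NUM_PREDICT constant mirrors the request option):

```javascript
// Flag responses that were cut off at the num_predict ceiling.
var NUM_PREDICT = 300;

function isTruncated(ollamaResponse) {
  return ollamaResponse.done_reason === "length" ||
         ollamaResponse.eval_count >= NUM_PREDICT;
}

var ok = { done_reason: "stop", eval_count: 18 };     // finished naturally
var cut = { done_reason: "length", eval_count: 300 }; // hit the ceiling
```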
Gotcha 4: Asking for too many fields degrades accuracy.
It's tempting to extract everything you might ever want in a single prompt. The NER workshop covers 10 entity types (PERSON, ORGANIZATION, LOCATION, DATE, PRODUCT, MONEY, CONTACT, ID, TECHNOLOGY, POSITION) simultaneously.
This works for NER because entity classification is a well-defined task. But for analytical prompts - asking for sentiment, tone, intent, key phrases, competitive mentions, urgency, and a summary all at once - models tend to fill in fields speculatively rather than accurately. If accuracy matters more than throughput, it's worth splitting complex prompts into focused single-purpose calls, even at the cost of additional API round trips.
Gotcha 5: Not escaping special characters in the source data.
When review text or customer data contains quotes, backslashes, or newlines, they can break your JavaScript string concatenation and produce a malformed JSON payload before it even reaches Ollama. A review like "Best product I've ever bought — 5 stars!" is fine, but one containing "He said \"amazing\" and I agree" will break the string if not handled. The fix is to sanitize the input text before embedding it in the prompt.
This is especially important for free-text fields pulled from customer-facing systems where you have no control over what users type.
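A sanitizer along these lines keeps the payload valid (a sketch; the workshop's exact escaping code may differ):

```javascript
// Escape characters that would break the JSON payload when review
// text is concatenated into the prompt string.
function sanitizeForJson(text) {
  return text
    .replace(/\\/g, "\\\\")  // backslashes first, so later escapes aren't doubled
    .replace(/"/g, '\\"')    // embedded double quotes
    .replace(/\n/g, "\\n")   // newlines
    .replace(/\r/g, "\\r")
    .replace(/\t/g, "\\t");
}

var review = 'He said "amazing"\nand I agree';
var payload = '{"prompt": "Analyze: ' + sanitizeForJson(review) + '"}';
var roundTrip = JSON.parse(payload).prompt; // parses cleanly
```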
Select a workshop:
Sentiment Analysis
This workshop demonstrates how to integrate a Large Language Model (LLM) via Ollama with Pentaho Data Integration (PDI) to perform sentiment analysis on customer reviews. You'll learn how to build an ETL pipeline that reads customer feedback, analyzes sentiment using AI, and outputs structured results.
Workflow

Verify Ollama Installation.
Run through the following steps to build sentiment_analysis_optimized.ktr:
What is Sentiment Analysis?
Sentiment Analysis is the process of computationally identifying and categorizing opinions expressed in text to determine whether the writer's attitude toward a particular topic, product, or service is positive, negative, or neutral.
Example Input:
Example Output (Sentiment Analysis):
Why is Sentiment Analysis Important?
Business Applications:
Customer Feedback Analysis - Automatically categorize thousands of reviews to identify satisfaction trends
Brand Monitoring - Track public sentiment about your brand across social media and review sites
Product Improvement - Identify which features customers love and which need improvement
Customer Support Prioritization - Route angry customers to experienced support agents first
Market Research - Understand customer opinions about competitor products
Crisis Detection - Quickly identify negative sentiment spikes that require immediate attention
Real-World Example: A company receives 10,000 product reviews per month. Manual analysis would take weeks. With sentiment analysis:
Instant categorization: 7,500 positive, 1,800 neutral, 700 negative
Identify issues: Negative reviews mention "battery life" 450 times → product team investigates
Measure satisfaction: 75% positive sentiment score → track over time
Prioritize responses: Route the 200 most negative reviews to customer service
Types of Sentiment
1. Polarity (Basic)
Positive: "This product is amazing!"
Negative: "Terrible quality, waste of money"
Neutral: "The product arrived on Tuesday"
2. Granular Sentiment (Scored)
Very Positive: +0.8 to +1.0 ("Best purchase ever!")
Positive: +0.3 to +0.7 ("Good value for money")
Neutral: -0.2 to +0.2 ("It works as described")
Negative: -0.7 to -0.3 ("Not what I expected")
Very Negative: -1.0 to -0.8 ("Complete garbage, requesting refund")
3. Emotion-Based Sentiment (Advanced)
Joy: "So happy with this purchase!"
Anger: "This company has the worst customer service!"
Frustration: "Why doesn't this feature work properly?"
Disappointment: "Expected better quality for the price"
How LLMs Improve Sentiment Analysis
Traditional Methods (Rule-Based/ML):
Problems with Traditional Methods:
❌ Can't handle context: "This isn't bad" → Detected as negative (contains "bad")
❌ Misses sarcasm: "Oh great, another software bug" → Detected as positive (contains "great")
❌ Ignores negation: "Not good at all" → Detected as positive (contains "good")
❌ Limited to trained categories
❌ Requires extensive labeled training data
LLM-Based Sentiment Analysis:
Advantages of LLMs:
✅ Understands context and nuance
✅ Detects sarcasm and irony
✅ Handles negation correctly
✅ Provides explanations and reasoning
✅ Extracts key phrases automatically
✅ Works in multiple languages (multilingual models)
✅ No training data required (zero-shot learning)
✅ Customizable output format (JSON, XML, etc.)
Sentiment Analysis Output Components
In this workshop, our LLM will extract:
1. Sentiment Classification
Category: positive, negative, or neutral
Example:
"sentiment": "positive"
2. Sentiment Score
Numeric value from -1.0 (very negative) to +1.0 (very positive)
Example:
"score": 0.9(strongly positive)
3. Confidence Level
How certain is the LLM about this classification (0-100%)
Example:
"confidence": 95(very confident)Low confidence (<60%) might indicate mixed or ambiguous sentiment
4. Key Phrases
Important words/phrases that influenced the sentiment
Example:
["exceeded expectations", "incredible battery", "blazing fast"]
Useful for identifying specific strengths or weaknesses.
5. Summary
One-sentence summary of the review's main point
Example:
"Customer extremely satisfied with laptop performance and battery life"Helps quickly understand what the review is about
Use Cases in This Workshop
We'll analyze 3 customer reviews with varying sentiments:
Review 1 (Positive):
Review 2 (Negative):
Review 3 (Neutral/Mixed):
Expected Results
After running this workshop's transformation, you'll have:
Original review text
AI-determined sentiment (positive/negative/neutral)
Numeric score (-1.0 to 1.0)
Confidence percentage
Key phrases extracted
One-sentence summary
All in a structured CSV file ready for analysis, visualization, or database import!
Key Takeaways
Sentiment analysis automatically categorizes opinions in customer feedback
LLMs provide context-aware analysis that traditional methods can't match
Structured JSON output makes results easy to process in ETL pipelines
Confidence scores help identify reviews that need manual review
Key phrases identify specific strengths and weaknesses
Scalable processing - analyze thousands of reviews in minutes
Ollama API Endpoint
Ollama provides a REST API at http://localhost:11434
Key Endpoint: /api/generate
Sample Request Format
Parameters:
model: Which LLM model to use
prompt: The instruction/question for the model
stream: false for complete responses (true for streaming)
format: "json" to request JSON-formatted output
Sample Response Format
The actual LLM output is in the response field.
To test the API manually, send a sample request with curl:

The Response Structure
This is a successful response from the Ollama API. Here's what each part means:
Main Fields:
"model": "llama3.2:3b"- Confirms which model processed your request (Llama 3.2 with 3 billion parameters)"created_at"- Timestamp when the response was generated"response"- This is the actual AI-generated answer:The AI correctly identified the review as positive with a maximum confidence score of 1 (on the -1 to 1 scale you requested)
"done": true- Request completed successfully"done_reason": "stop"- Model finished naturally (not cut off due to length limits)
Performance Metrics:
"total_duration": 3131161470- Total time: ~3.1 seconds (in nanoseconds)"load_duration": 220384685- Model loading: ~0.2 seconds"prompt_eval_count": 64- Your prompt used 64 tokens"prompt_eval_duration": 115063978- Processing prompt: ~0.1 seconds"eval_count": 18- Response generated 18 tokens"eval_duration": 2088187708- Generating response: ~2.1 seconds
context
The array of numbers represents the internal token IDs used by the model - you can ignore this unless you're doing advanced work with conversation history.
Summary
The API successfully analyzed "This product is amazing! Best purchase ever." and correctly returned:
Sentiment: Positive
Score: 1.0 (maximum positivity)
The whole process took about 3 seconds. This is exactly what you'd use in your PDI transformation to get sentiment analysis results!

Run through the following steps to build sentiment_analysis_optimized.ktr:
Examine the customer reviews dataset.
The dataset contains:
review_id: Unique identifier
customer_name: Customer who wrote the review
product: Product being reviewed
review_text: The actual review content (this is what we'll analyze)
date: Review date
Double-click on the CSV file input step to review settings:

Configuration:
File path:
${INPUT_FILE} (parameter, defaults to ../data/customer_reviews.csv)
Delimiter: comma (,)
Enclosure: double quote (")
Header row: Yes
Encoding: UTF-8
Output Fields: review_id, customer_name, product, review_text, date
The transformation uses the INPUT_FILE parameter for flexibility; you can override it when running the transformation.
Double-click on the Modified JavaScript Value to review settings:

Step Type: Modified Java Script Value
Purpose: Construct the JSON payload for Ollama API
Key Logic:
Output Fields: prompt_text, json_payload
PDI Parameter Resolution in JavaScript:
PDI parameters work differently depending on where you use them:
| Context | Syntax | Resolved? | Example |
| --- | --- | --- | --- |
| XML tags | ${PARAM} | ✅ YES | <url>${OLLAMA_URL}/api/generate</url> |
| JavaScript strings | "${PARAM}" | ❌ NO | var x = "${MODEL_NAME}"; stays literal |
| JavaScript code | getVariable() | ✅ YES | var x = getVariable("MODEL_NAME", "default"); |
Correct JavaScript approach:
Incorrect approach (common mistake):
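Both approaches in one runnable sketch (parameter names match the transformation; the fallbacks are only for running the snippet outside PDI):

```javascript
// Correct: resolve PDI parameters with getVariable() in JavaScript code.
var inPdi = (typeof getVariable === "function");
var ollamaUrl = inPdi ? getVariable("OLLAMA_URL", "http://localhost:11434")
                      : "http://localhost:11434";
var modelName = inPdi ? getVariable("MODEL_NAME", "llama3.2:3b")
                      : "llama3.2:3b";

// Incorrect: ${...} is only substituted in XML step configuration fields,
// so this string would be sent to Ollama verbatim.
var broken = "${MODEL_NAME}";
```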
Prompt Engineering Tips:
Be explicit about the desired output format
Provide clear examples when possible
Request structured data (JSON) for easier parsing
Specify value ranges and types
REST Client
The REST client transformation step enables you to consume RESTful services.
Representational State Transfer (REST) is a key design idiom that embraces a stateless client-server architecture in which web services are viewed as resources and can be identified by their URLs.
You can escape input field data by using the Calculator step and the Mask XML content from string A or Escape HTML content function.
Double-click on the REST client step to review settings:

Use the REST Client step, not the generic HTTP Client step. The REST Client is specifically designed for REST APIs and handles POST requests properly.
Add a "REST Client" step
General Tab:
Application type: TEXT PLAIN
HTTP method: POST
URL:
${OLLAMA_URL}/api/generate
Body field:
json_payload
Headers Tab:
Add: Content-Type = application/json
Add: Accept = application/json
Settings Tab:
Result field name: llm_response
HTTP status code field: response_code
Response time: response_time
Socket timeout: 300000
Connection timeout: 30000
Common Issues & Solutions:
| Issue | Cause | Fix |
| --- | --- | --- |
| 405 Method Not Allowed | Using HTTP Client instead of REST Client | Change step type to REST Client |
| 400 Bad Request | PDI parameters not resolved in JavaScript | Use getVariable() in JavaScript |
| Timeout errors | Socket timeout too low | Increase to 300000 ms (5 minutes) |
| Connection refused | Ollama not running | Run curl http://localhost:11434/api/tags |
Important Notes:
✅ LLM inference takes 20-30 seconds per review - this is normal
✅ Set socket timeout to at least 300000ms (5 minutes)
✅ Always check response_code field (should be 200)
✅ Use REST Client step, not HTTP Client
✅ Test Ollama API with curl before running transformation
JSON Input
This step extracts the actual LLM-generated content from Ollama's response wrapper.
Parse the LLM's JSON output into separate fields.
Double-click on the JSON Input step to review settings:

Configuration:
Source: Field value (llm_response)
JSON field path: $.response
Output field: sentiment_json
Double-click on the second JSON Input step to review settings:

Configuration:
Source: Field value (sentiment_json)
JSON paths:
$.sentiment → sentiment (String)
$.score → score (Number, format: #.##)
$.confidence → confidence (Integer)
$.key_phrases → key_phrases (String)
$.summary → summary (String)
Double-click on the Text file output step to review settings:

Configuration:
File name: ../datasets/sentiment_results
Extension: .csv
Add date: Yes
Add time: Yes
Format: DOS (Windows line endings)
Encoding: UTF-8
Include header: Yes
Fields: All original fields + sentiment fields
Before you RUN the transformation, you will need to set the parameters.
Double-click anywhere on the canvas to display the transformation properties:

Transformation Parameters
The transformation uses these parameters for flexibility:
Basic Transformation (sentiment_analysis.ktr)
| Parameter | Default | Description |
| --- | --- | --- |
| OLLAMA_URL | http://localhost:11434 | Ollama API endpoint |
| MODEL_NAME | llama3.2:3b | Model to use for analysis |
| INPUT_FILE | ../data/customer_reviews.csv | Input data path |
Optimized Transformation (sentiment_analysis_optimized.ktr)
| Parameter | Default | Description |
| --- | --- | --- |
| OLLAMA_URL | http://localhost:11434 | Ollama API endpoint |
| MODEL_NAME | llama3.2:3b | Model to use for analysis |
| INPUT_FILE | ../data/customer_reviews.csv | Input data path |
| KEEP_ALIVE | 30m | Keep model in memory (5m/15m/30m/60m) |
| STEP_COPIES | 4 | Parallel copies (set to CPU cores - 1) |
Parameter Details:
KEEP_ALIVE: Controls how long Ollama keeps the model loaded in memory
5m = 5 minutes (minimal memory usage)
15m = 15 minutes (balanced)
30m = 30 minutes (recommended - prevents reload overhead)
60m = 60 minutes (for heavy workloads)
STEP_COPIES: Number of parallel processing threads for the "Call Ollama API" step
Recommended: CPU cores - 1 (e.g., 8 cores = set to 7)
Higher = faster processing but more memory usage
Default: 4 (good for most systems)
File Path Note: The default INPUT_FILE parameter points to ../data/customer_reviews.csv. Make sure your data files are in the data/ folder, not datasets/.
Performance Notes
Processing Time
3 reviews: ~69 seconds (~23 sec/review)
Expected for 100 reviews: ~38 minutes (at 23 sec/review)
Optimized version (parallel): Expected 3-4x faster
Recommendations
For testing: Use small datasets (3-10 reviews)
For production: Use optimized version with parallel processing
Adjust STEP_COPIES: Set to (CPU cores - 1) for optimal performance
Model choice:
llama3.2:1b = faster, less accurate
llama3.2:3b = balanced (recommended)
llama2:7b = slower, more accurate
Results
Input Data
Output Data with Sentiment Analysis
| review_id | Customer | Product | Sentiment | Score | Confidence | Assessment |
| --- | --- | --- | --- | --- | --- | --- |
| 1 | Sarah Johnson | Laptop Pro 15 | positive | 0.9 | 90% | ✅ Correct |
| 2 | Mike Chen | Wireless Mouse | negative | -0.6 | 80% | ✅ Correct |
| 3 | Emily Rodriguez | USB-C Hub | neutral | -0.33 | 70% | ✅ Correct |
View the complete results at: ~/LLM-PDI-Integration/workshops/workshop-01-sentiment-analysis/datasets/sentiment_results_optimized_timestamp.csv
Data Quality
Learn how to use Large Language Models (LLMs) to automatically clean, standardize, and enhance data quality in your ETL pipelines. This workshop demonstrates using Ollama with Pentaho Data Integration (PDI) to fix common data quality issues like inconsistent formatting, invalid data, and missing information.
Common Data Quality Problems:
Inconsistent name formatting (john smith vs JOHN SMITH vs John Smith)
Invalid or malformed email addresses
Multiple phone number formats (+1-555-123-4567 vs 555.123.4567 vs (555) 123-4567)
Incomplete or inconsistent addresses
Company name variations (ACME CORP vs Acme Corp vs acme corp)
Traditional Solutions vs LLM Approach:
| Approach | Pros | Cons |
| --- | --- | --- |
| Regex/Rules | Fast, deterministic | Brittle, requires constant updates |
| Data Quality Tools | Comprehensive | Expensive, complex setup |
| Manual Cleaning | Accurate | Doesn't scale |
| LLM Approach | Flexible, intelligent, handles edge cases | Requires LLM infrastructure |
Workflow

Verify Ollama Installation
Run through the following steps to build data_quality_optimized.ktr:
Understanding Data Quality Challenges
Step 1: Examine the Raw Data
Navigate to the workshop folder and review the sample data:
Sample Records:
Data Quality Issues Identified:
| ID | Name | Email | Phone | Address | Company |
| --- | --- | --- | --- | --- | --- |
| 1001 | Lowercase | Mixed case | Dots separator | Lowercase, abbreviations | Lowercase |
| 1002 | All caps | Incomplete domain | Valid format | Good | Mixed case |
| 1005 | Good | Missing domain | Dashes | Abbreviations | Good |
| 1010 | Good | Valid | Parentheses | Lowercase | Good |
Step 2: Define Quality Standards
Our target output standards:
| Field | Standard | Example |
| --- | --- | --- |
| Name | Title Case | John Smith |
| Phone | +1-555-123-4567 | +1-555-123-4567 |
| Address | Street, City, State ZIP | 123 Main St Apt 5, New York, NY |
| Company | Proper Business Name | Acme Corp |
Step 3: Traditional vs LLM Approach
Solution Comparison:
| Approach | Pros | Cons | Example |
| --- | --- | --- | --- |
| Regex/Rules | Fast, deterministic | Brittle, requires constant updates for edge cases | phone.replace(/[^\d]/g, '') |
| Data Quality Tools | Comprehensive features | Expensive ($20K-$50K+), complex setup (weeks) | Informatica, Talend DQ |
| Manual Cleaning | 100% accurate | Doesn't scale, labor intensive | Excel find/replace |
| LLM Approach | Flexible, intelligent, handles edge cases | Requires LLM infrastructure | "Clean and standardize this data..." |
Ollama API Endpoint
Ollama provides a REST API at http://localhost:11434
Key Endpoint: /api/generate
Sample Request Format
Key Parameters:
model: llama3.2:3b - Smaller, faster model optimized for structured tasks
prompt: Compact instructions with example format
stream: false - Get complete response at once
keep_alive: "5m" - Keep model loaded for 5 minutes (faster subsequent requests)
temperature: 0.1 - Low randomness for consistent formatting
num_predict: 300 - Limit output tokens
Sample Response Format
Response Fields:
response: Contains the cleaned JSON data (as a string)
done: true when generation is complete
prompt_eval_count: Input tokens processed (85 tokens)
eval_count: Output tokens generated (45 tokens)
Total tokens: 130 tokens per record
To test the API manually, send a sample request with curl:
Expected Response:
Notice:
Name converted to Title Case
Email marked as INVALID (incomplete domain @company)
Phone already in correct format
Address capitalized and formatted
Company name preserved (already correct)

Run through the following steps to build data_quality_optimized.ktr:
JSON input
Creates the LLM prompt from input data
Input: Raw customer fields (name, email, phone, address, company_name)
Output: llm_prompt (string)
Build Optimized Prompt (Modified Java Script Value)
JavaScript code:
Prompt Optimization Techniques:
❌ Removed: Verbose explanations ("Clean and standardize this customer record...")
❌ Removed: Detailed field descriptions ("Full Name in Title Case")
✅ Kept: Clear format example in JSON
✅ Kept: Abbreviated field labels to reduce tokens
Result: 50% shorter → 50% faster processing
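A hypothetical reconstruction of such a compact prompt builder (the exact workshop code and sample values may differ):

```javascript
// Sketch of a compact prompt builder for the data quality workshop.
// The row values below are samples; in PDI they arrive as row fields.
var name = "john smith";
var email = "JOHN.SMITH@COMPANY";
var phone = "555.123.4567";
var address = "123 main st apt 5, new york, ny";
var company_name = "acme corp";

// Abbreviated labels plus a format example, no verbose explanations.
var llm_prompt =
  'Clean: name="' + name + '", email="' + email + '", phone="' + phone +
  '", address="' + address + '", company="' + company_name + '"\n' +
  'Return JSON: {"name":"Title Case","email":"valid or INVALID",' +
  '"phone":"+1-XXX-XXX-XXXX","address":"Street, City, ST ZIP",' +
  '"company":"Proper Name"}';
```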
Modified Java Script Value
Wraps the prompt into Ollama API request format.
Input: llm_prompt (from Step 2) + transformation parameters
Output: request_body (JSON string ready for API)
Why separate? Handles API-specific configuration (model, temperature, keep_alive). Separates prompt logic from API plumbing.
Double-click on the MJV - Build JSON prompt - step to review the settings:

Build JSON Request (Modified Java Script Value)
Use getVariable() for Parameters
Why getVariable()?
"${MODEL_NAME}"→ DOES NOT WORK in JavaScript strings (stays literal)getVariable("MODEL_NAME", "llama3.2:3b")→ WORKS (resolves to actual value)
REST Client
Double-click on the REST client step to review settings:

Call Ollama API (Parallel) - REST Client
Configuration:
Step Type: REST Client (Rest)
Method: POST
URL: ${OLLAMA_URL}/api/generate
Body Field: request_body
Application type: TEXT PLAIN
Result Fields:
Name: api_response
Code: result_code
Response time: response_time
Headers: (leave empty - REST Client auto-adds Content-Type)
Step Copies: ${STEP_COPIES} (default: 4, for parallel processing)
Step 7: Write Enhanced Data (Text File Output)
Configuration:
Filename: ../data/customer_data_enhanced_optimized
Extension: .csv
Add date: Y (adds _20260227)
Add time: Y (adds _134529)
Result: customer_data_enhanced_optimized_20260227_134529.csv
Fields: customer_id, enhanced_name, enhanced_email, enhanced_phone, enhanced_address, enhanced_company
Transformation Parameters
Basic Transformation (data_quality_enhancement.ktr)
| Parameter | Default | Description |
| --- | --- | --- |
| OLLAMA_URL | http://localhost:11434 | Ollama API endpoint |
| MODEL_NAME | llama3.2:3b | Model to use for data cleaning |
| INPUT_FILE | ../data/customer_data_raw.csv | Input data path |
Optimized Transformation (data_quality_enhancement_optimized.ktr)
| Parameter | Default | Description |
| --- | --- | --- |
| OLLAMA_URL | http://localhost:11434 | Ollama API endpoint |
| MODEL_NAME | llama3.2:3b | Model to use (llama3.2:3b recommended) |
| INPUT_FILE | ../data/customer_data_raw.csv | Input data path |
| KEEP_ALIVE | 5m | Keep model in memory (5m/15m/30m/60m) |
| STEP_COPIES | 4 | Parallel API calls (set to CPU cores - 1) |
Parameter Tuning Guide:
4 cores: STEP_COPIES 4, ~1.0-1.2 rec/sec
8 cores: STEP_COPIES 6-8, ~1.5-2.0 rec/sec
16 cores: STEP_COPIES 12-14, ~2.5-3.5 rec/sec
Tested Configuration (Verified Working ✅)
Test Environment:
OS: Ubuntu 22.04 Linux
PDI: 11.0.0.0-237
Ollama: Latest
Model: llama3.2:3b
CPU: 4 cores
RAM: 16GB
Verified Parameters:
Parse JSON Response (Modified Java Script Value)
Error Handling Strategy:
Start with original values as fallback
Try to parse the Ollama JSON response
Extract the JSON object from the response text (handles markdown code blocks)
Parse individual fields, falling back to the originals
Set parsing_error = "Y" if anything fails
Keep original values on error
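The strategy above can be sketched as an MJV-style function. The field names are illustrative, and the sketch assumes the model's text has already been pulled out of Ollama's response envelope.

```javascript
// Parse the model's output defensively: start from the original values,
// locate the outermost {...} (which also strips markdown code fences),
// and only overwrite a field when the parsed value is present.
function parseEnhancedRecord(apiResponse, original) {
  var result = {
    enhanced_name: original.name,      // fallback: keep originals
    enhanced_email: original.email,
    parsing_error: "N"
  };
  try {
    var start = apiResponse.indexOf("{");
    var end = apiResponse.lastIndexOf("}");
    if (start < 0 || end <= start) { throw new Error("no JSON object found"); }
    var parsed = JSON.parse(apiResponse.substring(start, end + 1));
    if (parsed.name) { result.enhanced_name = parsed.name; }
    if (parsed.email) { result.enhanced_email = parsed.email; }
  } catch (e) {
    result.parsing_error = "Y";        // flag the row, keep original values
  }
  return result;
}
```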
Data Enrichment
Learn how to use Large Language Models (LLMs) to automatically enrich incomplete data by inferring missing information from context. This workshop demonstrates using Ollama with Pentaho Data Integration (PDI) to fill gaps in customer records, classify companies by industry, and add valuable business intelligence.
Workflow

Verify Ollama Installation
Run through the following steps to build data_enrichment_optimized.ktr:
Understanding Data Enrichment Challenges
Examine the incomplete data
Navigate to the workshop folder and review the sample data:
Sample records:
Data completeness analysis:
1001: ✅ ✅ ✅ ❌ ❌ ❌ ✅ ❌
1002: ✅ ❌ ❌ ✅ ✅ ✅ ❌ ✅
1003: ✅ ✅ ✅ ❌ ❌ ✅ ✅ ❌
1004: ✅ ❌ ✅ ✅ ✅ ❌ ✅ ❌
1005: ✅ ✅ ❌ ❌ ✅ ✅ ✅ ✅
Missing data patterns
40% missing website.
35% missing contact name.
50% missing phone.
30% missing full address.
25% missing state.
45% missing country.
Define enrichment goals
Set target rules for each field:
Website: infer from company name + domain patterns (acmecorp.com → www.acmecorp.com)
Contact: keep as UNKNOWN if not provided (John Smith or UNKNOWN)
Phone: infer the area code from city/state (555-1234 → +1-415-555-1234 for SF)
Address: keep the street or mark UNKNOWN (123 Main St or UNKNOWN)
City: infer from state if missing (Texas → Houston, likely)
State: infer from city or use the 2-letter code (California → CA)
Country: default to USA if empty (USA)
Industry (new): classify from company name/website (TechStart Inc → Technology)
Employee Range (new): estimate from company name patterns (Acme Corp → 51-200)
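Several of these rules are deterministic and can be applied without the LLM at all, leaving only the genuinely ambiguous fields (industry, employee range, city) to the model. A sketch; the helper name and record fields are illustrative:

```javascript
// Apply the deterministic enrichment rules up front; fields that need
// real inference (industry, employee range) are left for the LLM.
function applyDeterministicRules(rec) {
  var out = {};
  for (var k in rec) { out[k] = rec[k]; }          // shallow copy
  if (!out.country) { out.country = "USA"; }       // default country
  if (!out.website && out.domain) {
    out.website = "www." + out.domain;             // domain-pattern rule
  }
  if (!out.contact) { out.contact = "UNKNOWN"; }   // explicit unknowns
  if (!out.address) { out.address = "UNKNOWN"; }
  return out;
}
```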
Compare enrichment and quality enhancement
Workshop 2 (Data quality):
Goal: fix incorrect or inconsistent data.
Input: messy but complete data.
Output: clean, standardized data.
Example: john smith → John Smith.
Workshop 3 (Data enrichment):
Goal: add missing information.
Input: incomplete but clean data.
Output: complete data with inferred fields.
Example: Acme Corp → Industry: Technology, Size: 51-200.
Comparison:
Input: complete, messy data (quality) vs incomplete, clean data (enrichment)
Process: standardize and validate vs infer and classify
Output: clean existing fields vs add new fields
Risk: low (validation) vs medium (inference accuracy)
Value-add: consistency vs new business intelligence
Pick an enrichment method
Traditional approaches:
Rule-based: if company_name contains "Tech" then industry = "Technology"
❌ Brittle.
❌ Needs constant updates.
Lookup tables: match company name against a database.
❌ Limited to known companies.
❌ Expensive to maintain.
External APIs: call company data APIs (Clearbit, FullContact).
❌ Costly ($0.50+ per enrichment).
❌ Subject to rate limits and quotas.
LLM approach:
✅ Infers from company name, website, and location.
✅ Handles ambiguity with probabilistic reasoning.
✅ Works for unknown companies.
✅ Runs locally with Ollama.
✅ Enriches many fields in one prompt.
Review request and response examples
Sample request format:
Key parameters:
format: "json" enforces JSON output.
temperature: 0.3 allows some inference freedom.
num_predict: 400 allows longer enriched responses.
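Based on the parameters listed above, a request of this shape would be sent; the prompt wording is illustrative, only the parameter values come from the workshop notes:

```json
{
  "model": "llama3.2:3b",
  "prompt": "Enrich this customer record. Infer missing fields from context and return JSON only: {\"company\": \"Acme Corp\", \"website\": \"acmecorp.com\", \"contact\": \"John Smith\", \"state\": \"CA\"}",
  "stream": false,
  "format": "json",
  "keep_alive": "30m",
  "options": {
    "temperature": 0.3,
    "num_predict": 400
  }
}
```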
Sample response format:
Enriched fields:
Original: Acme Corp, acmecorp.com, John Smith, CA.
Inferred: www.acmecorp.com, San Francisco, USA.
New: Manufacturing (industry), 201-500 (estimated size).
Test the API manually
Expected response:
Notice:
Website inferred: techstart.com (a reasonable guess).
Phone enriched: added the LA area code 310.
State inferred: CA (from Los Angeles).
Industry: Technology (from "TechStart").
Employee range: 11-50 (a startup indicator from the name).

Run through the following steps to build data_enrichment_optimized.ktr:
Named Entity Recognition
This workshop demonstrates how to use Large Language Models (LLMs) with Pentaho Data Integration (PDI) to extract and classify named entities from unstructured text.
Named Entity Recognition (NER) is a critical NLP task that identifies and categorizes entities like people, organizations, locations, dates, and more.
Workflow

Verify Ollama Installation
Run through the following steps to build
named_entity_recognition_optimized.ktr:
Named Entity Recognition
Named Entity Recognition (NER) is the process of identifying and classifying named entities in unstructured text into predefined categories.
Example Input:
Example Output (Extracted Entities):
Entity Types in This Workshop
PERSON
Names of people
Sarah Johnson, Dr. Michael Chen, CEO Richard Davis
ORGANIZATION
Companies, institutions
Acme Corp, Stanford University, FBI
LOCATION
Cities, addresses, buildings
San Francisco, 123 Main St, Room 405
DATE
Dates and times
December 15th 2024, Feb 1st, 10:30 AM PST
PRODUCT
Product names/models
iPhone 16, UltraBook Pro X1, Widget X-200
MONEY
Currency amounts
$125,000, £250,000 GBP, $1,899.99
ID
Identifiers, tracking codes
CUST-98765, INV-2024-0089, ORD-2025-5678
TECHNOLOGY
Software, platforms
AWS, Python, TensorFlow, Docker
POSITION
Job titles, roles
CEO, Project Manager, CFO, VP of Engineering
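Whatever the model returns, it is worth validating that extracted entities use only these categories before writing them downstream. A sketch, assuming the model emits {type, value} pairs (the exact output schema depends on the prompt):

```javascript
// The nine entity categories used in this workshop.
var ALLOWED_TYPES = ["PERSON", "ORGANIZATION", "LOCATION", "DATE",
                     "PRODUCT", "MONEY", "ID", "TECHNOLOGY", "POSITION"];

// Split extracted entities into valid ones and ones to flag for review
// (unknown type, missing value, or null entries).
function validateEntities(entities) {
  var valid = [], rejected = [];
  for (var i = 0; i < entities.length; i++) {
    var e = entities[i];
    if (e && e.value && ALLOWED_TYPES.indexOf(e.type) >= 0) {
      valid.push(e);
    } else {
      rejected.push(e);
    }
  }
  return { valid: valid, rejected: rejected };
}
```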
Why Use LLMs for NER?
Traditional NER approaches (rule-based, statistical models, pre-trained NER models) have limitations:
Traditional Approach Challenges:
Requires extensive labeled training data
Struggles with domain-specific entities
Fixed entity type schemas
Poor performance on new/rare entity types
Cannot adapt to context easily
LLM-Based NER Advantages:
Zero-shot extraction (no training data needed)
Flexible entity type definitions
Handles multiple domains simultaneously
Contextual understanding (disambiguates entities)
Easy to add new entity types via prompt engineering
Extracts relationships and context
Real-World Use Cases:
Customer Service - Extract customer names, IDs, product references, dates from support tickets
Legal/Compliance - Identify parties, dates, amounts, locations in contracts
Log Analysis - Extract usernames, IP addresses, error codes, timestamps
Business Intelligence - Pull company names, products, revenue figures from reports
Healthcare - Extract patient names, medications, dates, doctors from medical records
Email Processing - Identify senders, recipients, dates, action items, referenced documents

Run through the following steps to build named_entity_recognition_optimized.ktr
Text Summarization
This workshop demonstrates how to use Large Language Models (LLMs) with Pentaho Data Integration (PDI) to automatically summarize long documents into concise, actionable insights. Text summarization is a critical capability for processing large volumes of documentation, extracting key information, and enabling faster decision-making.
Workflow

Verify Ollama Installation
Run through the following steps to build
text_summarization_optimized.ktr:
What is Text Summarization?
Text summarization is the process of condensing long documents into shorter versions while preserving the most important information and key points.
Example Input (Meeting Notes, 450 words):
Example Output (Summarization):
Summarization Types
Extractive: selects key sentences from the original. Use for quick overviews, news. Compression: 50-70%.
Abstractive: generates new text capturing the meaning. Use for executive summaries, reports. Compression: 70-90%.
Bullet Points: lists key points. Use for action tracking, presentations. Compression: 80-95%.
Key Takeaways: main insights and conclusions. Use for decision support. Compression: 85-95%.
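One way to support all four types from a single transformation is to vary only the instruction line of the prompt. A sketch; the instruction wording is illustrative, not the workshop's exact text:

```javascript
// Instruction line per summarization type from the table above.
var SUMMARY_INSTRUCTIONS = {
  extractive:  "Select the 3-5 most important sentences verbatim.",
  abstractive: "Rewrite the document as a concise summary in your own words.",
  bullets:     "List the key points as short bullet points.",
  takeaways:   "State only the main insights, conclusions, and action items."
};

// Build the full prompt; unknown types fall back to abstractive,
// the most generally useful mode.
function buildSummaryPrompt(summaryType, documentText) {
  var instruction = SUMMARY_INSTRUCTIONS[summaryType] ||
                    SUMMARY_INSTRUCTIONS.abstractive;
  return instruction + "\n\nDocument:\n" + documentText;
}
```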
LLM-Based Summarization (This Workshop) uses abstractive methods to generate concise, coherent summaries that:
Rephrase content in clearer language
Combine related concepts
Identify and extract action items
Prioritize most important information
Adapt to different document types
Why Use LLMs for Summarization?
Traditional summarization approaches (extractive algorithms, keyword extraction, TF-IDF) have limitations:
Traditional Approach Challenges:
Cannot rephrase or generate new text
Miss implicit meaning and context
Struggle with complex document structures
Limited to sentence selection
No understanding of priorities or importance
LLM-Based Summarization Advantages:
True abstractive summarization (rewrites in clearer language)
Understands context and implicit information
Adapts to different document types automatically
Can extract different summary formats (bullets, paragraphs, action items)
Handles technical, business, and conversational text equally well
Multi-language capable
Identifies action items and key decisions
Real-World Use Cases
Executive Reporting - Summarize weekly status reports, meeting notes, project updates for leadership review
Customer Service - Condense customer complaint details and email threads for quick agent review
Legal/Compliance - Extract key terms, obligations, and deadlines from contracts and legal documents
Research & Analysis - Summarize academic papers, market research, technical documentation
Email Management - Create brief summaries of long email threads for quick scanning
Content Curation - Generate summaries for news articles, blog posts, industry reports
Meeting Documentation - Convert meeting transcripts into summaries with action items
Technical Documentation - Create executive-friendly summaries of technical specifications

Run through the following steps to build text_summarization_optimized.ktr
Multi-Staged
This workshop teaches the most advanced LLM-ETL pattern: multi-stage pipelines where multiple LLM calls are chained sequentially, with each stage building on the outputs of previous stages.
Why Multi-Stage Pipelines?
Single LLM calls are powerful, but real-world AI systems often require:
Context accumulation: Each stage adds intelligence
Conditional logic: Different processing based on classification
Specialized prompts: Each stage focuses on one task
Error isolation: Failures in one stage don't break entire pipeline
Auditability: Track decisions at each stage
Workflow

Verify Ollama Installation
Run through the following steps to build
multi_staged_optimized.ktr:
What is a Multi-Stage Pipeline?
A multi-stage pipeline is an AI orchestration pattern where multiple LLM calls are chained together sequentially, with each stage performing a focused task and passing enriched data to the next stage.
Think of it like an assembly line in a factory:
Stage 1: Worker identifies the type of product (Classification)
Stage 2: Worker assesses quality and priority (Scoring)
Stage 3: Worker extracts specific components based on type (Conditional Extraction)
Stage 4: Worker creates assembly instructions (Action Generation)
Stage 5: Worker routes to appropriate department (Routing)
Each worker (LLM call) specializes in ONE task and has ALL the information gathered by previous workers.
Real-World Example: Intelligent Support Ticket Routing
Scenario: Your company receives 1,000 support tickets per day via email, chat, and web forms. You need to:
Categorize each ticket (bug report, feature request, billing issue, etc.)
Determine urgency (critical issues need immediate attention)
Extract relevant details (customer info, account value, issue description)
Generate action items for the assigned team
Route to the correct department with proper escalation
Single-Stage Approach (❌ What NOT to do):
Problems:
❌ 200+ token prompt (expensive and slow)
❌ LLM tries to do 6 different tasks at once (lower accuracy)
❌ Can't handle conditional logic (different ticket types need different extraction)
❌ If extraction fails, you lose everything
❌ No intermediate validation
❌ Difficult to debug which part failed
Multi-Stage Approach (✅ What we'll build):
Advantages:
✅ Focused prompts (40-80 tokens each, faster & cheaper)
✅ Higher accuracy (each LLM call does ONE thing well)
✅ Conditional logic (Stage 3 adapts based on Stage 1 result)
✅ Graceful degradation (if Stage 3 fails, you still have Stages 1-2)
✅ Auditable (see decision at each stage)
✅ Easy to debug (know exactly which stage failed)
Core Principles of Multi-Stage Pipelines
1. Single Responsibility Per Stage
Each stage has ONE job:
Stage 1: Classification ONLY
Stage 2: Priority scoring ONLY
Stage 3: Information extraction ONLY
Stage 4: Action generation ONLY
Stage 5: Routing ONLY
Why? Focused prompts produce better results than complex multi-task prompts.
2. Sequential Execution with Context Passing
Stages run in order, and each stage receives:
Original input data
ALL outputs from previous stages
Example Context Flow:
Why? Each stage makes BETTER decisions with full context from previous stages.
3. Conditional Branching
Different document types require different processing:
Why? A security incident needs different information than a billing complaint.
4. Error Isolation & Recovery
Each stage has fallback logic:
Why? One stage failure doesn't break the entire pipeline.
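That per-stage fallback can be sketched as a small wrapper; the stage and default names are illustrative:

```javascript
// Run one stage; on any failure, record the error and substitute a safe
// default so the later stages still receive a usable context.
function runStage(name, stageFn, context, fallback) {
  try {
    context[name] = stageFn(context);
  } catch (e) {
    context[name] = fallback;               // graceful degradation
    context[name + "_error"] = String(e);   // auditable failure record
  }
  return context;
}
```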
Comparison: Single-Stage vs Multi-Stage
Prompt Length: 200+ tokens vs 40-80 tokens per stage
Accuracy: 65-75% (trying to do too much) vs 85-95% (focused tasks)
Processing Time: 60-90 seconds vs 50-70 seconds (5 calls @ 10-14s each)
Cost per Document: high (long prompt) vs lower (multiple short prompts)
Conditional Logic: ❌ not possible vs ✅ full support
Error Handling: ❌ all-or-nothing vs ✅ per-stage recovery
Debugging: ❌ hard to isolate issues vs ✅ know exactly which stage failed
Auditability: ❌ black-box decision vs ✅ track reasoning at each stage
Extensibility: ❌ hard to add features vs ✅ easy to add new stages
When to Use Multi-Stage Pipelines
Use Multi-Stage Pipelines When:
✅ You need conditional processing (different types → different handling)
✅ You need to make sequential decisions (Stage 2 depends on Stage 1)
✅ You need auditability (track decision-making process)
✅ You need high accuracy (focused prompts perform better)
✅ You're building production systems (error isolation critical)
✅ Documents vary significantly in type/structure
Use Single-Stage When:
✅ Task is simple and uniform (all documents processed identically)
✅ Minimal conditional logic needed
✅ Low-stakes application (errors acceptable)
✅ Prototyping/testing (faster to build initially)
5-stage intelligent document router
Build the document router in 5 stages. Run stages in order. Pass outputs forward.
Stage 1: Classify the document
Identify what you are processing.
Input: raw document text
Output: doc_type (for example, security_incident, legal_threat, critical_ticket), confidence
Typical duration: 10–15 seconds
Stage 2: Score urgency and priority
Decide how quickly to act. Use the doc_type to score appropriately.
Input: raw text + doc_type from Stage 1
Output: urgency (1–10), priority (LOW/MEDIUM/HIGH/CRITICAL)
Typical duration: 10–15 seconds
Stage 3: Extract type-specific details
Extract only the fields that matter for the doc_type. Use conditional branching here.
Input: raw text + doc_type + urgency
Output: extracted_details (JSON; varies by type)
Typical duration: 10–15 seconds
Stage 4: Generate action items
Generate concrete tasks. Use full context from Stages 1–3.
Input: all previous context
Output: actions[], action_count, requires_escalation
Typical duration: 10–15 seconds
Stage 5: Route to the right department
Assign ownership. Set the escalation path and SLA.
Input: all previous context
Output: primary_dept, escalate_to, cc_depts, sla_hours
Typical duration: 10–15 seconds
Total processing time: 50–75 seconds per document (5 sequential LLM calls)
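The five stages above reduce to a loop that threads an accumulating context through each call. A sketch, where llmCall stands in for the REST Client step and the stage objects are illustrative:

```javascript
// Multi-stage orchestration: each stage sees the original text plus
// every earlier stage's output, mirroring the context flow above.
function runPipeline(documentText, stages, llmCall) {
  var context = { text: documentText };
  for (var i = 0; i < stages.length; i++) {
    var stage = stages[i];
    // The stage builds its focused prompt from the accumulated context...
    var prompt = stage.buildPrompt(context);
    // ...and its output is merged back in for the next stage to use.
    context[stage.name] = llmCall(prompt);
  }
  return context;
}
```

Because each stage reads from the shared context object, adding a sixth stage is just one more entry in the stages array; no earlier stage changes.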
How Context Accumulation Works
Let's trace a real document through the pipeline:
Input Document:
Stage 1 Output:
Stage 2 Output (knows it's a security_incident):
Stage 3 Output (conditional extraction for security_incident):
Stage 4 Output (knows: security + critical + SQL injection + 90 days):
Stage 5 Output (knows: critical security incident requiring escalation):
Final Enriched Document has ALL this intelligence:
Original text preserved
Classified as security_incident (95% confidence)
Rated CRITICAL with urgency 10/10
SQL injection in login form, 90-day disclosure
4 specific action items generated
Routed to Security → CTO, CC Legal & PR, 2-hour SLA
Total Time: ~55 seconds (5 LLM calls)
Key Benefits Demonstrated
Contextual Intelligence: Stage 2 knows it's a security_incident (from Stage 1), so it applies appropriate urgency heuristics
Conditional Processing: Stage 3 extracts vulnerability-specific details because Stage 1 identified it as security_incident
Compound Context: Stage 4 generates security-specific actions because it knows type + urgency + vulnerability details
Intelligent Routing: Stage 5 routes to Security + CTO because it knows: CRITICAL + security + requires_escalation
Without multi-stage? You'd get generic results. With multi-stage? You get specialized, context-aware intelligence at every step.
Common Multi-Stage Patterns
Pattern 1: Classification → Conditional Processing
Pattern 2: Scoring → Priority-Based Routing
Pattern 3: Extract → Validate → Enrich
Pattern 4: Analyze → Recommend → Execute
Real-World Applications
1. Customer Support Automation
Classify ticket type
Score urgency based on type
Extract customer info & issue
Generate resolution steps
Route to appropriate team with SLA
2. Contract Review Pipeline
3. Content Moderation
4. Resume Screening
5. Financial Document Processing
Performance Considerations
Sequential Processing Tradeoff:
Pro: Each stage makes better decisions with accumulated context
Con: Slower than single-stage (5 calls vs 1 call)
Mitigation: Each call is faster (shorter prompts), net time comparable
Optimal Pipeline Length:
3-5 stages: Sweet spot for most use cases
2 stages: Usually better as single-stage
6+ stages: Consider if all are necessary (diminishing returns)
When to Parallelize:
Parallel processing WITHIN stages (4 copies of Stage 1 for 4 documents)
NOT between stages (Stage 2 needs Stage 1 output)
Key Takeaways
Multi-stage pipelines chain LLM calls where each stage builds on previous outputs
Context accumulation enables smarter decisions at each stage
Conditional logic allows different processing paths for different document types
Error isolation prevents cascade failures
Single responsibility per stage improves accuracy
Production-grade pattern used by companies processing millions of documents
Auditability tracks decision-making at every step

Run through the following steps to build multi_staged_optimized.ktr