GenAI

Generative artificial intelligence (GenAI) can create certain types of images, text, videos, and other media in response to prompts.

So how does it work?

Users send Messages to the Thread, which the Assistant then processes.

The framework uses a Thread to maintain the context of a conversation.
Each interaction is added to the Thread as a Message.

Assistants can work with uploaded files, analyzing and referencing them in responses.

The framework maintains state across interactions, allowing for complex, multi-turn conversations.
The Assistant generates responses based on the conversation history and its capabilities.
Responses are generated asynchronously, allowing for handling of long-running tasks.

Assistants can produce various types of output, including text, code, or structured data.

Developers can fine-tune the Assistant's behavior through detailed instructions and model selection.

The HTML Parser is a utility plugin for Pentaho Data Integration (PDI) that extracts desired text from HTML or XML files. It is useful for cleaning data for natural language processing tasks such as sentiment analysis and SEO keyword analysis.

Accepts input from both data streams and files
Supports parsing using XPath expressions or CSS selectors
Can process single files or multiple inputs from a stream
Compatible with local and virtual file systems

The plugin uses jsoup, a Java library that simplifies working with real-world HTML and XML. It offers an easy-to-use API for URL fetching, data parsing, extraction, and manipulation using DOM API methods, CSS, and XPath selectors.

The step is located in the Input folder.

XPath (XML Path Language) is a query language for selecting nodes from an XML or HTML document. jsoup is built around CSS selectors, and native XPath support (selectXpath) only arrived in later releases, from version 1.14.3; on earlier versions, jsoup can be combined with Java's built-in XPath capabilities to achieve the same result.

Here's an overview of some common XPath syntax:

/ - Selects from the root node

// - Selects nodes anywhere in the document

. - Selects the current node

.. - Selects the parent of the current node

@ - Selects attributes

[] - Used for predicates (conditions)

Some examples:

//div - Selects all div elements in the document
//div[@class='content'] - Selects all div elements with class 'content'
//h1/text() - Selects the text content of all h1 elements
//div[@class='content']/p - Selects all p elements that are direct children of div elements with class 'content'

HTML Data Source

<!DOCTYPE html>
<html lang="en">
<head>
    <meta charset="UTF-8">
    <title>XPath Example Page</title>
</head>
<body>
    <header>
        <h1 id="main-title">Welcome to Our Website</h1>
        <nav>
            <ul>
                <li><a href="#home">Home</a></li>
                <li><a href="#about">About</a></li>
                <li><a href="#contact">Contact</a></li>
            </ul>
        </nav>
    </header>

    <main>
        <section id="featured-articles">
            <h2>Featured Articles</h2>
            <article>
                <h3>Article 1</h3>
                <p class="content">This is the content of article 1.</p>
                <span class="author">By John Doe</span>
            </article>
            <article>
                <h3>Article 2</h3>
                <p class="content">This is the content of article 2.</p>
                <span class="author">By Jane Smith</span>
            </article>
        </section>

        <section id="latest-news">
            <h2>Latest News</h2>
            <ul>
                <li>News item 1</li>
                <li>News item 2</li>
                <li>News item 3</li>
            </ul>
        </section>
    </main>

    <footer>
        <p>&copy; 2024 Our Website. All rights reserved.</p>
    </footer>
</body>
</html>

Transformation

Filepath

The data source is referenced in a path.

Open the following transformation:

Windows

C:/Projects/genai/html/HTML Parser - Xpath.ktr

Linux

~/Projects/genai/html/HTML Parser - Xpath.ktr

Double-click on the hp: html and configure with the following settings:

Leaving the XPath field blank removes all tags and returns all of the text content.

RUN and preview the results.

These XPath queries will help you navigate and extract specific content from homepage.html:

Select the main title:

//h1[@id='main-title']

Select all navigation links:

//nav//a

Select all article titles (h3 elements within articles):

//article/h3

Select all paragraph content within articles:

//article/p[@class='content']

Select all author names:

//span[@class='author']

Select the latest news items:

//section[@id='latest-news']//li

Select the footer text:

//footer/p/text()

Select all section titles (h2 elements that are direct children of section elements):

//section/h2

Select the second article:

(//article)[2]

Select all elements with a class attribute:

//*[@class]
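
To try a few of these queries outside PDI, the following is a minimal sketch using Python's standard xml.etree.ElementTree module. It is not what the plugin runs internally. ElementTree only implements a subset of XPath (paths must be relative, so they start with a dot, and text() is not available), and the sample page is trimmed to a well-formed fragment as an assumption, because a strict XML parser rejects the unclosed meta tag and the &copy; entity.

```python
import xml.etree.ElementTree as ET

# Trimmed, well-formed fragment of the sample page (assumption: <head> and
# <footer> are left out so that a strict XML parser accepts the markup).
PAGE = """
<html>
<body>
  <header>
    <h1 id="main-title">Welcome to Our Website</h1>
    <nav><ul>
      <li><a href="#home">Home</a></li>
      <li><a href="#about">About</a></li>
      <li><a href="#contact">Contact</a></li>
    </ul></nav>
  </header>
  <main>
    <section id="featured-articles">
      <h2>Featured Articles</h2>
      <article>
        <h3>Article 1</h3>
        <p class="content">This is the content of article 1.</p>
        <span class="author">By John Doe</span>
      </article>
      <article>
        <h3>Article 2</h3>
        <p class="content">This is the content of article 2.</p>
        <span class="author">By Jane Smith</span>
      </article>
    </section>
    <section id="latest-news">
      <h2>Latest News</h2>
      <ul><li>News item 1</li><li>News item 2</li><li>News item 3</li></ul>
    </section>
  </main>
</body>
</html>
"""

root = ET.fromstring(PAGE)


def texts(path):
    """Return the text of every element matched by an ElementTree path."""
    return [element.text for element in root.findall(path)]


main_title = texts(".//h1[@id='main-title']")
nav_links = texts(".//nav//a")
article_titles = texts(".//article/h3")
authors = texts(".//span[@class='author']")
news_items = texts(".//section[@id='latest-news']//li")

print(main_title, nav_links, article_titles, authors, news_items, sep="\n")
```

Each list holds the text of the matched elements, which is roughly what the step returns as rows in the preview.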

Filepath from stream

The data source is referenced as a filepath in a datastream field.

Enable the hop between: dg: filepath from stream -> hp: parse html xpath.
Disable the hops between:
Data Grid -> hp: parse html xpath
dg: html from stream -> hp: parse html xpath

Double-click on the hp: html and configure with the following settings:

RUN and preview the results.

HTML from stream

The data source is referenced as <html> in a data stream field.

Pentaho's data streams often use binary fields to handle various types of data, including large text objects like HTML. By using a binary field, you ensure that the entire HTML content is treated as a single, uninterpreted chunk of raw bytes within the Pentaho pipeline.

Storing the HTML as binary data allows you to pass the raw content through various steps in your Pentaho transformation without Pentaho trying to interpret or modify the HTML prematurely.

Enable the hop between: dg: html from stream -> hp: parse html xpath.
Disable the hops between:
dg: filepath from stream -> hp: parse html xpath
Data Grid -> hp: parse html xpath

Double-click on the hp: html and configure with the following settings:

RUN and preview the results.

CSS selectors are powerful tools for targeting specific HTML elements, and they're used not only for styling but also for selecting elements when extracting data from HTML documents.

Below are some examples of the syntax used to extract HTML snippets:

Basic Selectors

a) Element Selector:

p       /* Selects all <p> elements */
div     /* Selects all <div> elements */

b) Class Selector:

.highlight   /* Selects elements with class="highlight" */
p.highlight  /* Selects <p> elements with class="highlight" */

c) ID Selector:

#header   /* Selects the element with id="header" */

d) Universal Selector:

*   /* Selects all elements */

Combinators

a) Descendant Selector (space):

div p   /* Selects all <p> elements inside <div> elements */

b) Child Selector (>):

ul > li   /* Selects all <li> elements that are direct children of <ul> */

c) Adjacent Sibling Selector (+):

h1 + p   /* Selects the first <p> element immediately after an <h1> */

d) General Sibling Selector (~):

h1 ~ p   /* Selects all <p> elements that follow an <h1> and share the same parent */

Attribute Selectors

a) [attribute]:

[type]   /* Selects elements with a type attribute */

b) [attribute="value"]:

[type="text"]   /* Selects elements with type="text" */

c) [attribute~="value"]:

[class~="highlight"]   /* Selects elements with class containing "highlight" as a whole word */

d) [attribute^="value"]:

[href^="https"]   /* Selects elements with href starting with "https" */

e) [attribute$="value"]:

[href$=".pdf"]   /* Selects elements with href ending with ".pdf" */

f) [attribute*="value"]:

[href*="example"]   /* Selects elements with href containing "example" */

Pseudo-classes

a:first-child     /* Selects every <a> element that is the first child of its parent */
p:last-child      /* Selects every <p> element that is the last child of its parent */
li:nth-child(2n)  /* Selects every even <li> element */
input:not(:checked)  /* Selects all unchecked input elements */

Combining Selectors

div.highlight, p.important   /* Selects <div> with class "highlight" and <p> with class "important" */

HTML Data Source

<!DOCTYPE html>
<html lang="en">
<head>
    <meta charset="UTF-8">
    <title>TechGadgets - Your Electronics Store</title>
</head>
<body>
    <header id="main-header">
        <h1>TechGadgets</h1>
        <nav>
            <ul>
                <li><a href="#home">Home</a></li>
                <li><a href="#products">Products</a></li>
                <li><a href="#about">About</a></li>
                <li><a href="#contact">Contact</a></li>
            </ul>
        </nav>
    </header>

    <main>
        <section id="featured-products">
            <h2>Featured Products</h2>
            <div class="product">
                <h3>Smartphone X</h3>
                <p class="description">The latest smartphone with advanced features.</p>
                <span class="price">$999</span>
            </div>
            <div class="product">
                <h3>Laptop Pro</h3>
                <p class="description">Powerful laptop for professionals.</p>
                <span class="price">$1499</span>
            </div>
        </section>

        <section id="about">
            <h2>About Us</h2>
            <p>TechGadgets is your one-stop shop for all electronics needs.</p>
        </section>

        <section id="newsletter">
            <h2>Subscribe to Our Newsletter</h2>
            <form>
                <input type="email" name="email" placeholder="Enter your email">
                <button type="submit">Subscribe</button>
            </form>
        </section>
    </main>

    <footer>
        <p>&copy; 2024 TechGadgets. All rights reserved.</p>
    </footer>
</body>
</html>

Transformation

Filepath

The data source is referenced in a path.

Open the following transformation:

Windows

C:/Projects/genai/html/HTML Parser - CSS.ktr

Linux

~/Projects/genai/html/HTML Parser - CSS.ktr

Double-click on the hp: html and configure with the following settings:

Leaving the CSS field blank removes all tags and returns all of the text content.

RUN and preview the results.

These CSS queries will help you navigate and extract specific content from landingpage.html:

Select the main title:

h1

Select all navigation links:

nav a

Select all product titles:

.product h3

Select all product descriptions:

.product .description

Select all product prices:

.product .price

Select the "About Us" section:

#about

Select the newsletter form:

#newsletter form

Select all section titles (h2 elements):

main h2

Select the footer text:

footer p

Select all elements with a class of "product":

.product

Filepath from stream

The data source is referenced as a filepath in a datastream field.

Enable the hop between: dg: filepath from stream -> hp: parse html css.
Disable the hops between:
Data grid -> hp: parse html css
dg: html from stream -> hp: parse html css

Double-click on the hp: html and configure with the following settings:

RUN and preview the results.

HTML from stream

The data source is referenced as <html> in a data stream field.

Storing the HTML as binary data allows you to pass the raw content through various steps in your Pentaho transformation without Pentaho trying to interpret or modify the HTML prematurely.

Enable the hop between: dg: html from stream -> hp: parse html css.
Disable the hops between:
dg: filepath from stream -> hp: parse html css
Data grid -> hp: parse html css

Double-click on the hp: html and configure with the following settings:

RUN and preview the results.

Apache Tika is a content analysis toolkit that extracts text, metadata, and language from a variety of file formats. It's commonly used in data processing to prepare data for further analysis.

Supports a wide range of document formats including PDFs, Word documents, and HTML files.
Extracts metadata such as author, title, creation date, and language.
Can be integrated into larger data processing pipelines for automated content extraction.
Facilitates full-text search indexing and content classification.

The step is located in the Input folder.

The data source is a Word document.

The document type is referenced in a datastream field.

Word Document

The old oak tree stood sentinel at the edge of the meadow, its gnarled branches reaching skyward like ancient fingers grasping at clouds. Generations had passed beneath its sprawling canopy, each leaving whispered secrets in its bark. A gentle breeze rustled through its leaves, carrying the scent of wildflowers and distant rain. Nearby, a babbling brook wound its way through moss-covered stones, its crystalline waters reflecting the dappled sunlight filtering through the forest canopy. 
A family of deer cautiously approached the water's edge, their ears twitching at every sound. In the distance, a woodpecker's rhythmic tapping echoed through the trees, nature's own percussion. As the sun began its slow descent, the meadow came alive with the soft glow of fireflies, their bioluminescent dance a magical display against the deepening twilight. A lone owl hooted softly, heralding the arrival of night and all its mysterious inhabitants. The air grew cooler, and dew began to form on blades of grass, each droplet a miniature world reflecting the stars above. 
In this timeless moment, the boundary between earth and sky seemed to blur, and one could almost believe in the old tales of fairies and woodland spirits. As darkness settled fully over the land, the oak tree stood as it always had, a silent guardian of the forest's secrets, its roots deep in the earth, its crown brushing the heavens.

Open the following transformation:

Windows

C:/Projects/genai/tika/Read Unstructured Document- Word Doc.ktr

Linux

~/Projects/genai/tika/Read Unstructured Document- Word Doc.ktr

Double-click on the Read Unstructured Document step and configure with the following settings:

RUN and preview the results.

Base64 is a binary-to-text encoding scheme that represents binary data in an ASCII string format. It's widely used for transmitting data over media that are designed for textual data.

The BASE64 step is located in the Transform folder.

Consider the string Hi\n, where \n represents a newline. The first step in the encoding process is to obtain the binary representation of each ASCII character. This can be done by looking up the values in an ASCII-to-binary conversion table.

Each ASCII character is stored in an 8-bit byte, but each Base64 character represents only 6 bits. Therefore, the binary needs to be broken up into 6-bit chunks.

Finally, these 6-bit values can be converted into the appropriate printable character by using a Base64 table.

Since Base64 works on 24-bit sequences (three bytes in, four characters out), padding is needed when the original binary cannot be divided evenly into 24-bit groups. You have probably seen this type of padding before, represented by printed equal signs (=). For example, Hi without a newline is represented by only two 8-bit ASCII characters (for a total of 16 bits). Padding is removed by the Base64 encoding scheme when data is decoded.
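
The steps above can be traced with a short sketch. It uses Python's standard base64 module rather than the PDI BASE64 step; the helper names to_bits and six_bit_chunks are made up for this illustration.

```python
import base64


def to_bits(data):
    """Concatenate the 8-bit binary representation of every byte."""
    return "".join(f"{byte:08b}" for byte in data)


def six_bit_chunks(bits):
    """Split a bit string into the 6-bit groups that Base64 works on."""
    return [bits[i:i + 6] for i in range(0, len(bits), 6)]


with_newline = b"Hi\n"     # three bytes = 24 bits, so no padding is needed
without_newline = b"Hi"    # two bytes = 16 bits, so padding is needed

bits = to_bits(with_newline)
chunks = six_bit_chunks(bits)

encoded_with_newline = base64.b64encode(with_newline).decode("ascii")
encoded_without_newline = base64.b64encode(without_newline).decode("ascii")

# Decoding drops the padding and restores the original bytes.
decoded = base64.b64decode(encoded_without_newline)

print(bits)                     # 010010000110100100001010
print(chunks)                   # ['010010', '000110', '100100', '001010']
print(encoded_with_newline)     # SGkK
print(encoded_without_newline)  # SGk=
print(decoded)                  # b'Hi'
```

The four 6-bit values 18, 6, 36 and 10 map to S, G, k and K in the Base64 table. Without the newline only 16 bits are available, so the third group is zero-filled and one = is appended.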

Pentaho GenAI is an extension of Pentaho Data Integration that incorporates generative AI capabilities. It aims to enhance data workflows by leveraging large language models and other AI technologies.

OpenAI released an API platform that enables the creation of 'assistants' that can perform a wide range of tasks:

Natural Language Querying: Users can ask questions or provide prompts to large language models (LLMs) hosted by providers such as OpenAI and Azure OpenAI, allowing for natural language interaction with data and systems.

Document Analysis: The plugin supports attaching documents for LLMs to process, enabling users to analyze and extract insights from text files and related documents.

Sentiment Analysis: The plugin can, for example, be used to determine the sentiment of text data, such as tweets.

Log Analysis: Process and analyze log files, potentially for troubleshooting or identifying patterns.

Structured Data Generation: The plugin supports generating responses in both text and JSON formats, allowing for the creation of structured data from natural language inputs.

Data Extraction and Transformation: The plugin can be used within Pentaho Data Integration (PDI) workflows, assisting in extracting and transforming data as part of larger ETL processes.

Question Answering: The plugin supports using document embeddings to efficiently answer multiple questions about one or more documents, making it useful for information retrieval and FAQ-style applications.

Prompt Engineering: Users can create structured templates and use PDI environment variables for dynamic prompt generation, allowing for flexible and customizable interactions with LLMs.

Moderation and Content Filtering: The plugin includes options for response moderation, which can be used to filter or flag potentially harmful or inappropriate content.

Resources

Pentaho Data Integration

Let's start exploring some simple chat scenarios:

Enter the prompt directly.
Pass the prompt and 'role' in data stream fields.
Configure the step to use your own OpenAI account details.

The step is located in the AI folder.

Enter Prompt

Enable the hop between: Data Grid -> AI Chat.
Disable the hop between: User Input -> AI Chat.

Open the following transformation:

Windows

C:/Projects/genai/ai chat/.ktr

Linux

~/Projects/genai/html/HTML Parser - Xpath.ktr

Double-click on the AI Chat step and configure with the following settings:

Run Instruction for LLM

Role-playing with Large Language Models (LLMs), such as ChatGPT, is an emerging field that explores the interaction between AI and creative, narrative-driven experiences. It leverages the advanced capabilities of LLMs to simulate human-like dialogue and human behavior within a role-playing context.

This process enables the AI to engage in dynamic conversations, mimic various characters, and respond to user inputs in a manner that aligns with the character’s predefined traits and narrative context, drawing on what the model has learned from a large corpus of text data. In this way, role-playing makes the model more effective in tasks that require specific skills or knowledge, such as acting as a historian who provides historical facts and analyses.

Click on the Model tab.

The temperature value ranges from 0 to 2, with lower values indicating greater determinism and higher values indicating more randomness.
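
To get a feel for why a lower temperature is more deterministic, here is a minimal sketch of temperature-scaled softmax, the common textbook way of turning model scores into sampling probabilities. It is an assumption that the hosted model applies temperature in this way; the scores and the function name are made up for illustration.

```python
import math


def softmax_with_temperature(logits, temperature):
    """Convert raw scores to probabilities, dividing by the temperature first."""
    if temperature <= 0:
        raise ValueError("this sketch only handles temperature > 0")
    scaled = [score / temperature for score in logits]
    peak = max(scaled)  # subtract the maximum for numerical stability
    exps = [math.exp(value - peak) for value in scaled]
    total = sum(exps)
    return [value / total for value in exps]


# Hypothetical scores for three candidate next tokens.
logits = [2.0, 1.0, 0.0]

low = softmax_with_temperature(logits, 0.2)   # sharply peaked on the top token
high = softmax_with_temperature(logits, 2.0)  # much flatter distribution

print([round(p, 3) for p in low])
print([round(p, 3) for p in high])
```

At 0.2 almost all of the probability sits on the highest-scoring token, so repeated runs give nearly the same answer. At 2.0 the probabilities are much closer together, so the sampled output varies more.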

The moderations endpoint is a tool you can use to check whether text is potentially harmful. Developers can use it to identify content that might be harmful and take action, for instance by filtering it.

RUN and preview the result.

Prompt from Data Stream fields

You can send multiple questions to ChatGPT.

However, there are limits on the number of requests an LLM service will accept. The request will fail if the rate limit is reached.

Disable the hop between: Data Grid -> AI Chat.
Enable the hop between: User Input -> AI Chat.
Double-click on the User Input step and click on the Data tab.

Double-click on the AI Chat step and configure with the following settings:

RUN and preview the result.

Configure Model with your own account details

The 'pipeline' configuration is the same as the previous scenario.

You will need to enter your own OpenAI key.

Double-click anywhere on the canvas to configure the Parameters.

Enter your own OpenAI Key.

Double-click on the AI Chat step and configure with the following settings.

RUN and preview the result. It should be the same as in the previous scenario.

A prompt is essentially the input given to an AI model to elicit a desired output or behavior. It can range from simple questions to complex instructions or examples.

Prompt engineering is the art and science of crafting these inputs to optimize the AI's performance for specific tasks. This involves carefully selecting words, providing context, and structuring the prompt to guide the model towards producing the most accurate, relevant, and useful responses.

Double-click anywhere on the canvas to set the parameters.

Double-click on the AI Chat step and configure with the following settings:

RUN and preview the result.

Prompt - Template

With a little prompt engineering, the response can populate a 'template'.

Double-click on the AI Chat step and configure with the following settings:

Create a recipe for a ${DISHTYPE} with the following ingredients: ${INGREDIENTS}.

Structure your answer in the following way:
Recipe name: ... 
Description: ... 
Preparation time: ... 
Required ingredients: 
 - ...
 - ... 
Instructions: 
 - ...
 - ...
 Respond in JSON format.
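
PDI resolves ${DISHTYPE} and ${INGREDIENTS} from the transformation parameters before the prompt is sent. As a rough illustration of that substitution outside PDI, the sketch below uses Python's string.Template, which happens to share the ${NAME} placeholder syntax. The parameter values are hypothetical.

```python
from string import Template

# The prompt template from the AI Chat step, with its two placeholders.
PROMPT_TEMPLATE = Template(
    "Create a recipe for a ${DISHTYPE} with the following ingredients: ${INGREDIENTS}.\n"
    "\n"
    "Structure your answer in the following way:\n"
    "Recipe name: ...\n"
    "Description: ...\n"
    "Preparation time: ...\n"
    "Required ingredients:\n"
    " - ...\n"
    " - ...\n"
    "Instructions:\n"
    " - ...\n"
    " - ...\n"
    "Respond in JSON format."
)

# Hypothetical parameter values; in PDI these come from the transformation parameters.
parameters = {"DISHTYPE": "soup", "INGREDIENTS": "tomatoes, basil, garlic"}

# substitute() raises KeyError if a placeholder has no value, which catches typos early.
prompt = PROMPT_TEMPLATE.substitute(parameters)

print(prompt)
```

The first line of the resolved prompt reads "Create a recipe for a soup with the following ingredients: tomatoes, basil, garlic." and the rest of the template is passed through unchanged.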

RUN and preview the result.

Let's run through some Use Cases:

Sentiment Analysis - Determine the sentiment of a tweet: Positive, Neutral, Negative.

Log Analysis - Analyze multiple log files for any errors. AI Chat suggests resolutions for the errors, and the results are written to a CSV file.

This use case analyzes multiple log files to identify errors, then uses AI Chat to suggest a resolution. The generated result is in JSON format, and the processed output is stored as a CSV file.

Open the following transformation:

Windows

C:/Projects/genai/aichat/Usecase - Log Analysis.ktr

Linux

~/Projects/genai/aichat/Usecase - Log Analysis.ktr


The prompt has been engineered to analyze log files and identify errors, for which AI Chat then suggests resolutions.

Double-click on the AI Chat step to view the settings.

Message / Prompt

Analyze the log file from the stream and identify the issue. Once the issue is identified, respond with possible resolutions to fix the issue. Include the date (IssueDate) of when the issue occurred.
If no issue is found, then respond as "No Issues found" and Resolution as "No Resolution suggested. Log Looks fine".
Reply the answer in the below JSON template:
{
	"Issue" : "...",
	"IssueDate": "...",
	"Resolution" : "..."
}

Document

Determines the location of the data source:

File - Browse and enter the path to the data source

Stream - the data (or reference) is being passed in a data stream field. In this workshop the paths to the log files are being passed from the previous step in the filename data stream field.

Model

Click on the Model tab.

The temperature value ranges from 0 to 2, with lower values indicating greater determinism and higher values indicating more randomness.

Embedding

Click on the Embedding tab.

Enter your embedding model and choose whether to create and persist the embeddings in a file or keep the default In-Memory option.

Take a look at the RAG workshop.

Response

Click on the Response tab.

The response is held as a JSON object in the result field.

The response JSON object needs to be parsed to create our data stream fields.

Double-click on the Process generated JSON result step.

Click on the Fields tab.

Take a look at the result field (preview data in AI Chat step) to determine the structure of the JSON object / array.

This should reflect the structure of the prompt template.

 {
   "Issue": "Errors initializing Table output step and executing query job due to Simba driver limitations.",
   "IssueDate": "2023-09-07",
   "Resolution": "Use the GBQ Bulk Loader step instead of the regular Table output step to create the table and handle data inserts."
 }
...

JSON Notation

JSON Exp: $
Description: Root object
Sample: $ returns the whole JSON structure

JSON Exp: .
Description: Child operator; it's used to access different levels of the JSON structure
Sample: $..Issue returns the Issue
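
To see what the parsing step has to do, here is a minimal sketch that turns the sample result into named fields with Python's standard json and csv modules. It stands in for the Process generated JSON result step and is not how PDI implements it. The in-memory CSV buffer is an assumption made to keep the example self-contained.

```python
import csv
import io
import json

# The sample response held in the result field.
result = """
{
  "Issue": "Errors initializing Table output step and executing query job due to Simba driver limitations.",
  "IssueDate": "2023-09-07",
  "Resolution": "Use the GBQ Bulk Loader step instead of the regular Table output step to create the table and handle data inserts."
}
"""

record = json.loads(result)

# Each key of the prompt template becomes one data stream field.
fields = ["Issue", "IssueDate", "Resolution"]
row = {name: record[name] for name in fields}

# Write the row as CSV; an in-memory buffer stands in for the output file.
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=fields)
writer.writeheader()
writer.writerow(row)
csv_text = buffer.getvalue()

print(row["IssueDate"])
print(csv_text)
```

If the model drifts from the template and omits a key, record[name] raises a KeyError, which is a useful early signal that the prompt and the field definitions no longer match.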

RUN and preview the result. The output is written to log-analysis.csv.

One of the most common problems for large, high-growth businesses is dealing with increasing volumes and varieties of financial data - more specifically, extracting the data from PDF documents such as quarterly reports, balance sheets, bank statements and cash flow statements.

Without a solution to handle these data extraction tasks at scale, operations quickly become error-prone and time-consuming. This is why a growing number of organisations are now implementing AI data extraction tools.

In this use case we're going to extract sales data from PDF reports, using Pentaho Data Integration.

Review the main steps of the

The Apache Tika™ toolkit detects and extracts metadata and text from over a thousand different file types (such as PPT, XLS, and PDF).

The previous step, Get file names, returns the paths to the PDFs.

Double-click on the Read Unstructured Document step to view settings.

Based on the filenames passed in the filename field, the PDF contents are extracted and stored in the pdf_file_contents data stream field.

This is where we have to put our thinking caps on.

In the generated_response data stream field, the sales data for each SaleYear and SaleMonth is defined as a JSON object, with a SalesPerformanceByProduct array whose entries each hold:

ProductCategory

UnitSold

Revenue

This will have to be a two-stage process.

Stage 1 - extract the SaleYear and SaleMonth.

Record    SaleYear    SaleMonth
1         2024        August
2         2024        July

Stage 2 - iterate through the SalesPerformanceByProduct array for each Stage 1 record.

So, on the first iteration, SaleYear = 2024 and SaleMonth = August:

ProductCategory       UnitSold    Revenue
Eco-Gear              1,500       $150,000
Smart Home Devices    1,200       $180,000
Fitness Equipment     960         $96,000
Accessories           1,200       $74,000

This is repeated for Record 2.

generated_response


Conclusion 

August 2024 was a positive month for Acme Corporation, marked by notable growth in sales and 
customer retention. However, addressing regional disparities and capitalizing on successful 
product lines will be crucial for sustaining this growth momentum in the coming months.	
{
  "SaleYear": "2024",
  "SaleMonth": "August",
  "SalesPerformanceByProduct": [
    {
      "ProductCategory": "Eco-Gear",
      "UnitSold": "1,500",
      "Revenue": "$150,000"
    },
    {
      "ProductCategory": "Smart Home Devices",
      "UnitSold": "1,200",
      "Revenue": "$180,000"
    },
    {
      "ProductCategory": "Fitness Equipment",
      "UnitSold": "960",
      "Revenue": "$96,000"
    },
    {
      "ProductCategory": "Accessories",
      "UnitSold": "1,200",
      "Revenue": "$74,000"
    }
  ]
}


Conclusion 

July 2024 was a solid month for Acme Corporation, characterized by successful product launches 
and moderate sales growth. While the overall performance was positive, addressing regional 
disparities and improving competitive positioning in certain product categories will be essential for 
sustaining growth in the coming months.	
{
  "SaleYear": "2024",
  "SaleMonth": "July",
  "SalesPerformanceByProduct": [
    {
      "ProductCategory": "Eco-Gear",
      "UnitSold": "1,250",
      "Revenue": "$125,000"
    },
    {
      "ProductCategory": "Smart Home Devices",
      "UnitSold": "1,100",
      "Revenue": "$165,000"
    },
    {
      "ProductCategory": "Fitness Equipment",
      "UnitSold": "900",
      "Revenue": "$105,000"
    },
    {
      "ProductCategory": "Accessories",
      "UnitSold": "1,250",
      "Revenue": "$60,000"
    }
  ]
}
...
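
The two-stage process can be sketched in a few lines of Python. This is only an illustration of the flattening logic, not the PDI steps themselves. The function name flatten_sales is made up, and only the August response is included to keep the example short.

```python
import json

# The August generated_response from above, as a JSON string.
august_response = """
{
  "SaleYear": "2024",
  "SaleMonth": "August",
  "SalesPerformanceByProduct": [
    {"ProductCategory": "Eco-Gear", "UnitSold": "1,500", "Revenue": "$150,000"},
    {"ProductCategory": "Smart Home Devices", "UnitSold": "1,200", "Revenue": "$180,000"},
    {"ProductCategory": "Fitness Equipment", "UnitSold": "960", "Revenue": "$96,000"},
    {"ProductCategory": "Accessories", "UnitSold": "1,200", "Revenue": "$74,000"}
  ]
}
"""


def flatten_sales(response_json):
    """Turn one generated_response into one flat row per product category."""
    record = json.loads(response_json)

    # Stage 1: pick up the values that apply to the whole record.
    year = record["SaleYear"]
    month = record["SaleMonth"]

    # Stage 2: iterate through the array and repeat year and month on each row.
    rows = []
    for product in record["SalesPerformanceByProduct"]:
        rows.append({
            "SaleYear": year,
            "SaleMonth": month,
            "ProductCategory": product["ProductCategory"],
            "UnitSold": product["UnitSold"],
            "Revenue": product["Revenue"],
        })
    return rows


rows = flatten_sales(august_response)

for row in rows:
    print(row)
```

Calling flatten_sales once per generated_response, then concatenating the results, gives one row per month and product category, which is the shape a reporting table needs.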

It would be interesting to give this a go using the Hierarchical Data Type (HDT) EE plugin.
