GenAI

Generative artificial intelligence (GenAI) can create certain types of images, text, videos, and other media in response to prompts.

GenAI - OpenAI assistants

So how does it work?

  • The framework uses a Thread to maintain the context of a conversation.

  • Users send Messages to the Thread, and each interaction is added to the Thread as a Message.

  • The Assistant processes the Thread and generates responses based on the conversation history and its capabilities.

  • The framework maintains state across interactions, which allows complex, multi-turn conversations.

  • Responses are generated asynchronously, so long-running tasks can be handled.

  • Assistants can work with uploaded files, analyzing and referencing them in responses.

  • Assistants can produce various types of output, including text, code, or structured data.

  • Developers can customize the Assistant's behavior through detailed instructions and model selection.
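
To make the Thread and Message flow concrete, here is a minimal in-memory sketch in Python. It is not the OpenAI SDK: the `Thread`, `Message`, and `ToyAssistant` classes, the `run` method, and the `example-model` name are all hypothetical and exist only for illustration. It also runs synchronously for simplicity, whereas the real service generates responses asynchronously.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Message:
    role: str      # "user" or "assistant"
    content: str


@dataclass
class Thread:
    """Hypothetical stand-in for a conversation thread: an ordered message log."""
    messages: List[Message] = field(default_factory=list)

    def add(self, role: str, content: str) -> Message:
        msg = Message(role, content)
        self.messages.append(msg)
        return msg


@dataclass
class ToyAssistant:
    """Hypothetical assistant: instructions plus a model name, no real model call."""
    instructions: str
    model: str = "example-model"  # placeholder, not a real model identifier

    def run(self, thread: Thread) -> Message:
        # A real assistant would send the whole history to a model.
        # This toy version only shows that the history is available to it.
        user_turns = [m for m in thread.messages if m.role == "user"]
        last = user_turns[-1].content if user_turns else ""
        reply = f"[{self.model}] seen {len(user_turns)} user message(s); last: {last}"
        return thread.add("assistant", reply)


thread = Thread()
assistant = ToyAssistant(instructions="Answer briefly.")

thread.add("user", "What is PDI?")
assistant.run(thread)
thread.add("user", "How do I parse HTML with it?")
final = assistant.run(thread)

print(final.content)
print(len(thread.messages))
```

The point of the sketch is that state lives in the Thread rather than in the Assistant, so the second reply can take the first exchange into account.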

Start Pentaho Data Integration:

HTML Parser

The HTML Parser is a utility plugin for Pentaho Data Integration (PDI) that extracts desired text from HTML or XML files. It is useful for cleaning data for natural language processing tasks such as sentiment analysis and SEO keyword analysis.

  • Accepts input from both data streams and files

  • Supports parsing using XPath expressions or CSS selectors

  • Can process single files or multiple inputs from a stream

  • Compatible with local and virtual file systems

The plugin uses jsoup, a Java library that simplifies working with real-world HTML and XML. jsoup offers an easy-to-use API for URL fetching, data parsing, extraction, and manipulation using DOM API methods, CSS selectors, and XPath selectors.

The step is located in the Input folder.


Select XPath or CSS Selectors as the parsing method:

XPath

XPath (XML Path Language) is a query language for selecting nodes from an XML or HTML document. Recent jsoup releases (1.14.3 and later) support XPath directly through selectXpath; with older releases, XPath can still be applied by combining jsoup with Java's built-in XPath capabilities.

Here's an overview of some common XPath syntax:

/ - Selects from the root node

// - Selects nodes anywhere in the document

. - Selects the current node

.. - Selects the parent of the current node

@ - Selects attributes

[] - Used for predicates (conditions)

Some examples:

  • //div - Selects all div elements in the document

  • //div[@class='content'] - Selects all div elements with class 'content'

  • //h1/text() - Selects the text content of all h1 elements

  • //div[@class='content']/p - Selects all p elements that are direct children of div elements with class 'content'
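
The expressions above can be tried out with Python's standard library as a rough sketch. This is a stand-in for experimentation, not the PDI step's implementation: `xml.etree.ElementTree` supports only a subset of XPath, requires well-formed markup, needs a leading `.` for document-wide searches, and has no `text()` function, so text is read from the `.text` attribute instead. The sample markup is invented.

```python
import xml.etree.ElementTree as ET

# Invented, well-formed sample markup (ElementTree cannot parse sloppy HTML).
SAMPLE = """
<html>
  <body>
    <h1>Welcome</h1>
    <div class="content">
      <p>First paragraph</p>
      <p>Second paragraph</p>
    </div>
    <div class="sidebar">
      <p>Side note</p>
    </div>
  </body>
</html>
"""

root = ET.fromstring(SAMPLE)

# //div  ->  all div elements in the document
all_divs = root.findall(".//div")

# //div[@class='content']  ->  div elements whose class is 'content'
content_divs = root.findall(".//div[@class='content']")

# //h1/text()  ->  ElementTree has no text(); read .text from each h1 instead
h1_texts = [h1.text for h1 in root.findall(".//h1")]

# //div[@class='content']/p  ->  p elements that are direct children of those divs
content_paragraphs = [p.text for p in root.findall(".//div[@class='content']/p")]

print(len(all_divs))
print(h1_texts)
print(content_paragraphs)
```

Note that the sidebar paragraph is excluded from `content_paragraphs` because the predicate restricts the match to the div with class 'content'.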


HTML Data Source

Select HTML source:

Filepath

HTML Parser - Xpath
  1. Open the following transformation.

The data source is referenced in a path.

Linux

~/Projects/genai/html/HTML Parser - Xpath.ktr

  2. Double-click the hp: html step and configure it with the following settings:

Set path to file

Leaving the XPath field blank removes all tags and returns all of the text content.

  3. Run the transformation and preview the results.

Results - no XPath
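
The blank-XPath behavior described in step 2 (all tags removed, all text returned) can be approximated in Python with the standard library. This is a minimal sketch, not the plugin's code: `TextExtractor` and `strip_tags` are hypothetical helper names, the sample markup is invented, and skipping `<script>` and `<style>` content is an assumption about what "content" should mean.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collects text nodes and ignores tags (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.chunks = []
        self._skip_depth = 0  # >0 while inside <script> or <style>

    def handle_starttag(self, tag, attrs):
        # Assumption: script and style bodies are not wanted as text.
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        text = data.strip()
        if text and self._skip_depth == 0:
            self.chunks.append(text)


def strip_tags(html: str) -> str:
    """Return the text content of an HTML string, joined with single spaces."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)


sample = "<html><body><h1>Tech News</h1><p>Hello <b>world</b>!</p></body></html>"
print(strip_tags(sample))
```

Joining the text nodes with spaces keeps words from adjacent elements apart, at the cost of a stray space before punctuation that follows an inline tag.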

These XPath queries help you navigate and extract specific content from homepage.html.

Select the main title:

Select all navigation links:

Select all article titles (h3 elements within articles):

Select all paragraph content within articles:

Select all author names:

Select the latest news items:

Select the footer text:

Select all section titles (h2 elements that are direct children of section elements):

Select the second article:

Select all elements with a class attribute:
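
The structure of homepage.html is not shown here, so the descriptions above can only be illustrated against an invented page. The sketch below covers only the items whose wording pins down the expression (article titles, article paragraphs, section titles, the second article, elements with a class attribute). The markup, class names, and variable names are assumptions, and `xml.etree.ElementTree` again stands in for the PDI step with its limited XPath subset.

```python
import xml.etree.ElementTree as ET

# Invented stand-in for homepage.html; the real file's structure may differ.
PAGE = """
<html>
  <body>
    <section>
      <h2>Featured</h2>
      <article class="post">
        <h3>First story</h3>
        <p>Body one</p>
      </article>
      <article class="post">
        <h3>Second story</h3>
        <p>Body two</p>
      </article>
    </section>
    <footer><p>Footer text</p></footer>
  </body>
</html>
"""

root = ET.fromstring(PAGE)

# Article titles: h3 elements anywhere inside an article  (XPath: //article//h3)
article_titles = [h3.text for h3 in root.findall(".//article//h3")]

# Paragraphs within articles  (XPath: //article//p)
article_paragraphs = [p.text for p in root.findall(".//article//p")]

# Section titles: h2 direct children of section  (XPath: //section/h2)
section_titles = [h2.text for h2 in root.findall(".//section/h2")]

# Second article among its siblings  (XPath: //article[2])
second_article = root.find(".//article[2]")

# Every element carrying a class attribute  (XPath: //*[@class])
with_class = root.findall(".//*[@class]")

print(article_titles)
print(section_titles)
print(second_article.find("h3").text)
print(len(with_class))
```

The footer paragraph is left out of `article_paragraphs` because the path requires an `article` ancestor. Also note that `//article[2]` selects an article that is the second among its siblings, which matches "the second article" only when all articles share one parent, as they do in this invented page.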
