Read XML
XML data sources
Workshop - Read XML
XML (eXtensible Markup Language) remains a prevalent data exchange format across industries—from web services and APIs to EDI transactions and configuration files. Organizations frequently need to integrate XML data from partner systems, REST APIs, SOAP services, and legacy applications. Understanding how to parse and transform XML data is essential for building comprehensive data integration solutions that connect diverse systems.
In this hands-on workshop, you'll explore three different approaches for retrieving XML data using PDI's "Get data from XML" step. Steel Wheels receives data in XML format from multiple sources, and you'll learn to handle each scenario: reading from local files, fetching from web URIs, and processing XML content passed through data streams. This flexibility ensures you can integrate XML data regardless of how it's delivered to your transformation.
What You'll Accomplish:
Configure the Get data from XML step for file-based sources
Use XPath expressions to navigate and extract data from XML hierarchies
Leverage the "Get Fields" feature to automatically discover XML structure
Read XML data directly from web URIs and REST endpoints
Process XML content from field data using the stream input method
Handle data type mismatches and debug XML parsing errors using log output
Work with multiple XML input patterns in a single transformation
Understand when to use each XML input method based on your data source
By the end of this workshop, you'll have practical experience with all three XML input methods and understand which approach best fits different integration scenarios. You'll also develop troubleshooting skills by intentionally encountering and resolving a data type error—a common challenge when working with XML sources where data types aren't always explicitly defined. Rather than avoiding XML integration or relying on pre-processing scripts, you'll build native PDI solutions that handle XML data confidently and efficiently.
Prerequisites: Understanding of basic transformation concepts, familiarity with XML structure (elements, attributes, hierarchy); Pentaho Data Integration installed and configured
Estimated Time: 30 minutes



Get data from XML
The Get data from XML step reads data from XML files, using XPath expressions to select the nodes and fields to extract.
Start Pentaho Data Integration.
Drag the ‘Get data from XML’ step onto the canvas.
Double-click on the step, and configure the following properties:

Click on the Content tab, and configure the following properties:

Click on the Fields tab, and then on the ‘Get Fields’ button.

Click OK.
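As a rough illustration of what the step's XPath settings do, the Python sketch below (not PDI code) loops over the nodes matched by one XPath and reads each field relative to that node. The sample XML, the element names Rows, Row and City, and the helper read_rows are assumptions made for this example; only the Zone field name comes from the workshop. Python's ElementTree supports a limited XPath subset and does not accept absolute paths on an element, so the loop path is written as ./Row rather than /Rows/Row.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample document; element names are illustrative only.
SAMPLE_XML = """
<Rows>
  <Row><City>Boston</City><Zone>1</Zone></Row>
  <Row><City>Paris</City><Zone>2</Zone></Row>
</Rows>
"""


def read_rows(xml_text, loop_xpath, field_xpaths):
    """Return one dict per node matched by loop_xpath.

    field_xpaths maps an output field name to a path relative to the loop node.
    """
    root = ET.fromstring(xml_text)
    rows = []
    for node in root.findall(loop_xpath):
        # Every value is read as text; no data type is applied at this point.
        rows.append({name: node.findtext(path) for name, path in field_xpaths.items()})
    return rows


rows = read_rows(SAMPLE_XML, "./Row", {"City": "City", "Zone": "Zone"})
for row in rows:
    print(row)
```

Each matched Row node becomes one output row, which is the same relationship the loop XPath and the Fields tab express in the step.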
Run the Transformation
This workshop shows how to ingest an XML data source. The XML can come from:
a URI - typically passed in from a previous step
a file
a stream - XML content held in a data stream field
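The three sources differ only in how the XML text is obtained; once loaded, it is parsed the same way. The Python sketch below (not PDI code) illustrates that idea under some assumptions: the sample XML and the three helper function names are hypothetical, and a file:// URI stands in for a web address so the example runs without network access.

```python
import tempfile
import urllib.request
import xml.etree.ElementTree as ET
from pathlib import Path

# Hypothetical sample document used for all three sources.
SAMPLE_XML = "<Rows><Row><City>Boston</City></Row></Rows>"


def root_from_file(path):
    """Parse XML stored in a local file."""
    return ET.parse(path).getroot()


def root_from_url(url):
    """Fetch XML from a URL and parse the response body."""
    with urllib.request.urlopen(url) as response:
        return ET.fromstring(response.read())


def root_from_field(xml_text):
    """Parse XML content that already sits in a field value."""
    return ET.fromstring(xml_text)


with tempfile.TemporaryDirectory() as tmp:
    sample_path = Path(tmp) / "sample.xml"
    sample_path.write_text(SAMPLE_XML, encoding="utf-8")
    roots = [
        root_from_file(sample_path),
        root_from_url(sample_path.as_uri()),  # file:// URI stands in for a web URL
        root_from_field(SAMPLE_XML),
    ]

cities = [root.findtext("./Row/City") for root in roots]
print(cities)
```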
Remember to disable the hops on the second workflow.
Click the Run button in the Canvas Toolbar.
Preview the data.


In this part of the workshop, you'll pass the URL in a data stream field.
Ensure you have copied the URL to the clipboard for the XPath configuration.
Generate rows
Generate rows outputs a specified number of rows. By default, the rows are empty; however, they can contain several static fields. This step is used primarily for testing purposes. It is also useful for generating a fixed number of rows, for example, when you want exactly 12 rows for 12 months.
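The behaviour described above can be sketched in a few lines of Python (not PDI code). The helper generate_rows and the example.com address are assumptions for illustration; substitute the URL used in the workshop.

```python
def generate_rows(limit, static_fields):
    """Yield `limit` identical rows, each a copy of the static field values."""
    for _ in range(limit):
        yield dict(static_fields)


# One row carrying a URL in a field named xmlUrl (placeholder address).
rows = list(generate_rows(1, {"xmlUrl": "https://example.com/data.xml"}))
print(rows)

# Twelve empty rows, for example one per month.
months = list(generate_rows(12, {}))
print(len(months))
```

A single generated row with one static field is enough to hand a URL to the next step in the stream.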
Drag the ‘Generate Rows’ step onto the canvas.
Double-click on the step, and configure the following properties:

Get data from XML
The XML is read from the URL held in the stream field xmlUrl, which is passed on from the ‘Pass URL’ step.
Drag the ‘Get data from XML’ step onto the canvas.
Create a hop from the ‘Pass URL’ step.
Double-click on the step, and configure the following properties:

Click on the ‘Content’ tab and configure the following properties:

Click on the ‘Fields’ tab and configure the following properties:

Click on the ‘Get Fields’ button.
➡️ Next: Dummy
Run the Transformation
Remember to enable the hops and disable the hop in Workflow 1: XML - File.
The workflow will fail. Do you know why?
Click the Run button in the Canvas Toolbar

Check the logs.

The log shows that the Zone values are alphanumeric, so the field's data type should be String rather than Integer.
Change the Zone data type to String and re-run the transformation.
Click on the Dummy step and Preview data.
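The failure and its fix can be reproduced in miniature with the Python sketch below (not PDI code). The Zone values and the convert helper are assumptions for illustration; the real values come from the workshop's XML source. XML text carries no type information, so the declared type decides whether a value such as "A3" can be converted.

```python
def convert(value, type_name):
    """Convert a raw XML text value to the declared field type."""
    if type_name == "Integer":
        return int(value)  # raises ValueError for non-numeric text
    return value  # "String": keep the text unchanged


# Hypothetical Zone values, one of which is alphanumeric.
zones = ["1", "2", "A3"]

try:
    as_integers = [convert(z, "Integer") for z in zones]
except ValueError as err:
    as_integers = None
    print(f"conversion failed: {err}")

# Declaring the field as String accepts every value.
as_strings = [convert(z, "String") for z in zones]
print(as_strings)
```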


