
# Read XML

{% hint style="warning" %}

#### Workshop - Read XML

Read XML from a file, a URL, or a field value. Use **Get data from XML**.

**What you’ll do**

* Read XML from a local file.
* Read XML from a URL (URI).
* Use **XPath** to select nodes and fields.
* Use **Get Fields** to infer the XML structure.
* Debug a data type mismatch using the logs.

**Prerequisites:** Basic transformations. Basic XML (elements, attributes, hierarchy). PDI installed.

**Estimated time:** 30 minutes
{% endhint %}

{% embed url="<https://www.loom.com/share/85ad9973848041b9b8447ed45cffc09c?hideEmbedTopBar=true&hide_owner=true&hide_share=true&hide_title=true>" %}
Get Data from XML
{% endembed %}

***

{% hint style="info" %}
**Workshop files**

Download these files before you start:

* The sample XML input.
* The starter transformation (optional).
  {% endhint %}

{% file src="/files/cXoi8spmhLLXNFUMxNWt" %}

***

<figure><img src="/files/Df8l3Bg4KIq875nD4ReF" alt="" width="375"><figcaption><p>Read XML</p></figcaption></figure>

{% hint style="info" %}
**Create a new transformation**

Use any of these options to open a new transformation tab:

* Select **File** > **New** > **Transformation**
* Use `Ctrl+N` (Windows/Linux) or `Cmd+N` (macOS)
  {% endhint %}

***

{% tabs %}
{% tab title="1. XML - File" %}
{% hint style="info" %}

#### **XML - File**

In this workflow, an XML **file** is parsed via **XPath** to retrieve the dataset.
{% endhint %}

<figure><img src="/files/Nfy0uxdm87mjDQK7wmbL" alt="" width="375"><figcaption><p>Get data from XML - file</p></figcaption></figure>

<figure><img src="/files/ep2niXC1hpTfWwHQ9pHu" alt="" width="375"><figcaption><p>orders.xml</p></figcaption></figure>

{% tabs %}
{% tab title="1. Get data from XML" %}
{% hint style="info" %}

#### **Get data from XML**

This step reads data from XML documents using XPath expressions.
{% endhint %}
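
Conceptually, the step pairs one loop XPath, which selects the repeating node, with a relative path for each output field. The sketch below mimics that idea in Python with the standard library. The sample document, its element names, and the `extract_rows` helper are illustrative assumptions, not the contents of the workshop's `orders.xml` or PDI's implementation.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample; the real orders.xml may be structured differently.
SAMPLE_XML = """
<orders>
  <order id="1001">
    <customer>Acme</customer>
    <total>25.50</total>
  </order>
  <order id="1002">
    <customer>Globex</customer>
    <total>99.00</total>
  </order>
</orders>
"""

def extract_rows(xml_text, loop_xpath, fields):
    """Return one dict per node matched by loop_xpath.

    fields maps an output name to a path relative to the loop node:
    "@name" reads an attribute, anything else reads a child element's text.
    """
    root = ET.fromstring(xml_text.strip())
    rows = []
    for node in root.findall(loop_xpath):
        row = {}
        for name, path in fields.items():
            if path.startswith("@"):
                row[name] = node.get(path[1:])
            else:
                row[name] = node.findtext(path)
        rows.append(row)
    return rows

rows = extract_rows(
    SAMPLE_XML,
    loop_xpath="./order",
    fields={"order_id": "@id", "customer": "customer", "total": "total"},
)
for row in rows:
    print(row)
```

Every value comes back as text, which is why the step also asks for a data type per field. `ElementTree` supports only a subset of XPath, so the loop path here is written relative to the root element rather than as an absolute path such as `/orders/order`.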

1. Start Pentaho Data Integration.

{% hint style="info" %}
{% tabs %}
{% tab title="Windows (PowerShell)" %}

```powershell
Set-Location C:\Pentaho\design-tools\data-integration
.\spoon.bat
```

{% endtab %}

{% tab title="macOS / Linux" %}

```bash
cd ~/Pentaho/design-tools/data-integration
./spoon.sh
```

{% endtab %}
{% endtabs %}
{% endhint %}

2. Drag the ‘Get data from XML’ step onto the canvas.
3. Double-click on the step, and configure the following properties:

<figure><img src="/files/3Rwfg6wwzMnvuF52ohRC" alt=""><figcaption><p>XML - file</p></figcaption></figure>

4. Click on the Content tab, and configure the following properties:

<figure><img src="/files/0jLOQx1ereeA6tCTCoiZ" alt=""><figcaption><p>XPath</p></figcaption></figure>

5. Click on the Fields tab, and then on the ‘Get Fields’ button.

<figure><img src="/files/Z6I0XT7n3TC8bdlswFUa" alt=""><figcaption><p>XML - fields</p></figcaption></figure>

6. Click OK.
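
The idea behind inferring fields from a sample can be sketched as follows. This is not PDI's actual logic; `guess_type`, `infer_fields`, and the sample document are hypothetical and only show how field names, relative paths, and a coarse type might be derived from the first looped node.

```python
import xml.etree.ElementTree as ET

def guess_type(text):
    """Guess a coarse type name from a single sample value."""
    for caster, type_name in ((int, "Integer"), (float, "Number")):
        try:
            caster(text)
            return type_name
        except (TypeError, ValueError):
            continue
    return "String"

def infer_fields(xml_text, loop_xpath):
    """List (name, relative path, guessed type) from the first looped node."""
    root = ET.fromstring(xml_text)
    first = root.find(loop_xpath)
    if first is None:
        return []
    fields = [(name, "@" + name, guess_type(value))
              for name, value in first.attrib.items()]
    fields += [(child.tag, child.tag, guess_type(child.text)) for child in first]
    return fields

# Hypothetical sample document, not the workshop file.
sample = (
    '<orders><order id="1001">'
    "<customer>Acme</customer><total>25.50</total>"
    "</order></orders>"
)
for field in infer_fields(sample, "./order"):
    print(field)
```

A guess based on a sample can be wrong for later values, so it is worth reviewing the inferred types before running the transformation.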
   {% endtab %}

{% tab title="2. Dummy" %}
{% hint style="info" %}

#### **Dummy**

The Dummy step passes rows through unchanged. It is mainly a placeholder for testing. For example, it gives the input step a second step to connect to, so its rows have somewhere to go.
{% endhint %}

1. Drag a ‘Dummy’ step onto the canvas.
2. Create a hop from the ‘Get data from XML’ step.
3. Close the step dialog.
   {% endtab %}

{% tab title="3. RUN" %}
{% hint style="info" %}

#### **RUN Transformation**

The workshop illustrates how to ingest an XML data source. The XML can come from any of these:

* a previous step (typically a URL)
* a file
* a stream field (XML stored in a field)
  {% endhint %}
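
As a rough illustration of the difference between the file and stream-field sources, the sketch below parses the same XML once from a file on disk and once from a value held in a row. The helper names, the `xmlData` field name, and the sample document are assumptions made for the example. The URL variant is left out so the sketch runs offline.

```python
import tempfile
import xml.etree.ElementTree as ET
from pathlib import Path

def root_from_file(path):
    """Parse an XML file on disk and return its root element."""
    return ET.parse(path).getroot()

def root_from_field(row, field_name):
    """Parse XML text stored in one field of a row."""
    return ET.fromstring(row[field_name])

# Hypothetical sample document.
xml_text = "<catalog><item>a</item><item>b</item></catalog>"

with tempfile.TemporaryDirectory() as tmp:
    sample_path = Path(tmp) / "sample.xml"
    sample_path.write_text(xml_text, encoding="utf-8")
    from_file = root_from_file(sample_path)

from_field = root_from_field({"xmlData": xml_text}, "xmlData")

print(from_file.tag, len(from_file))
print(from_field.tag, len(from_field))
```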

{% hint style="warning" %}
Remember to disable the hops on the second workflow.
{% endhint %}

1. Click the Run button in the Canvas Toolbar.
2. Preview the data.

<figure><img src="/files/aiBBvxgp1W4V11uFDJEu" alt=""><figcaption><p>Preview data</p></figcaption></figure>
{% endtab %}
{% endtabs %}
{% endtab %}

{% tab title="2. XML - URI" %}
{% hint style="info" %}
In this workflow, a **URL** to an XML data source is parsed via **XPath** to retrieve the dataset.
{% endhint %}

<figure><img src="/files/0hZYvUgGKOiXLX0oLR70" alt="" width="375"><figcaption><p>Get XML - URL</p></figcaption></figure>

<figure><img src="/files/tSgfm42MAaiGb98J7Uc9" alt="" width="375"><figcaption><p><a href="https://www.w3schools.com/xml/plant_catalog.xml">https://www.w3schools.com/xml/plant_catalog.xml</a></p></figcaption></figure>

{% tabs %}
{% tab title="1. Generate rows - Pass URL" %}
{% hint style="warning" %}
In this workshop, you pass the URL in a data stream field.

Copy the URL to your clipboard. You will paste it into the XPath dialog.
{% endhint %}

{% hint style="info" %}

#### Generate rows

Generate rows outputs a specified number of rows. By default the rows are empty, but they can carry static fields. The step is used mainly for testing. It is also handy when you need a fixed number of rows, for example exactly 12 rows for 12 months.
{% endhint %}
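
The behaviour described above can be approximated in a few lines of Python. The `generate_rows` helper is hypothetical, and the limit of 1 for the URL row is an assumption; check the screenshot in this tab for the values the workshop actually uses.

```python
def generate_rows(limit, static_fields):
    """Return `limit` identical rows, each a copy of the static field values."""
    return [dict(static_fields) for _ in range(limit)]

# One row carrying the URL in a field (limit of 1 is assumed here).
url_rows = generate_rows(
    1, {"xmlUrl": "https://www.w3schools.com/xml/plant_catalog.xml"}
)

# Twelve empty rows, for example one per month.
month_rows = generate_rows(12, {})

print(url_rows)
print(len(month_rows))
```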

1. Drag the ‘Generate Rows’ step onto the canvas.
2. Double-click on the step, and configure the following properties:

<figure><img src="/files/H5Lp8FsHWG5llczDW4OE" alt=""><figcaption><p>Pass URL in data stream field</p></figcaption></figure>
{% endtab %}

{% tab title="2. Get data from XML - Read URL" %}
{% hint style="info" %}

#### Get data from XML

The step reads the URL from the stream field `xmlUrl`, which is passed in from the ‘Pass URL’ step, and parses the XML that the URL points to.
{% endhint %}
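
A minimal sketch of that flow, assuming each incoming row holds a URL in one field: fetch the document, loop over the repeating node, and emit one output row per node. The function names are hypothetical, and the stub document is a made-up sample whose element names are only modeled on a plant catalog. The download is injectable so the example runs without network access.

```python
import xml.etree.ElementTree as ET
from urllib.request import urlopen

def fetch_url(url):
    """Download the document at url and return its bytes."""
    with urlopen(url, timeout=30) as response:
        return response.read()

def rows_from_url_field(rows, url_field, loop_xpath, field_names, fetch=fetch_url):
    """For each incoming row, fetch the XML at row[url_field]
    and emit one output row per node matched by loop_xpath."""
    out = []
    for row in rows:
        root = ET.fromstring(fetch(row[url_field]))
        for node in root.findall(loop_xpath):
            out.append({name: node.findtext(name) for name in field_names})
    return out

def fake_fetch(url):
    """Offline stand-in for the download; returns a made-up sample."""
    return (
        b"<CATALOG><PLANT><COMMON>Bloodroot</COMMON>"
        b"<ZONE>4</ZONE></PLANT></CATALOG>"
    )

incoming = [{"xmlUrl": "https://www.w3schools.com/xml/plant_catalog.xml"}]
result = rows_from_url_field(
    incoming, "xmlUrl", "./PLANT", ["COMMON", "ZONE"], fetch=fake_fetch
)
print(result)
```

Dropping the `fetch=fake_fetch` argument makes the function download the real document instead.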

1. Drag the ‘Get data from XML’ step onto the canvas.
2. Create a hop from the ‘Pass URL’ step.
3. Double-click on the step, and configure the following properties:

<figure><img src="/files/LGHUpXn7xI4ImE6nWF1U" alt=""><figcaption><p>Read URL</p></figcaption></figure>

4. Click on the ‘Content’ tab and configure the following properties:

<figure><img src="/files/0iNCbSr2uVEGbBvSLsv9" alt=""><figcaption><p>Select XPath</p></figcaption></figure>

5. Click on the ‘Fields’ tab and configure the following properties:

<figure><img src="/files/8zp1sYLVX2VCUUcmsn45" alt=""><figcaption><p>Configure fields</p></figcaption></figure>

6. Click on the ‘Get Fields’ button.

Next: open the **Dummy** tab.
{% endtab %}

{% tab title="3. Dummy" %}
{% hint style="info" %}

#### Dummy

The Dummy step passes rows through unchanged. It is mainly a placeholder for testing. For example, it gives the input step a second step to connect to, so its rows have somewhere to go.
{% endhint %}

1. Drag a ‘Dummy’ step onto the canvas.
2. Create a hop from the ‘Get data from XML’ step.
3. Close the step dialog.
   {% endtab %}

{% tab title="4. RUN" %}
{% hint style="danger" %}

#### **RUN the Transformation**

Remember to enable the hops in this workflow and disable the hop in Workflow 1: XML - File.

The transformation will fail. Do you know why?
{% endhint %}

1. Click the Run button in the Canvas Toolbar.

<figure><img src="/files/Wp5yh9CSg71fSu0xsT7l" alt="" width="375"><figcaption><p>Invalid data type</p></figcaption></figure>

2. Check the logs.

<figure><img src="/files/D4wwKoQlc9jqIXalyIjt" alt=""><figcaption><p>Logs</p></figcaption></figure>

{% hint style="warning" %}
The Zone values are alphanumeric, so the field must be typed as String, not Integer.
{% endhint %}

3. Change the Zone data type to String and re-run the transformation.
4. Click on the Dummy step and Preview data.

<figure><img src="/files/vK9H1JYDED0g3nTNBHJB" alt=""><figcaption><p>Preview Plant Catalog</p></figcaption></figure>
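
The type mismatch can be reproduced outside PDI. The sketch below tries to convert each value to an integer and collects the ones that fail. The `non_integer_values` helper is hypothetical and the sample Zone values are illustrative; the real feed may contain different ones.

```python
def non_integer_values(values):
    """Return the values that cannot be converted to int."""
    bad = []
    for value in values:
        try:
            int(value)
        except ValueError:
            bad.append(value)
    return bad

# Illustrative Zone values, not copied from the feed.
zone_samples = ["4", "3", "Annual", "3 - 5"]
print(non_integer_values(zone_samples))  # ['Annual', '3 - 5']
```

A single non-numeric value is enough to make an Integer field fail, which is why String is the safe type here.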
{% endtab %}
{% endtabs %}
{% endtab %}
{% endtabs %}

