# Text File Input

{% hint style="warning" %}

#### Workshop - Text File Input

Real-world data rarely arrives in perfect, structured formats ready for database loading. Organizations frequently receive orders, invoices, and other business documents as unstructured or semi-structured text files that require significant transformation before they can be analyzed. Learning to parse, cleanse, and structure these files is an essential skill for any data integration professional.

In this hands-on workshop, you'll work with Steel Wheels' order data delivered in a challenging text format. You'll build a complete transformation pipeline that takes messy, multi-line text records and converts them into clean, structured rows suitable for database insertion. This workshop introduces several powerful PDI steps for text manipulation, pattern matching, and data formatting techniques you'll use repeatedly when integrating data from legacy systems, EDI feeds, or flat file exports.

**What you'll do**

* Configure the Text File Input step to read unstructured text data
* Use the Flattener step to convert multi-line records into single rows
* Apply Regular Expressions (RegEx) to extract specific data patterns and create capture groups
* Implement the Replace in String step to remove unwanted text and formatting
* Perform explicit data type conversions using the Select Values step
* Format currency values and dates for proper database storage
* Build a complete text processing pipeline from raw input to structured output

By the end of this workshop, you'll understand the multi-step process required to onboard flat files into database tables. You'll have practical experience with pattern matching, string manipulation, and data type conversion - core competencies that enable you to tackle even the most challenging text file formats. Instead of relying on manual data clean-up or complex pre-processing scripts, you'll build automated, repeatable transformations that handle messy data with confidence.

**Prerequisites:** Understanding of basic transformation concepts (steps, hops, preview); Pentaho Data Integration installed and configured

**Estimated time:** 30 minutes
{% endhint %}

{% embed url="<https://www.loom.com/share/6b3348c091764d08806280206bd53434?hideEmbedTopBar=true&hide_owner=true&hide_share=true&hide_title=true>" %}
Text File Input
{% endembed %}

***

{% hint style="info" %}
**Workshop files**

Download the following files.

Keep the filenames unchanged.

Save them in your workshop folder.
{% endhint %}

{% file src="<https://3680356391-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2FZpCSy6Skj215f4oWypdc%2Fuploads%2Fgit-blob-cc1f4366af20d6f2d1ceb4554f6f8e93e131f081%2Ftr_read_text.ktr?alt=media>" %}

{% file src="<https://3680356391-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2FZpCSy6Skj215f4oWypdc%2Fuploads%2Fgit-blob-4bcd4c6d0b7f2c1b5105bfc7fe6a8f3cb7321b49%2Forders.txt?alt=media>" %}

***

{% hint style="info" %}
**Create a new transformation**

Use any of these options to open a new transformation tab:

* Select **File** > **New** > **Transformation**
* Use `Ctrl+N` (Windows/Linux) or `Cmd+N` (macOS)
  {% endhint %}

<figure><img src="https://3680356391-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2FZpCSy6Skj215f4oWypdc%2Fuploads%2Fgit-blob-e5974522006c36725392a6f55041baccf8f330b8%2Ftxt%20file%20input.png?alt=media" alt=""><figcaption><p>Text files</p></figcaption></figure>

***

{% hint style="info" %}
Review the input file first. It will guide your parsing approach.
{% endhint %}

<figure><img src="https://3680356391-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2FZpCSy6Skj215f4oWypdc%2Fuploads%2Fgit-blob-5786bfa84e1e6e54928d905122698b43e67a1325%2Forders.txt.png?alt=media" alt="" width="563"><figcaption><p>orders.txt</p></figcaption></figure>

{% hint style="info" %}
What to notice:

* Each order spans multiple lines.
* Line 3 contains two values: order status and order date.
* Order value includes a currency symbol ($).
* There is inconsistent whitespace.
  {% endhint %}

{% hint style="info" %}
**Approach**

You will:

* Flatten multi-line records into a single row.
* Extract values into new fields (capture groups).
* Clean strings (remove labels and currency symbols).
* Set data types and formats (date and number).
  {% endhint %}

{% embed url="<https://www.loom.com/share/0646ed96c4734d16b6721c902f9c7e0e?hideEmbedTopBar=true&hide_owner=true&hide_share=true&hide_title=true>" %}
String Operations
{% endembed %}

***

{% tabs %}
{% tab title="1. Text File Input" %}
{% hint style="info" %}

#### Text File Input

Use **Text file input** to read the raw lines from `orders.txt`. Treat each line as a single string field for now.
{% endhint %}

1. Start Pentaho Data Integration (Spoon).

{% hint style="info" %}
{% tabs %}
{% tab title="Windows (PowerShell)" %}

```powershell
Set-Location C:\Pentaho\design-tools\data-integration
.\spoon.bat
```

{% endtab %}

{% tab title="macOS / Linux" %}

```bash
cd ~/Pentaho/design-tools/data-integration
./spoon.sh
```

{% endtab %}
{% endtabs %}
{% endhint %}

2. In the **Design** tab, expand the `Input` category.
3. Drag **Text file input** onto the canvas.

{% hint style="info" %}
Tip: You can also search for `Text file input`.
{% endhint %}

4. Double-click the step. Configure the file path:

<figure><img src="https://3680356391-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2FZpCSy6Skj215f4oWypdc%2Fuploads%2Fgit-blob-b5a255018531e803176d9456818748c451c9e091%2Ftfi%20-read%20data.png?alt=media" alt=""><figcaption><p>Add path to file</p></figcaption></figure>

{% hint style="info" %}
Because the sample file is located in the same directory where the transformation resides, a good approach to naming the file in a way that is location independent is to use a system variable to parameterize the directory name where the file is located. In our case, the complete filename is:

`${Internal.Transformation.Filename.Directory}/orders.txt`
{% endhint %}

5. Select the **Content** tab. Configure it like this:

<figure><img src="https://3680356391-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2FZpCSy6Skj215f4oWypdc%2Fuploads%2Fgit-blob-5eef1d4b7d77074040c20491038fbc2780ccc1aa%2Ftfi.png?alt=media" alt="" width="563"><figcaption><p>Text file input - Content</p></figcaption></figure>

6. Select **Fields**. Select **Get Fields**.

<figure><img src="https://3680356391-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2FZpCSy6Skj215f4oWypdc%2Fuploads%2Fgit-blob-6a640aca5054dcb1fa89666697e54024a3788ce5%2Ftfi%20-fields.png?alt=media" alt="" width="563"><figcaption><p>Text File input - Fields</p></figcaption></figure>

{% hint style="info" %}
The step returns one field named `Field1`. It has type **String**.
{% endhint %}

7. Optional: rename the step to **Read orders**.
8. Select **OK**.
   {% endtab %}

{% tab title="2. Row flattener" %}
{% hint style="info" %}

#### Row Flattener

Use **Flattener** to turn repeating lines into a single output row.
{% endhint %}

1. Drag **Flattener** onto the canvas.
2. Create a hop from **Read orders**.
3. Double-click the step. Configure it like this:

<figure><img src="https://3680356391-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2FZpCSy6Skj215f4oWypdc%2Fuploads%2Fgit-blob-42e592ece29140924c3ac12d8fd545e9b9c72fe5%2Fflattner.png?alt=media" alt="" width="375"><figcaption><p>Row flattener</p></figcaption></figure>

4. Optional: rename the step to **Flatten rows**.
5. Select **OK**.

{% hint style="info" %}
The data has now been flattened into records. This step enables you to define new target fields that match the number of repeating records. So Target field 1 will map to repeating record 1, and so on.
{% endhint %}
{% endtab %}

{% tab title="3. RegEx Evaluation" %}
{% hint style="info" %}

#### RegEx Evaluation

This step type allows you to match the String value of an input field against a text pattern defined by a regular expression. Optionally, you can use the regular expression step to extract substrings from the input text field matching a portion of the text pattern into new output fields. This is known as "capturing".

In our example, we’re going to extract and create two capture groups order\_status and order\_date based on the regex expression: (Delivered|Returned):(.+)
{% endhint %}

1. Drag **RegEx Evaluation** onto the canvas.
2. Create a hop from **Flatten rows**.
3. Double-click the step. Configure it like this:

<figure><img src="https://3680356391-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2FZpCSy6Skj215f4oWypdc%2Fuploads%2Fgit-blob-80f9efcf17363ad71dae4923393424b143ee2b42%2Fregex.png?alt=media" alt=""><figcaption><p>RegEx Evaluation</p></figcaption></figure>

{% hint style="warning" %}
Set **Trim** to **both** for each field. This removes leading and trailing whitespace.
{% endhint %}

4. Optional: rename the step to **Parse status and date**.
5. Select **OK**.

***

{% hint style="info" %}
**Summary**

* This RegEx uses 2 constructs, denoted by the brackets, and separated by a full colon.
* `(Delivered|Returned)` matches either status.
* `(.+)` matches any character sequence.
* Use **Test RegEx** to verify capture groups.
  {% endhint %}

A good introduction can be found at:

{% embed url="<https://regex101.com/>" %}
Link to: Online RegEx engine
{% endembed %}
{% endtab %}

{% tab title="4. Replace in String" %}
{% hint style="info" %}

#### Replace in string

Replace in string is a simple search and replace. It also supports regular expressions and group references. Group references are picked up in the replace by string as $n where n is the number of the group.

Time to tidy up the order\_value stream field data. In this step, you replace the `Order Value:` label with an empty string.
{% endhint %}

1. Drag **Replace in string** onto the canvas.
2. Create a hop from **Parse status and date**.
3. Double-click the step. Configure it like this:

<figure><img src="https://3680356391-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2FZpCSy6Skj215f4oWypdc%2Fuploads%2Fgit-blob-92a7dee714641c4d06c84e76a035d660db9e4d21%2Freplace.png?alt=media" alt=""><figcaption><p>Replace in String</p></figcaption></figure>

4. Optional: rename the step to **Clean order value**.
5. Select **OK**.

{% hint style="danger" %}
Use the exact label text, including the trailing space. If you enable regular expressions, use `Order Value:\s*`.
{% endhint %}
{% endtab %}

{% tab title="5. Select Values" %}
{% hint style="info" %}

#### Select values

The Select Values step is useful for selecting, removing, renaming, changing data types and configuring the length and precision of the fields on the stream. These operations are organized into different categories:

* Select and Alter — Specify the exact order and name in which the fields should be placed in the output rows
* Remove — Specify the fields that should be removed from the output rows
* Metadata — Change the name, type, length, and precision (the metadata) of one or more fields
  {% endhint %}

1. Drag **Select values** onto the canvas.
2. Create a hop from **Clean order value**.
3. Double-click the step. Configure it like this:

<figure><img src="https://3680356391-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2FZpCSy6Skj215f4oWypdc%2Fuploads%2Fgit-blob-18d590942fdad6883c681c8c254bded7464a7733%2Fsv%20(2).png?alt=media" alt=""><figcaption><p>Select values</p></figcaption></figure>

| Fieldname    | Data Type | Format   |
| ------------ | --------- | -------- |
| order\_value | Number    | #.00     |
| order\_date  | Date      | MMM yyyy |

{% hint style="info" %}
Optional: rename the step to **Set data types**.
{% endhint %}
{% endtab %}

{% tab title="6. RUN" %}
{% hint style="info" %}

#### Run the transformation

Run the transformation locally.
{% endhint %}

1. Click the Run button in the Canvas Toolbar.
2. Select the **Preview data** tab.

<figure><img src="https://3680356391-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2FZpCSy6Skj215f4oWypdc%2Fuploads%2Fgit-blob-8c345ac9c7b933cf6b6a25e96a83dcaa717db714%2Fresults.png?alt=media" alt=""><figcaption><p>Preview data</p></figcaption></figure>

{% hint style="success" %}
You should now have clean fields such as `order_date`, `order_status`, and `order_value`.
{% endhint %}
{% endtab %}
{% endtabs %}
