Text File Input
Ingest semi-structured text files into clean rows.
Workshop - Text File Input
Real-world data rarely arrives in perfect, structured formats ready for database loading. Organizations frequently receive orders, invoices, and other business documents as unstructured or semi-structured text files that require significant transformation before they can be analyzed. Learning to parse, cleanse, and structure these files is an essential skill for any data integration professional.
In this hands-on workshop, you'll work with Steel Wheels' order data delivered in a challenging text format. You'll build a complete transformation pipeline that takes messy, multi-line text records and converts them into clean, structured rows suitable for database insertion. This workshop introduces several powerful PDI steps for text manipulation, pattern matching, and data formatting techniques you'll use repeatedly when integrating data from legacy systems, EDI feeds, or flat file exports.
What you'll do
Configure the Text File Input step to read unstructured text data
Use the Flattener step to convert multi-line records into single rows
Apply Regular Expressions (RegEx) to extract specific data patterns and create capture groups
Implement the Replace in String step to remove unwanted text and formatting
Perform explicit data type conversions using the Select Values step
Format currency values and dates for proper database storage
Build a complete text processing pipeline from raw input to structured output
By the end of this workshop, you'll understand the multi-step process required to onboard flat files into database tables. You'll have practical experience with pattern matching, string manipulation, and data type conversion - core competencies that enable you to tackle even the most challenging text file formats. Instead of relying on manual data clean-up or complex pre-processing scripts, you'll build automated, repeatable transformations that handle messy data with confidence.
Prerequisites: Understanding of basic transformation concepts (steps, hops, preview); Pentaho Data Integration installed and configured
Estimated time: 30 minutes
Workshop files
Download the following files.
Keep the filenames unchanged.
Save them in your workshop folder.
Create a new transformation
Use any of these options to open a new transformation tab:
Select File > New > Transformation
Use
Ctrl+N(Windows/Linux) orCmd+N(macOS)

Review the input file first. It will guide your parsing approach.

What to notice:
Each order spans multiple lines.
Line 3 contains two values: order status and order date.
Order value includes a currency symbol ($).
There is inconsistent whitespace.
Approach
You will:
Flatten multi-line records into a single row.
Extract values into new fields (capture groups).
Clean strings (remove labels and currency symbols).
Set data types and formats (date and number).
Text File Input
Use Text file input to read the raw lines from orders.txt. Treat each line as a single string field for now.
Start Pentaho Data Integration (Spoon).
In the Design tab, expand the
Inputcategory.Drag Text file input onto the canvas.
Tip: You can also search for Text file input.
Double-click the step. Configure the file path:

Because the sample file is located in the same directory where the transformation resides, a good approach to naming the file in a way that is location independent is to use a system variable to parameterize the directory name where the file is located. In our case, the complete filename is:
${Internal.Transformation.Filename.Directory}/orders.txt
Select the Content tab. Configure it like this:

Select Fields. Select Get Fields.

The step returns one field named Field1. It has type String.
Optional: rename the step to Read orders.
Select OK.
Drag Flattener onto the canvas.
Create a hop from Read orders.
Double-click the step. Configure it like this:

Optional: rename the step to Flatten rows.
Select OK.
The data has now been flattened into records. This step enables you to define new target fields that match the number of repeating records. So Target field 1 will map to repeating record 1, and so on.
RegEx Evaluation
This step type allows you to match the String value of an input field against a text pattern defined by a regular expression. Optionally, you can use the regular expression step to extract substrings from the input text field matching a portion of the text pattern into new output fields. This is known as "capturing".
In our example, we’re going to extract and create two capture groups order_status and order_date based on the regex expression: (Delivered|Returned):(.+)
Drag RegEx Evaluation onto the canvas.
Create a hop from Flatten rows.
Double-click the step. Configure it like this:

Set Trim to both for each field. This removes leading and trailing whitespace.
Optional: rename the step to Parse status and date.
Select OK.
Summary
This RegEx uses 2 constructs, denoted by the brackets, and separated by a full colon.
(Delivered|Returned)matches either status.(.+)matches any character sequence.Use Test RegEx to verify capture groups.
A good introduction can be found at:
Replace in string
Replace in string is a simple search and replace. It also supports regular expressions and group references. Group references are picked up in the replace by string as $n where n is the number of the group.
Time to tidy up the order_value stream field data. In this step, you replace the Order Value: label with an empty string.
Drag Replace in string onto the canvas.
Create a hop from Parse status and date.
Double-click the step. Configure it like this:

Optional: rename the step to Clean order value.
Select OK.
Use the exact label text, including the trailing space. If you enable regular expressions, use Order Value:\s*.
Select values
The Select Values step is useful for selecting, removing, renaming, changing data types and configuring the length and precision of the fields on the stream. These operations are organized into different categories:
Select and Alter — Specify the exact order and name in which the fields should be placed in the output rows
Remove — Specify the fields that should be removed from the output rows
Metadata — Change the name, type, length, and precision (the metadata) of one or more fields
Drag Select values onto the canvas.
Create a hop from Clean order value.
Double-click the step. Configure it like this:

order_value
Number
#.00
order_date
Date
MMM yyyy
Optional: rename the step to Set data types.
Last updated
Was this helpful?


