Text File Input
Onboarding text files.
Real-world data rarely arrives in perfect, structured formats ready for database loading. Organizations frequently receive orders, invoices, and other business documents as unstructured or semi-structured text files that require significant transformation before they can be analyzed. Learning to parse, cleanse, and structure these files is an essential skill for any data integration professional.
In this hands-on workshop, you'll work with Steel Wheels' order data delivered in a challenging text format. You'll build a complete transformation pipeline that takes messy, multi-line text records and converts them into clean, structured rows suitable for database insertion. This workshop introduces several powerful PDI steps for text manipulation, pattern matching, and data formatting: techniques you'll use repeatedly when integrating data from legacy systems, EDI feeds, or flat file exports.
What You'll Accomplish:
Configure the Text File Input step to read unstructured text data
Use the Flattener step to convert multi-line records into single rows
Apply Regular Expressions (RegEx) to extract specific data patterns and create capture groups
Implement the Replace in String step to remove unwanted text and formatting
Perform explicit data type conversions using the Select Values step
Format currency values and dates for proper database storage
Build a complete text processing pipeline from raw input to structured output
By the end of this workshop, you'll understand the multi-step process required to onboard flat files into database tables. You'll have practical experience with pattern matching, string manipulation, and data type conversion - core competencies that enable you to tackle even the most challenging text file formats. Instead of relying on manual data clean-up or complex pre-processing scripts, you'll build automated, repeatable transformations that handle messy data with confidence.
Prerequisites: Understanding of basic transformation concepts (steps, hops, preview); Pentaho Data Integration installed and configured
Estimated Time: 30 minutes


So what do we need to do to get this into a database table?
Text File Input
The Text File Input step is used to read data from a variety of text-file types. The most commonly used formats include comma-separated values (CSV) files generated by spreadsheets, and fixed-width flat files.
The Text File Input step lets you specify a list of files to read, or a list of directories with wildcards in the form of regular expressions. In addition, you can accept filenames from a previous step, making filename handling even more generic.
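To make the wildcard behavior concrete, here is a minimal Java sketch of what "a directory plus a wildcard in the form of a regular expression" means in practice. The file names and the pattern orders_.*\.txt are illustrative, not taken from the workshop files:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

public class WildcardFiles {
    // Mimics the step's directory + wildcard (regular expression) selection:
    // keep only the file names that match the pattern.
    static List<String> matchNames(List<String> fileNames, String wildcardRegex) {
        Pattern p = Pattern.compile(wildcardRegex);
        List<String> matches = new ArrayList<>();
        for (String name : fileNames) {
            if (p.matcher(name).matches()) {
                matches.add(name);
            }
        }
        return matches;
    }

    public static void main(String[] args) {
        List<String> dirListing = List.of("orders_2004.txt", "orders_2005.txt", "readme.md");
        // "orders_.*\.txt" plays the role of the wildcard entered in the step
        System.out.println(matchNames(dirListing, "orders_.*\\.txt"));
        // prints [orders_2004.txt, orders_2005.txt]
    }
}
```

Note that the wildcard must be a regular expression, not a shell glob: orders_*.txt would not behave the way you might expect.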
Start Pentaho Data Integration.
Drag the ‘Text File Input’ step onto the canvas.
Double-click on the step, and configure the following properties:

Click on the ‘Content’ tab and configure the following properties:

Click on the ‘Get Fields’ button.
Click on the ‘Fields’ tab and notice the following properties:

Close the Step.
RegEx Evaluation
This step type allows you to match the String value of an input field against a text pattern defined by a regular expression. Optionally, you can use the regular expression step to extract substrings from the input text field matching a portion of the text pattern into new output fields. This is known as "capturing".
In our example, we’re going to extract two capture groups, order_status and order_date, based on the regular expression: (Delivered|Returned):(.+)
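Since PDI's RegEx Evaluation step uses Java regular expressions, the capturing behavior can be sketched directly in Java. The sample line "Delivered: Mar 2004 " is illustrative of the order data, and the trim mirrors the Trim: both setting configured later in this section:

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class ParseDelivered {
    // The same pattern used in the RegEx Evaluation step:
    // group 1 -> order_status, group 2 -> order_date
    static final Pattern STATUS_LINE = Pattern.compile("(Delivered|Returned):(.+)");

    static String[] capture(String line) {
        Matcher m = STATUS_LINE.matcher(line);
        if (!m.matches()) {
            return null; // no match: no capture fields are produced
        }
        // Trim both ends, as configured with Trim: both in the step
        return new String[] { m.group(1).trim(), m.group(2).trim() };
    }

    public static void main(String[] args) {
        String[] fields = capture("Delivered: Mar 2004 ");
        System.out.println(fields[0] + " | " + fields[1]); // prints Delivered | Mar 2004
    }
}
```

Group 1 matches either literal alternative (Delivered or Returned); group 2 greedily captures everything after the colon, which is why trimming the surrounding whitespace matters.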
Drag the ‘RegEx Evaluation’ step on to the canvas.
Create a hop from the ‘flatten rows’ step.
Double-click on the step, and configure the following properties:

You will also need to set Trim: both for each field. This ensures leading and trailing whitespace is removed, so the exact field value is returned.
Close the step.
Replace in string
Replace in String performs a simple search and replace. It also supports regular expressions and group references. Captured groups are referenced in the replacement string as $n, where n is the number of the group.
Time to tidy up the order_value stream field data. In this step, you replace the text ‘Order Value: ’ with nothing (an empty string).
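Both variants of the step's behavior can be sketched in Java: a plain search-and-replace that strips the label, and a regex version that uses a $1 group reference to keep only the captured amount. The value 5205.27 is an illustrative order value, not one quoted from the workshop data:

```java
public class DiscardTexts {
    // Plain search-and-replace: strip the "Order Value: " label (note the trailing space)
    static String stripLabel(String field) {
        return field.replace("Order Value: ", "");
    }

    // Regex variant with a group reference: $1 keeps only the captured amount
    static String stripLabelRegex(String field) {
        return field.replaceAll("Order Value:\\s*(.+)", "$1");
    }

    public static void main(String[] args) {
        System.out.println(stripLabel("Order Value: 5205.27"));      // prints 5205.27
        System.out.println(stripLabelRegex("Order Value: 5205.27")); // prints 5205.27
    }
}
```

The regex variant is more forgiving about whitespace after the colon, which is one reason group references are worth knowing even for simple clean-up tasks.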
Drag the ‘Replace in String’ step onto the canvas.
Create a hop from the ‘parse delivered’ step.
Double-click on the step, and configure the following properties:

Close the step.
Ensure you have correctly entered the search string ‘Order Value: ’, including the trailing space.
Select values
The Select Values step is useful for selecting, removing, renaming, changing data types and configuring the length and precision of the fields on the stream. These operations are organized into different categories:
Select and Alter — Specify the exact order and name in which the fields should be placed in the output rows
Remove — Specify the fields that should be removed from the output rows
Meta-data — Change the name, type, length and precision (the metadata) of one or more fields
Drag the Select values step onto the canvas.
Create a hop from the ‘discard texts’ step.
Double-click on the step, and configure the following properties:

Fieldname      Type      Format
order_value    Number    #.00
order_date     Date      MMM yyyy
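PDI's format masks are Java format patterns: Number masks follow DecimalFormat and Date masks follow SimpleDateFormat. The sketch below shows what the two conversions above do; the sample values "5205.27" and "Mar 2004" are illustrative:

```java
import java.text.DecimalFormat;
import java.text.DecimalFormatSymbols;
import java.text.ParseException;
import java.text.SimpleDateFormat;
import java.util.Date;
import java.util.Locale;

public class SelectValuesConversion {
    // Number mask #.00 — a Java DecimalFormat pattern
    static double parseOrderValue(String raw) {
        DecimalFormat mask = new DecimalFormat("#.00", new DecimalFormatSymbols(Locale.US));
        try {
            return mask.parse(raw).doubleValue();
        } catch (ParseException e) {
            throw new RuntimeException("Bad number: " + raw, e);
        }
    }

    // Date mask MMM yyyy — a Java SimpleDateFormat pattern
    static Date parseOrderDate(String raw) {
        SimpleDateFormat mask = new SimpleDateFormat("MMM yyyy", Locale.ENGLISH);
        try {
            return mask.parse(raw);
        } catch (ParseException e) {
            throw new RuntimeException("Bad date: " + raw, e);
        }
    }

    public static void main(String[] args) {
        System.out.println(parseOrderValue("5205.27")); // prints 5205.27
        SimpleDateFormat out = new SimpleDateFormat("MMM yyyy", Locale.ENGLISH);
        System.out.println(out.format(parseOrderDate("Mar 2004"))); // prints Mar 2004
    }
}
```

Pinning the locale matters: month abbreviations like "Mar" and the decimal separator are locale-sensitive, and a transformation that works on one machine can fail on another if the masks rely on the default locale.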