Text File Input
Onboarding text files.
Real-world data rarely arrives in perfect, structured formats ready for database loading. Organizations frequently receive orders, invoices, and other business documents as unstructured or semi-structured text files that require significant transformation before they can be analyzed. Learning to parse, cleanse, and structure these files is an essential skill for any data integration professional.
In this hands-on workshop, you'll work with Steel Wheels' order data delivered in a challenging text format. You'll build a complete transformation pipeline that takes messy, multi-line text records and converts them into clean, structured rows suitable for database insertion. This workshop introduces several powerful PDI steps for text manipulation, pattern matching, and data formatting: techniques you'll use repeatedly when integrating data from legacy systems, EDI feeds, or flat file exports.
What You'll Accomplish:
Configure the Text File Input step to read unstructured text data
Use the Flattener step to convert multi-line records into single rows
Apply Regular Expressions (RegEx) to extract specific data patterns and create capture groups
Implement the Replace in String step to remove unwanted text and formatting
Perform explicit data type conversions using the Select Values step
Format currency values and dates for proper database storage
Build a complete text processing pipeline from raw input to structured output
By the end of this workshop, you'll understand the multi-step process required to onboard flat files into database tables. You'll have practical experience with pattern matching, string manipulation, and data type conversion - core competencies that enable you to tackle even the most challenging text file formats. Instead of relying on manual data clean-up or complex pre-processing scripts, you'll build automated, repeatable transformations that handle messy data with confidence.
Prerequisites: Understanding of basic transformation concepts (steps, hops, preview); Pentaho Data Integration installed and configured
Estimated Time: 30 minutes


So what do we need to do to get this into a database table?
Text File Input
The Text File Input step is used to read data from a variety of text-file types. The most commonly used formats include comma-separated values (CSV) files generated by spreadsheets, and fixed-width flat files.
The Text File Input step lets you specify a list of files to read, or a list of directories with wildcards in the form of regular expressions. In addition, you can accept filenames from a previous step, making filename handling even more generic.
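To make the wildcard behavior concrete, here is a minimal Java sketch of what "a directory plus a wildcard in the form of a regular expression" means in practice. The file names and the pattern orders_.*\.txt are illustrative, not taken from the workshop files:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

public class WildcardFiles {
    // Mimics the step's directory + wildcard (regular expression) selection:
    // keep only the file names that match the pattern.
    static List<String> matchNames(List<String> fileNames, String wildcardRegex) {
        Pattern p = Pattern.compile(wildcardRegex);
        List<String> matches = new ArrayList<>();
        for (String name : fileNames) {
            if (p.matcher(name).matches()) {
                matches.add(name);
            }
        }
        return matches;
    }

    public static void main(String[] args) {
        List<String> dirListing = List.of("orders_2004.txt", "orders_2005.txt", "readme.md");
        // "orders_.*\.txt" plays the role of the wildcard entered in the step
        System.out.println(matchNames(dirListing, "orders_.*\\.txt"));
        // prints [orders_2004.txt, orders_2005.txt]
    }
}
```

Note that the wildcard must be a regular expression, not a shell glob: orders_*.txt would not behave the way you might expect.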
Start Pentaho Data Integration.
Drag the ‘Text File Input’ step onto the canvas.
Double-click on the step, and configure the following properties:

Click on the ‘Content’ tab and configure the following properties:

Click on the ‘Get Fields’ button.
Click on the ‘Fields’ tab and notice the following properties:

Close the Step.
RegEx Evaluation
This step type allows you to match the String value of an input field against a text pattern defined by a regular expression. Optionally, you can use the regular expression step to extract substrings from the input text field matching a portion of the text pattern into new output fields. This is known as "capturing".
In our example, we’re going to extract two capture groups, order_status and order_date, based on the regular expression: (Delivered|Returned):(.+)
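Since PDI's RegEx Evaluation step uses Java regular expressions, the capturing behavior can be sketched directly in Java. The sample line "Delivered: Mar 2004 " is illustrative of the order data, and the trim mirrors the Trim: both setting configured later in this section:

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class ParseDelivered {
    // The same pattern used in the RegEx Evaluation step:
    // group 1 -> order_status, group 2 -> order_date
    static final Pattern STATUS_LINE = Pattern.compile("(Delivered|Returned):(.+)");

    static String[] capture(String line) {
        Matcher m = STATUS_LINE.matcher(line);
        if (!m.matches()) {
            return null; // no match: no capture fields are produced
        }
        // Trim both ends, as configured with Trim: both in the step
        return new String[] { m.group(1).trim(), m.group(2).trim() };
    }

    public static void main(String[] args) {
        String[] fields = capture("Delivered: Mar 2004 ");
        System.out.println(fields[0] + " | " + fields[1]); // prints Delivered | Mar 2004
    }
}
```

Group 1 matches either literal alternative (Delivered or Returned); group 2 greedily captures everything after the colon, which is why trimming the surrounding whitespace matters.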
Drag the ‘RegEx Evaluation’ step on to the canvas.
Create a hop from the ‘flatten rows’ step.
Double-click on the step, and configure the following properties:

You will also need to set Trim: both for each field. This ensures leading and trailing whitespace is removed, so the exact field value is returned.
Close the step.
Replace in string
Replace in String performs a simple search and replace. It also supports regular expressions and group references. Captured groups are referenced in the replacement string as $n, where n is the number of the group.
Time to tidy up the order_value stream field data. In this step, you replace the text ‘Order Value: ’ with nothing (an empty string).
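Both variants of the step's behavior can be sketched in Java: a plain search-and-replace that strips the label, and a regex version that uses a $1 group reference to keep only the captured amount. The value 5205.27 is an illustrative order value, not one quoted from the workshop data:

```java
public class DiscardTexts {
    // Plain search-and-replace: strip the "Order Value: " label (note the trailing space)
    static String stripLabel(String field) {
        return field.replace("Order Value: ", "");
    }

    // Regex variant with a group reference: $1 keeps only the captured amount
    static String stripLabelRegex(String field) {
        return field.replaceAll("Order Value:\\s*(.+)", "$1");
    }

    public static void main(String[] args) {
        System.out.println(stripLabel("Order Value: 5205.27"));      // prints 5205.27
        System.out.println(stripLabelRegex("Order Value: 5205.27")); // prints 5205.27
    }
}
```

The regex variant is more forgiving about whitespace after the colon, which is one reason group references are worth knowing even for simple clean-up tasks.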
Drag the ‘Replace in String’ step onto the canvas.
Create a hop from the ‘parse delivered’ step.
Double-click on the step, and configure the following properties:

Close the step.
Ensure you have correctly entered the search string ‘Order Value: ’, including the trailing space.
Select values
The Select Values step is useful for selecting, removing, renaming, changing data types and configuring the length and precision of the fields on the stream. These operations are organized into different categories:
Select and Alter — Specify the exact order and name in which the fields should be placed in the output rows
Remove — Specify the fields that should be removed from the output rows
Meta-data — Change the name, type, length and precision (the metadata) of one or more fields
Drag the Select values step onto the canvas.
Create a hop from the ‘discard texts’ step.
Double-click on the step, and configure the following properties:

Fieldname      Type      Format
order_value    Number    #.00
order_date     Date      MMM yyyy
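PDI's format masks are Java format patterns: Number masks follow DecimalFormat and Date masks follow SimpleDateFormat. The sketch below shows what the two conversions above do; the sample values "5205.27" and "Mar 2004" are illustrative:

```java
import java.text.DecimalFormat;
import java.text.DecimalFormatSymbols;
import java.text.ParseException;
import java.text.SimpleDateFormat;
import java.util.Date;
import java.util.Locale;

public class SelectValuesConversion {
    // Number mask #.00 — a Java DecimalFormat pattern
    static double parseOrderValue(String raw) {
        DecimalFormat mask = new DecimalFormat("#.00", new DecimalFormatSymbols(Locale.US));
        try {
            return mask.parse(raw).doubleValue();
        } catch (ParseException e) {
            throw new RuntimeException("Bad number: " + raw, e);
        }
    }

    // Date mask MMM yyyy — a Java SimpleDateFormat pattern
    static Date parseOrderDate(String raw) {
        SimpleDateFormat mask = new SimpleDateFormat("MMM yyyy", Locale.ENGLISH);
        try {
            return mask.parse(raw);
        } catch (ParseException e) {
            throw new RuntimeException("Bad date: " + raw, e);
        }
    }

    public static void main(String[] args) {
        System.out.println(parseOrderValue("5205.27")); // prints 5205.27
        SimpleDateFormat out = new SimpleDateFormat("MMM yyyy", Locale.ENGLISH);
        System.out.println(out.format(parseOrderDate("Mar 2004"))); // prints Mar 2004
    }
}
```

Pinning the locale matters: month abbreviations like "Mar" and the decimal separator are locale-sensitive, and a transformation that works on one machine can fail on another if the masks rely on the default locale.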