SCDs
Slowly Changing Dimensions ..
Workshop - Slowly Changing Dimensions (SCDs)
Data warehousing and dimensional modeling require sophisticated strategies for tracking how dimension attributes change over time—from customer addresses and employee job titles to product classifications and organizational structures. Organizations face a fundamental challenge: business dimensions evolve continuously, but historical analyses must reflect the correct attribute values as they existed at specific points in time. Understanding how to implement Slowly Changing Dimensions (SCDs) ensures your data warehouse accurately captures historical context, enabling time-based analysis that reflects reality rather than just current state.
In this hands-on workshop, you'll learn to use PDI's "Dimension Lookup/Update" step to implement Ralph Kimball's Type 1 and Type 2 slowly changing dimension methodologies. Steel Wheels maintains dimensional data about cities and locations that change over time—from correcting data entry errors (like "Pariss" to "Paris") to capturing legitimate business changes that require historical tracking. You'll configure transformations that automatically determine whether changes should overwrite existing values (Type 1) or create new versioned records with effective dates (Type 2), and you'll understand how to implement hybrid approaches that combine both strategies based on attribute-specific requirements.
What You'll Accomplish:
Configure the Dimension Lookup/Update step for dimensional table maintenance
Implement Type 1 SCD methodology (overwrite) for data corrections and non-historical changes
Implement Type 2 SCD methodology (versioning) for historical change tracking with effective dates
Define lookup keys that map stream records to dimension table rows (natural keys vs technical keys)
Configure field mapping strategies: Insert, Update, and Punch Through options
Understand technical key (TK) generation and surrogate key management including TK 0 for unknown values
Work with version tracking, effective dates (date_from, date_to), and current indicators
Implement automatic timestamp maintenance for last insert and update operations
Create hybrid SCDs that combine Type 1 and Type 2 strategies for different attributes
Use Dimension Lookup/Update step in lookup mode for dimension enrichment
By the end of this workshop, you'll have practical experience implementing enterprise-grade slowly changing dimension patterns that preserve historical accuracy while accommodating business change. You'll understand the critical difference between Type 1 (overwrites, no history) and Type 2 (versions, full history) approaches, when each is appropriate, and how to combine them intelligently based on attribute characteristics and business requirements. Rather than manually maintaining complex versioning logic or accepting data warehouse designs that lose historical context, you'll build native PDI solutions that automatically track dimension changes with proper surrogate keys, effective dating, and version control—ensuring your analytical reports reflect accurate historical context for time-based analysis.
Prerequisites: Understanding of dimensional modeling concepts (facts, dimensions, surrogate keys), familiarity with data warehouse design principles, database connection configuration; Pentaho Data Integration installed and configured with appropriate database connections established
Estimated Time: 40 minutes
Create a new Transformation
Any one of these actions opens a new Transformation tab for you to begin designing your transformation.
By clicking File > New > Transformation
By using the CTRL-N hot key

SCD table
First step in these series of workshops, is to create a simple DIM_SCD table that has the required fields to illustrate a Type 1 change - just overwrite the value - and Type2 - where you need to record when any change in the value occurs.
Create the Table in MySQL sampledata database.
In your DB management tool, execute the following statement.

Data grid
This step allows you to enter a static list of rows in a grid. This is usually done for testing, reference or demo purposes.
Drag the Data Grid step onto the canvas.
Open the Data Grid properties dialog box.
Ensure the following details are configured, as outlined below:

On the Data tab enter the following values:

Get system info
The Get System Info step retrieves information from the Kettle environment.
This step generates a single row with the fields containing the requested information. It also accepts input rows.
Drag the Get System Info step onto the canvas.
Open the Get System Info properties dialog box.
Ensure the following details are configured, as outlined below:

Dimension lookup/update
The Dimension Lookup/Update step allows you to implement Ralph Kimball's slowly changing dimension for both types: Type I (update) and Type II (insert) together with some additional functions. Not only can you use this step to update a dimension table, it may also be used to look up values in a dimension.
Drag the Dimension lookup/update step onto the canvas.
Open the Dimension lookup/update properties dialog box.
Ensure the following details are configured, as outlined below:
x
Lookup
This step is rarely used for lookups as specific steps exists; however, it does illustrate the database process of how referential integrity is maintained.
Add ‘London’ to the Data Grid.

To initially configure the fields set the step to Update mode.
Once completed remember to reset to lookup mode.

Ensure that the Dimension Lookup / Update is set to: Lookup Mode.

Sets the step to Update Mode with lookup keys: id
There’s no point adding the LAST_UPDATE field as we're dealing with Type 1 changes.
Click on the Fields tab.

RUN the Transformation
x
Click the Run button in the Canvas Toolbar.
Click on the Dimension lookup/update step and Preview the data.

Why does London have a TK 0 and a value of NULL?
To maintain referential integrity, the record is assigned a TK 0, and as it doesn’t exist in the database, the value returned is NULL. As this is a lookup, no record is written to the database table.
The Dimension Lookup/Update step allows you to implement Ralph Kimball's slowly changing dimension for both types:
Type 1
Overwriting the old value. In this method, no history of dimension changes is kept in the database. The old dimension value is simply overwritten with the new one. This type is easy to maintain and is often use for data which changes are caused by processing corrections (e.g. removal special characters, correcting spelling errors).
Open the Dimension Lookup / Update properties dialog box.
Ensure the following details are configured, as outlined below:

Click on the Get Fields button.
This will add the key fields used in the Lookup.
The keys that map the records (rows) in the dimension table to the stream field is: id
Click on the Fields tab and map the stream name field to the dimension name field and ensure:

There are several options available to insert / update the dimension record (row).
Insert:
This option implements a Type I & II slowly changing dimension policy. If the difference is detected for one or more mappings that have the Insert option, then a row is added to the dimension table.
Update:
This option simply updates the matched row. It can be used to implement a Type I slowly changing dimension.
Punch through:
The punch through option also performs an update. But instead of only updating the matched dimension row, it will update all versions of the row in a Type II slowly changing dimension.
Date of last insert or update (without stream field as source):
Use this option to let the step automatically maintain a date field that records the date of the insert or update using the system date field as source.
Date of last insert (without stream field as source):
Use this option to let the step automatically maintain a date field that records the date of the last insert using the system date field as source.
Date of last update (without stream field as source):
Use this option to let the step automatically maintain a date field that records the date of the last update using the system date field as source.
Last version (without stream field as source):
Use this option to let the step automatically maintain a flag that indicates if the row is the last version.
RUN Transformation
At the moment its all Type I changes .. Insert / Update record and Update when that record change occurred.
Click the Run button in the Canvas Toolbar.
Click on the Preview tab:

Examine the table in your SQL Query Tool:

So what's happening behind TK 0?
Kettle automatically inserts an additional record with a technical key of value 0 (for default or unknown values). This will only happen in the first execution. Below this record, you find the one record (London) from our sample dataset.
In update mode (update option is enabled) the step first performs a lookup of the dimension entry. The result of the lookup is different though. Not only the technical key is retrieved from the query, but also the dimension attribute fields. A field-by-field comparison then follows. The result can be one of the following situations:
The record was not found, we insert a new record in the table.
The record was found and any of the following is true:
One or more attributes were different and had an "Insert" (Kimball Type II) setting: A new dimension record version is inserted.
One or more attributes were different and had a "Punch through" (Kimball Type I) setting: These attributes in all the dimension record versions are updated.
One or more attributes were different and had an "Update" setting: These attributes in the last dimension record version are updated.
All the attributes (fields) were identical: No updates or insertions are performed.
If you mix Insert, Punch Through and Update options in this step, this algorithm acts like a Hybrid Slowly Changing Dimension. (it is no longer just Type I or II, it is a combination)
Try-out different Scenarios
Double-click on the Data Grid step.
Add the following City: Madrid

Double-click on the Dimension lookup/update step.
Click on th Fields tab and set the following:

Execute the transformation and view the results.

Note that if the record doesnt exisit (Madrid) then it gets inserted. Both records are updated with the same last_update timestamp.
The Dimension Lookup/Update step allows you to implement Ralph Kimball's slowly changing dimension for both types:
Type 2
Creating a new additional record. In this methodology, all history of dimension changes are kept in the database. You capture attribute change by adding a new row with a new surrogate key to the dimension table. Both the prior and new rows contain as attributes the natural key (or another durable identifier). Also 'effective date' and 'current indicator' columns are used in this method. There could be only one record with current indicator set to 'Y'. For 'effective date' columns, i.e. start_date and end_date, the end_date for current record usually is set to value 9999-12-31. Introducing changes to the dimensional model in Type 2 could be very expensive database operation so it is not recommended to use it in dimensions where a new attribute could be added in the future.
Double-click on the Data Grid step.
Add the following City: Pariss (incorrectly spelt 😄)

Keep the Fields strategy the same:

RUN Transformation
At the moment its all Type I changes .. Insert / Update record and Update when that record change occurred.
Click the Run button in the Canvas Toolbar.
Click on the Preview tab:

Examine the table in your SQL Query Tool:
Execute the transformation and view the results.

As expected the Pariss record is inserted. Notice that the natural primary keys (id) are the same as the TK (Technical Key) and we're on version 1 for each record.
Let's now do a bit of data entry .. Correct the entry: Pariss to Paris
Try out different Strategies
The workflows below will give you an idea of the different strategies that can be implemented.
Any records that are changed are:
Archived with date_from to date_to timestamp.
Record is assigned Version 1. New record Version 2
Note the Keys.
Double-click on the Data Grid step.
Edit the Pariss value to: Paris

Double-click on the Dimension lookup/update step.
Click on th Fields tab and set the following:
The reason for selecting Insert is obvious.. you need to insert a new record that tracks the change. If you select Update, then a Type 1 change occurs; i.e. the original record is updated.

RUN the Transformation
Click the Run button in the Canvas Toolbar.
Examine the table in your SQL Query Tool:.

A new record has been inserted, with the value ‘Paris’, with an updated date_from and last_update timestamp, and version.
Notice that the LAST_UPDATE field has also been updated for the other records.
Punch through:
The punch through option also performs an update. But instead of only updating the matched dimension row, it will update all versions of the row in a Type II slowly changing dimension.
So lets update all versions to Paris.
Lets start with a clean table to clearly illustrate the results. Truncate the table.
Repeat the workflow above and check the results.
Ensure you have set 'Pariss' as the value for city in the data grid step.
Set update as your initial strategy.

Double-click on the Data Grid step.
Edit the Pariss value to: Paris

Data grid - edit Paris Double-click on the Dimension lookup/update step.
Click on th Fields tab and set the following:

RUN the Transformation
Click the Run button in the Canvas Toolbar.
Examine the table in your SQL Query Tool:

Punch through
Edit the Fields tab in the Dimension Lookuo/update step to Punch through:

RUN the Transformation
Click the Run button in the Canvas Toolbar.
Examine the table in your SQL Query Tool:

So whats happened ..
Remember a Punch through updates the fields where the records match ..
All the records have a last_update field .. so this will be updated
As both our Pariss & Paris records have a matching key id=3 then the archived record will be updated with the current city value .. Paris
Last updated
Was this helpful?
