# Data Cleansing

{% hint style="info" %}
This workshop section focuses on demonstrating traditional data cleansing techniques using Pentaho Data Integration.

This dataset contains various issues:

* Duplicate records (John Doe, Alice Johnson)
* Inconsistent phone number formats
* Inconsistent date formats
* Missing values
* Inconsistent address formats
* Removing Duplicates

Your mission, should you wish to accept it .. is to build a workflow / pipeline that resolves the issues .. explain the decisions you have made and any suggest possible enhancements ..
{% endhint %}

```
CustomerID	FirstName	LastName	Email	                Phone	         BirthDate	Address
1	        John	        Doe	        john.doe@email.com	555-123-4567	 1985-03-15	123 Main St, City, CA, 12345
2	        Jane	        Smith	        jane.smith@email.com	(555) 987-6543	 03/22/1990	456 Elm Avenue, Town, AZ, 67890
3	        John	        Doe	        johnd@email.com	        5551234567	 1985-03-15	123 Main Street, City, CO, 12345
4	        Alice	        Johnson	        alice.j@email.com	555-555-5555	 1988-12-01	789 Oak Rd, Village, State, 54321
5	        Bob	        Williams	bob.w@email.com		                 1975-07-30	101 Pine Lane, Hamlet, State, 13579
6	        Emma	        Brown	        emma.brown@email.com	(555)246-8135	 05-19-1992	202 Cedar Blvd, Borough, NY, 24680
7	        Alice	        Johnson	        alice.johnson@email.com	555.555.5555	 12/01/1988	789 Oak Road, Village, FL, 54321
8	        Charlie	        Davis	        charlie.d@email.com	555-369-2587		        303 Maple Dr, City, State, 97531
9		                Taylor	        d.taylor@email.com	555-159-7532	 1982-09-25	404 Birch St, Town, State, 86420
10	        Grace	        Lee	        grace.lee@email.com	5557894561	 11-11-1995	505 Walnut Ave, City, State, 
...
```

{% hint style="info" %}
All the workshop files: ../Databases/CRUID/Workshop - Data Cleansing
{% endhint %}

{% tabs %}
{% tab title="1. Onboard Data" %}
{% hint style="info" %}
First step is to onboard the data .. currently its a CSV file - which could easily be onboarded - instead let's onboard into a table.

There's a number of databases already installed and configured, running as containers.
{% endhint %}

1. Log on to Portainer and check the MariaDB database container is up and running.

{% embed url="<https://localhost:9443/#!/2/docker/containers>" %}
Link to Docker Containers in Portainer
{% endembed %}

2. Execute the following script to create a sourceDB & targetDB databases.

'grant all' to pentaho\_user & pentaho\_admin with the password: 'password'.

```
CREATE DATABASE  IF NOT EXISTS sourceDB;
grant all on sourceDB.* to pentaho_user identified by 'password';
grant all on sourceDB.* to pentaho_admin identified by 'password';

USE sourceDB;

set session sql_mode=replace(@@sql_mode,'NO_ZERO_DATE','');
```

<figure><img src="/files/NZLPNCySNsKNMtln4xie" alt=""><figcaption><p>create sourceDB &#x26; targetDB</p></figcaption></figure>

***

**Transformation**

{% hint style="info" %}
Currently the customer records are in CSV format. It's optional, but, let's onboard into a database table as this is a more likely scenario.
{% endhint %}

<figure><img src="/files/eT7kmKmvzqR1uqDINIim" alt="" width="296"><figcaption><p>tr_onboard</p></figcaption></figure>

{% tabs %}
{% tab title="CSV File input" %}

1. Drag and drop a CSV File input onto the canvas.
2. Double-click to configure the following settings:

<figure><img src="/files/yC9EbNgfTJhPrIHmwZqR" alt=""><figcaption><p>CSV File input - customer_data.csv</p></figcaption></figure>

3. Increase the varchar (length) to prevent truncation.
4. After clicking on 'Get Fields', 'Preview' the data.

<figure><img src="/files/irIRzay0gtVIFek5gf9z" alt=""><figcaption><p>Preview data</p></figcaption></figure>
{% endtab %}

{% tab title="Table output" %}
{% hint style="info" %}
The MariaDB jdbc database driver has been copied to the /lib directory.
{% endhint %}

**Define Database Connection (MariaDB)**

1. In Spoon, click File > New > Transformation. Any one of these actions opens a new Transformation tab for you to begin designing your transformation:
   * By clicking New, then Transformation
   * By using the CTRL-N hot key
2. From within Spoon, Select:

File > New > Database Connection The Database Connection dialog box appears.

<figure><img src="/files/N756WJLwzYvAkNB7Gq9Y" alt=""><figcaption><p>MariaDB: sourceDB connection</p></figcaption></figure>

3. Enter the following details:

<table><thead><tr><th width="193">Section Name</th><th>Value</th></tr></thead><tbody><tr><td><strong>Connection Name</strong></td><td>MariaDB: sourceDB</td></tr><tr><td><strong>Connection Type</strong></td><td><mark style="color:red;"><strong>MySQL</strong></mark></td></tr><tr><td><strong>Host Name</strong></td><td>localhost</td></tr><tr><td><strong>Database Name</strong></td><td>sourceDB</td></tr><tr><td><strong>User name</strong></td><td>pentaho_admin</td></tr><tr><td><strong>Password</strong></td><td>password</td></tr></tbody></table>

4. Click Test.
5. Configure the Table ouput step with the following settings:

<figure><img src="/files/wvx52vGix0NoEOS1BMq9" alt="" width="473"><figcaption><p>create source_customer table</p></figcaption></figure>

6. Click on the SQL button & Execute.

<figure><img src="/files/0bOaEhRNMKZSlbKjcXvT" alt="" width="365"><figcaption><p>Execute SQL script</p></figcaption></figure>

7. Close all the windows & save / RUN transformation.

<figure><img src="/files/ouVGsRldplVaWUIhavo4" alt=""><figcaption><p>RUN Transformation</p></figcaption></figure>

8. Check the table has been created and populated with customer data.

<figure><img src="/files/O9lPHVdEw9nrhFoaVmmw" alt=""><figcaption><p>souurce_customer table</p></figcaption></figure>
{% endtab %}
{% endtabs %}
{% endtab %}

{% tab title="2. Data Cleansing" %}
{% hint style="info" %}
Lets' breakdown the data cleansing workflows that need to be applied:

Resolve FirstName
{% endhint %}

x

x

x

{% tabs %}
{% tab title="FirstName" %}
{% hint style="info" %}
Let's start with resolving the FirstName.

Luckily we access to the National Address Database (NAD), which is a aggregation of data provided by state, local, and tribal governments.
{% endhint %}

<figure><img src="/files/RGxmGt8K9bUUtyLdJhfy" alt=""><figcaption><p>Resolve FirstName</p></figcaption></figure>

**If field value is null**

{% hint style="info" %}
The step "If field value is null" is able to replace nulls by a given value either by

1. Processing the complete row with all fields
2. Processing the complete row but only for specific field types (Number, String, Date etc.)
3. Processing the complete row but only for specific fields by name

Changing a numeric field type to an empty string field type can cause an error in the subsequent steps. The error will not appear until the next step accesses that field.
{% endhint %}

<figure><img src="/files/l57Lg1jBwjNyWMQct0ql" alt=""><figcaption><p>Replace NULL values</p></figcaption></figure>

***

**Filter FN**

{% hint style="info" %}
Replacing NULL values with 'Unknown' enables the FirstName field to be filtered.

You could also modify the Filter condition to: NOT
{% endhint %}

<figure><img src="/files/NXFW3lS2k4rdSgPoLNqE" alt=""><figcaption><p>Filter: FirstName = Unknown</p></figcaption></figure>

***

**Stream lookup FN**

{% hint style="info" %}
If the condition is TRUE, i.e FirstName = Unknown then send the row to Stream lookup FN.
{% endhint %}

<figure><img src="/files/2zsBTCeMaGixwd9nwZbo" alt=""><figcaption><p>Lookup FirstName and set as ResolvedFisrtName</p></figcaption></figure>

{% hint style="info" %}
Ok .. I know this is a training lab ..!

The Data grid (FisrtName) has customer data that could be coming from another system.

Fingers crossed the CustomerID references the same customer.

The key fields - hashed - ensure that the row returned 'matches' the row in the data stream.
{% endhint %}

***

Select values FN

{% hint style="info" %}
As we know the FirstName = Unknown, however, based on the lookup criteria its matched Resolved FirstName = David to the record. To 'map' the ResolvedFirstName to the FirstName data stream field, simply rename ..

So .. now when you merge the record back into the 'main' stream
{% endhint %}

<figure><img src="/files/nIgMAIBVP9OZVScW8FIW" alt=""><figcaption></figcaption></figure>

x

x
{% endtab %}

{% tab title="Second Tab" %}
x
{% endtab %}
{% endtabs %}

x
{% endtab %}
{% endtabs %}


---

# Agent Instructions: Querying This Documentation

If you need additional information that is not directly available in this page, you can query the documentation dynamically by asking a question.

Perform an HTTP GET request on the current page URL with the `ask` query parameter:

```
GET https://academy.pentaho.com/pentaho-data-integration/data-integration/data-sources/databases/cruid/data-cleansing.md?ask=<question>
```

The question should be specific, self-contained, and written in natural language.
The response will contain a direct answer to the question and relevant excerpts and sources from the documentation.

Use this mechanism when the answer is not explicitly present in the current page, you need clarification or additional context, or you want to retrieve related documentation sections.
