# Profiling

{% hint style="success" %}

#### Data Profiling

Before diving into data analytics within a Data Catalog, it's critical to ensure your data has been properly profiled. Data profiling is not just a step but a prerequisite. Without a valid data profile, proceeding to data identification tasks is premature.

**What is Data Profiling?**

Data profiling involves a detailed examination of your dataset, focusing on specific data objects. During this process, you collect valuable statistics and summaries that provide insights into the nature of your data. The beauty of data profiling is its immediacy. As soon as data from a column, table, or schema is processed, the results are ready for review.

When a data source schema is initially ingested, you're limited to the database metadata. Data profiling enriches this by uncovering additional layers of information, making it an indispensable tool in the data analyst's toolkit.
{% endhint %}
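To illustrate the distinction, here is a minimal sketch using SQLite (the `person` table and its columns are invented for this example): schema ingestion sees only the declared names and types, while profiling computes statistics from the data itself.

```python
import sqlite3

# Hypothetical table for illustration only; not part of the Data Catalog API.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE person (id INTEGER, first_name TEXT)")
conn.executemany("INSERT INTO person VALUES (?, ?)",
                 [(1, "Ada"), (2, "Grace"), (3, None), (4, "Ada")])

# Schema ingestion alone only sees database metadata: names and declared types.
metadata = conn.execute("PRAGMA table_info(person)").fetchall()
print([(col[1], col[2]) for col in metadata])
# [('id', 'INTEGER'), ('first_name', 'TEXT')]

# Profiling enriches this by computing statistics from the data itself.
rows, nulls, distinct = conn.execute(
    "SELECT COUNT(*), COUNT(*) - COUNT(first_name), COUNT(DISTINCT first_name)"
    " FROM person"
).fetchone()
print(rows, nulls, distinct)  # 4 1 2
```

The row count, null count, and cardinality are exactly the kinds of summaries that become available as soon as a column or table is processed.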

<figure><img src="https://1051758685-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2Fw1qJj4OGmdcvowiklB9W%2Fuploads%2F2SBvCQjEsYLP6nv4QDUE%2Fimage.png?alt=media&#x26;token=6cfe159e-c44c-406a-a276-66602c35d915" alt=""><figcaption><p>Data Profiling</p></figcaption></figure>

{% hint style="warning" %}
If data profiling has not been run, the Data Catalog highlights it as 'Required'.
{% endhint %}

1. Click the Data Profiling tile.

<figure><img src="https://1051758685-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2Fw1qJj4OGmdcvowiklB9W%2Fuploads%2FbKa9njQiteUKJSpTdjUx%2Fimage.png?alt=media&#x26;token=4621e2f9-230e-4777-a71f-cf636b9d46ff" alt=""><figcaption><p>Data Profiling</p></figcaption></figure>

The Profiling page opens with the following additional options to configure data profiling.

<figure><img src="https://1051758685-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2Fw1qJj4OGmdcvowiklB9W%2Fuploads%2FEOHr141VYzQdasaZR33p%2Fimage.png?alt=media&#x26;token=2ece1b3d-ee52-484c-99a0-878958296ee7" alt=""><figcaption><p>Profiling options</p></figcaption></figure>

{% hint style="info" %}
Data Catalog supports profiling of a wide range of file-based data assets. The following table highlights the major categories and commonly used file types that share a unified profiling interface and results:
{% endhint %}

<table><thead><tr><th width="169">Category</th><th width="176">File Types</th><th>Additional Information</th></tr></thead><tbody><tr><td>Structured files</td><td><code>.csv</code>, <code>.tsv</code>, <code>.psv</code></td><td>Structured files with consistent field delimiters. You can configure header row detection and delimiter type during profiling.</td></tr><tr><td>Compressed files</td><td><code>.gz</code>, <code>.snappy</code>, <code>.deflate</code>, <code>.bz2</code>, <code>.lzo</code>, <code>.lz4</code></td><td></td></tr><tr><td>Unstructured documents</td><td><code>.pdf</code>, <code>.doc</code>, <code>.docx</code>, <code>.txt</code>, <code>.rtf</code></td><td>Profiling extracts document metadata and textual content. Includes string detection, summarization, and duplicate detection.</td></tr><tr><td>Semi-structured files</td><td><code>.parquet</code>, <code>.json</code>, <code>.avro</code>, <code>.orc</code></td><td>Formats such as <code>.parquet</code> and <code>.orc</code> store data in columnar form, while <code>.json</code> and <code>.avro</code> are record-oriented. Profiling includes schema detection, field types, null values, and value frequency analysis.</td></tr></tbody></table>
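For structured files, delimiter detection can be sketched with Python's standard `csv.Sniffer` (an illustration only; the profiler's own detection logic may differ):

```python
import csv
import io

# A pipe-delimited (.psv-style) sample, invented for this sketch.
sample = "name|country\nAda|UK\nGrace|USA\n"

# Sniff the delimiter from a candidate set, then parse with the result.
dialect = csv.Sniffer().sniff(sample, delimiters=",|\t")
rows = list(csv.reader(io.StringIO(sample), dialect=dialect))
print(dialect.delimiter, rows[0])  # | ['name', 'country']
```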

{% tabs %}
{% tab title="1. Profiling" %}
{% hint style="info" %}

#### Data Profiling

In the **Data Profiling** process, Data Catalog examines structured data within JDBC data sources and gathers statistics about it. It profiles data using algorithms that compute detailed properties, including field-level data quality metrics, data statistics, and data patterns.

When configuring data profiling, it is best practice to use the default settings, as they suit most situations. With the defaults, data profiling is limited to 500,000 rows.
{% endhint %}

<table><thead><tr><th width="160">Field</th><th>Description</th></tr></thead><tbody><tr><td><strong>Extract samples</strong></td><td>Extracts the sample data during profiling and displays it in the summary tab.</td></tr><tr><td><strong>Skip Recent (days)</strong></td><td>Skips profiling for recently profiled tables. For example, if the days field is set to 7, any table profiled within the last 7 days is skipped.</td></tr><tr><td><strong>Sample Type</strong></td><td><p>Specifies the sampling method for profiling:</p><ul><li><strong>Sample Clause</strong>: Profiles a sample of the data based on a <strong>percentage</strong> or a number of <strong>rows</strong>.</li><li><strong>First N Rows</strong>: Profiles the first N rows of the data resource.</li><li><strong>Every Nth Row</strong>: Profiles every Nth row of the data resource.</li><li><p><strong>Filter</strong>: Profiles data using a custom SQL <code>WHERE</code> clause, which targets specific subsets of data based on user-defined conditions.</p><ul><li><strong>Where Clause</strong>: SQL condition used for data selection when Filter is enabled. For example, <code>country = 'USA'</code>.</li></ul></li><li><strong>Clear</strong>: Resets the sampling configuration.</li></ul></td></tr><tr><td><strong>Split Job by Columns</strong></td><td>Splits profiling jobs by columns, allowing parallel processing for wide tables.</td></tr><tr><td><strong>Columns Per Job</strong></td><td>When splitting by columns is enabled, specifies the number of columns included in each job.</td></tr><tr><td><strong>Number of Tables Per Job</strong></td><td>Specifies the number of tables included in a single profiling job.</td></tr><tr><td><strong>Persist Threads</strong></td><td>Defines the number of threads used for persisting profiling results to improve performance.</td></tr><tr><td><strong>Persist File Threads</strong></td><td>Sets the number of threads for persisting profiling data into files for large datasets.</td></tr><tr><td><strong>Profile Threads</strong></td><td>Indicates the number of threads allocated for profiling tasks, enabling parallel task execution.</td></tr></tbody></table>
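The row-selection methods above can be illustrated in plain Python (the rows and the `country` field are invented for this sketch; the catalog applies these selections server-side):

```python
from itertools import islice

# Hypothetical data resource: 12 rows with an illustrative 'country' field.
rows = [{"id": i, "country": "USA" if i % 3 == 0 else "UK"}
        for i in range(1, 13)]

first_n = list(islice(rows, 4))          # First N Rows (N = 4)
every_nth = rows[::3]                    # Every Nth Row (N = 3)
filtered = [r for r in rows
            if r["country"] == "USA"]    # Filter: country = 'USA'

print(len(first_n), len(every_nth), len(filtered))  # 4 4 4
```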

{% hint style="warning" %}
The **Sample Clause > Rows** option is only supported for Microsoft SQL and Snowflake data sources.
{% endhint %}

***

1. Enable: Extract Samples.
2. Set Sample Type: Percentage (30)

<figure><img src="https://1051758685-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2Fw1qJj4OGmdcvowiklB9W%2Fuploads%2FPERuTdWJrZNbr02NJM8G%2Fimage.png?alt=media&#x26;token=1262e1d3-8463-4958-9cc4-60693772ebae" alt=""><figcaption><p>Profiling Options</p></figcaption></figure>

{% hint style="warning" %}
Keep the default values for the other options.

Profiling will use reservoir sampling to sample 30% of the total records.
{% endhint %}
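The hint above mentions reservoir sampling. A minimal sketch of the classic Algorithm R follows (a generic illustration, not the catalog's actual implementation):

```python
import random

def reservoir_sample(stream, k, rng=None):
    """Keep a uniform random sample of k items from a stream of unknown
    length, holding only k items in memory (Algorithm R)."""
    rng = rng or random.Random()
    reservoir = []
    for i, item in enumerate(stream):
        if i < k:
            reservoir.append(item)          # fill the reservoir first
        else:
            j = rng.randrange(i + 1)        # item survives with prob k/(i+1)
            if j < k:
                reservoir[j] = item
    return reservoir

# Sampling 30% of a known row count: k = 0.3 * total.
sample = reservoir_sample(range(1000), 300, random.Random(42))
print(len(sample))  # 300
```

Because each incoming item replaces a reservoir slot with probability k/(i+1), every record ends up in the sample with equal probability, which is what makes a fixed 30% sample statistically representative.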

The Extract Samples box is selected by default, both in Profiling and in the Profiling section of Data Discovery, so users can initiate a profiling exercise without having to remember to tick it.

<figure><img src="https://1051758685-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2Fw1qJj4OGmdcvowiklB9W%2Fuploads%2FfmSMGvNhIarSnFRodqfd%2Fimage.png?alt=media&#x26;token=daeef0c5-a6c0-4891-bbd1-8a3431cb7e19" alt=""><figcaption></figcaption></figure>

3. Click: 'Start'.&#x20;

You can view the status of the **Profiling** process on the Workers page.

<figure><img src="https://1051758685-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2Fw1qJj4OGmdcvowiklB9W%2Fuploads%2FrtNoY02RimSW2lhwjej5%2Fimage.png?alt=media&#x26;token=c9faf9f7-c016-4346-b3ac-7c4edcc0db7d" alt=""><figcaption><p>Workers - Profiling</p></figcaption></figure>

{% hint style="info" %}
The dataset has now been scanned, which returns the metadata properties along with some additional details.

Next stage: Explore the data to confirm the details.&#x20;
{% endhint %}
{% endtab %}

{% tab title="2. Profiled - Schemas & Tables" %}
{% hint style="info" %}

#### Profiled Data

If you select a data source, you can view its available items, such as schemas for structured data and folders for file systems. You can also see the number of tables and columns, including associated tags. For schemas, additional information such as the row count and the last profiled date and time is available.
{% endhint %}

1. Navigate to: Person Schema
2. Enter a description:&#x20;

<table><thead><tr><th width="216">Schema</th><th width="361">Description</th><th>Sensitivity</th></tr></thead><tbody><tr><td>HumanResources</td><td>Contains tables related to employee information, departmental structures, payroll history, and job candidates for the fictional company's workforce management.</td><td>HIGH</td></tr><tr><td>Person</td><td>Stores information about human beings involved with Adventure Works including employees, customer contacts, and vendor contacts, along with their addresses, phone numbers, and email information.</td><td>HIGH</td></tr><tr><td>Production</td><td>Manages data related to products sold or used in manufacturing, including product details, bills of materials, work orders, and manufacturing processes.</td><td>MEDIUM</td></tr><tr><td>Purchasing</td><td>Handles vendor relationships, purchase orders, product sourcing, and supplier management for the company's procurement operations.</td><td>MEDIUM</td></tr><tr><td>Sales</td><td>Contains all sales-related data including customer information, sales orders, sales territories, sales representatives, credit cards, and revenue tracking.</td><td>HIGH</td></tr></tbody></table>

3. Data Discovery also determined the level of Sensitivity. Set the level for each schema.
4. Data Discovery also calculated a Confidence Score. Set the number of stars for each schema.

<figure><img src="https://1051758685-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2Fw1qJj4OGmdcvowiklB9W%2Fuploads%2FKMCPxLnBcBcehPO1VCFm%2Fimage.png?alt=media&#x26;token=a6b32062-54c5-4f73-839d-2c82633b6137" alt=""><figcaption><p>Schemas</p></figcaption></figure>

{% hint style="info" %}
System Information: highlights when the Schema was last scanned.

Properties: 15 columns
{% endhint %}

5. Click: Details tab.

<figure><img src="https://1051758685-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2Fw1qJj4OGmdcvowiklB9W%2Fuploads%2F9eAdsjrCddmOqyfMyBqn%2Fimage.png?alt=media&#x26;token=d29524be-ecf9-453a-8151-6f49a47a27b5" alt=""><figcaption><p>Schema - Person details</p></figcaption></figure>

{% hint style="info" %}
Information on the number of rows and the data profiling date has been calculated.
{% endhint %}

Let's dig a bit deeper and see what other stats are available.

6. Expand the Person schema > Person table.
7. Select: Summary tab.
8. Enter a description (only 70 tables left):

```
The Person.Person table stores fundamental personal information for all human beings involved with Adventure Works, including employees, customer contacts, and vendor contacts, serving as the central repository for individual identity data such as names, titles, and demographic information that links to other business entities throughout the database.
```

9. Verify the Data Lineage.

<figure><img src="https://1051758685-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2Fw1qJj4OGmdcvowiklB9W%2Fuploads%2FTqI6xYsGnJhfmlIUMgvB%2Fimage.png?alt=media&#x26;token=2e17ba63-094c-42cc-a0dd-5b488100a7f5" alt=""><figcaption><p>Table - Person.Person</p></figcaption></figure>

{% hint style="info" %}
The Summary tab displays some extra statistics:

Last Successful Profiled date

Statistics - 13 Columns & 5,881 Rows

PDC Lineage - sqlserver: adventureworks2022 > Sales.SalesPerson table
{% endhint %}

10. Click on: Details

<figure><img src="https://1051758685-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2Fw1qJj4OGmdcvowiklB9W%2Fuploads%2FODaMi1KXfihCKAYqU9qZ%2Fimage.png?alt=media&#x26;token=32e3e081-2465-4a05-b761-941942ba285a" alt=""><figcaption></figcaption></figure>

<figure><img src="https://1051758685-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2Fw1qJj4OGmdcvowiklB9W%2Fuploads%2F4sRYIKQwdWAO7kvzIRmt%2Fimage.png?alt=media&#x26;token=0591e86a-a1b3-4308-aeb9-59cb0892e949" alt=""><figcaption><p>SalesPerson table details</p></figcaption></figure>

{% hint style="info" %}
PK:FK keys identified.

Next: Profiled Columns - Data Patterns
{% endhint %}
{% endtab %}

{% tab title="3. Profiled - Columns" %}
{% hint style="info" %}

#### Sampled Data

{% endhint %}

1. In the Data Canvas, select: Person > Person.
2. Expand the Person table > FirstName.

<figure><img src="https://1051758685-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2Fw1qJj4OGmdcvowiklB9W%2Fuploads%2FlJxI4yA5rErkM7ew9yEI%2Fimage.png?alt=media&#x26;token=74a960a8-9659-42f3-8db0-58c620246607" alt=""><figcaption><p>Sales - Person.Person - FirstName column</p></figcaption></figure>

{% hint style="info" %}

#### Statistics

The following table defines the data profiling terms:
{% endhint %}

<table><thead><tr><th width="210">Statistic</th><th>Description</th></tr></thead><tbody><tr><td>Stdev Length</td><td>Standard deviation of the character lengths of the values, indicating how much value lengths vary around the average length.</td></tr><tr><td>Blanks</td><td>Empty cells that contain no characters (not even a space). Count of cells with literally nothing in them.</td></tr><tr><td>Avg Length</td><td>Average number of characters in the values. For example, if you have "Cat" (3), "Dog" (3), "Bird" (4), the average length is 3.33.</td></tr><tr><td>Bytes</td><td>Total storage size of the data in bytes. Shows how much memory/disk space the data occupies.</td></tr><tr><td>Min Length</td><td>Shortest value in the column by character count. If your shortest entry is "NY", that's 2 characters.</td></tr><tr><td>Nulls</td><td>Missing or undefined values (database NULL). Different from blanks - these are explicitly marked as "no data."</td></tr><tr><td>Rows</td><td>Total number of records/entries in the dataset.</td></tr><tr><td>Cardinality</td><td>Number of unique/distinct values. If a column has 1000 rows but only 5 different values, cardinality = 5. High cardinality = many unique values; low cardinality = few unique values.</td></tr><tr><td>Max Length</td><td>Longest value in the column by character count. If your longest entry is "Massachusetts" that's 13 characters.</td></tr></tbody></table>
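A runnable sketch of how these statistics relate for a single column of values (the sample values are invented for illustration; note how nulls and blanks are counted separately):

```python
values = ["Ada", "", None, "Grace", "Ada", "NY"]

non_null = [v for v in values if v is not None]
lengths = [len(v) for v in non_null if v]  # character counts of non-empty values

stats = {
    "Rows": len(values),                       # total entries
    "Nulls": sum(v is None for v in values),   # explicitly missing (database NULL)
    "Blanks": sum(v == "" for v in non_null),  # present but empty cells
    "Cardinality": len(set(non_null)),         # distinct non-null values (blanks included)
    "Min Length": min(lengths),
    "Max Length": max(lengths),
    "Avg Length": round(sum(lengths) / len(lengths), 2),
}
print(stats)
```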
{% endtab %}
{% endtabs %}
