Profiling

Data Profiling

  1. Click the Data Profiling tile.

Data Profiling

The Profiling page opens with the following additional options to configure data profiling.

Profiling options

Data Catalog supports profiling of a wide range of file-based data assets. The following table highlights the major categories and commonly used file types that share a unified profiling interface and results:

| Category | File Types | Additional Information |
| --- | --- | --- |
| Structured files | .csv, .tsv, .psv | Structured files with consistent field delimiters. You can configure header row detection and delimiter type during profiling. |
| Compressed files | .gz, .snappy, .deflate, .bz2, .lzo, .lz4 | |
| Unstructured documents | .pdf, .doc, .docx, .txt, .rtf | Profiling extracts document metadata and textual content, including string detection, summarization, and duplicate detection. |
| Semi-structured files | .parquet, .json, .avro, .orc | Store structured data with embedded or inferable schemas. Profiling includes schema detection, field types, null values, and value frequency analysis. |

Data Profiling

In the Data Profiling process, Data Catalog examines structured data within JDBC data sources and gathers statistics about it. It profiles the data using algorithms that compute detailed properties, including field-level data quality metrics, data statistics, and data patterns.
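To make the idea of field-level statistics concrete, the following is a minimal, illustrative sketch of the kind of metrics a profiler computes for a single column. It is not Data Catalog's implementation; the function name and the digit/letter pattern convention are assumptions for illustration only.

```python
from collections import Counter

def profile_field(values):
    """Compute simple field-level statistics for one column of data.

    Illustrative only -- not Data Catalog's actual algorithm.
    """
    non_null = [v for v in values if v is not None]

    def pattern(v):
        # A common profiling convention: digits -> 9, letters -> A,
        # other characters kept as-is (assumed here for illustration).
        return "".join("9" if c.isdigit() else "A" if c.isalpha() else c
                       for c in str(v))

    return {
        "row_count": len(values),
        "null_count": len(values) - len(non_null),
        "distinct_count": len(set(non_null)),
        "min": min(non_null) if non_null else None,
        "max": max(non_null) if non_null else None,
        "top_pattern": (Counter(pattern(v) for v in non_null).most_common(1)[0][0]
                        if non_null else None),
    }

# Example: profile a small column of product codes.
stats = profile_field(["A1", "B2", None, "C3"])
```

A real profiler applies many such computations per field, in parallel, over sampled data.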

When configuring data profiling, it is best practice to use the default settings, which are suitable for most situations. With the default settings, profiling is limited to 500,000 rows.

Field
Description

Extract samples

Extracts the sample data during profiling and displays it in the summary tab.

Skip Recent (days)

Skips profiling for recently profiled tables. For example, if the days field is set to 7, any table profiled within the last 7 days is skipped.

Sample Type

Specifies the sampling method for profiling:

  • Sample Clause: Profiles a sample of the data based on a percentage or a number of rows.

  • First N Rows: Profiles the first N rows of the data resource.

  • Every Nth Row: Profiles every Nth row of the data resource.

  • Filter: Profiles data using a custom SQL WHERE clause, which helps target specific subsets of data based on user-defined conditions.

    • Where Clause: The SQL condition used for data selection when Filter is enabled. For example, country = 'USA'.

  • Clear: Resets the sampling configuration.

Split Job by Columns

Splits profiling jobs by columns, allowing parallel processing for wide tables.

Columns Per Job

When splitting by columns is enabled, specifies the number of columns included in each job.

Number of Tables Per Job

Specifies the number of tables included in a single profiling job.

Persist Threads

Defines the number of threads used for persisting profiling results to improve performance.

Persist File Threads

Sets the number of threads for persisting profiling data into files for large datasets.

Profile Threads

Indicates the number of threads allocated for profiling tasks, enabling parallel task execution.
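Conceptually, Split Job by Columns with Columns Per Job partitions a wide table so that each profiling job handles a manageable slice of columns. A minimal sketch of that partitioning (illustrative only; the function name is hypothetical, not part of the product):

```python
def split_columns_into_jobs(columns, columns_per_job):
    """Partition a wide table's columns into chunks, one chunk per profiling job.

    Illustrative sketch of the Split Job by Columns behavior.
    """
    if columns_per_job < 1:
        raise ValueError("columns_per_job must be at least 1")
    return [columns[i:i + columns_per_job]
            for i in range(0, len(columns), columns_per_job)]

# Example: a 7-column table with 3 columns per job yields 3 jobs.
jobs = split_columns_into_jobs(["c1", "c2", "c3", "c4", "c5", "c6", "c7"], 3)
```

Each resulting chunk can then be profiled in parallel by the configured Profile Threads.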

The Sample Clause > Rows option is supported only for Microsoft SQL Server and Snowflake data sources.
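The sampling methods described above can be sketched in plain Python to make their behavior concrete. This is an illustrative sketch, not the product's implementation; the function names are hypothetical, and the percentage sample here uses simple random sampling as an assumption.

```python
import random

def first_n(rows, n):
    """First N Rows: profile only the first n rows."""
    return rows[:n]

def every_nth(rows, n):
    """Every Nth Row: profile every nth row, starting with the first."""
    return rows[::n]

def sample_percentage(rows, pct, seed=0):
    """Sample Clause (percentage): a random pct% of the rows.

    Simple random sampling is assumed here for illustration.
    """
    rng = random.Random(seed)
    k = round(len(rows) * pct / 100)
    return rng.sample(rows, k)

def filter_rows(rows, predicate):
    """Filter: stands in for a SQL WHERE clause, e.g. country = 'USA'."""
    return [r for r in rows if predicate(r)]

# Example usage over a toy dataset.
data = list(range(100))
subset = filter_rows(data, lambda r: r % 2 == 0)
```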


  1. Enable Extract Samples.

  2. Set Sample Type to Percentage (30).

Profiling Options

  3. Click Start.

You can view the status of the profiling process on the Workers page.

Workers - Profiling

The dataset has now been scanned, which returns its metadata properties along with some additional details.

Next stage: Explore the data to confirm the details.
