# Data Sources

### Choose a data source

Use this page to orient yourself, then open the tab for your source type and follow its link to the connector docs.

{% hint style="info" %}

#### What “data source” means in PDI

In practice, a data source is either:

* A file format you parse (CSV, Excel, JSON, XML).
* A service you connect to (a database, object store, cluster, or API).

{% endhint %}

{% tabs %}
{% tab title="Flat Files" %}
{% hint style="info" %}

#### Flat files

Use flat files when your data arrives as CSV, TXT, fixed-width, JSON, or XML.

Start here: [Flat Files](https://academy.pentaho.com/pentaho-data-integration/data-integration/data-sources/flat-files).
{% endhint %}

<figure><img src="https://3680356391-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2FZpCSy6Skj215f4oWypdc%2Fuploads%2Fgit-blob-d9202e5264254b82fb470468c4b9b37cef40bfbc%2Fflat_files.png?alt=media" alt=""><figcaption></figcaption></figure>

{% tabs %}
{% tab title="Structured" %}
{% hint style="info" %}

#### Structured

Structured data uses a predefined model. It is easy to validate and query.

Think tables, rows, and columns. Examples include SQL databases and well-formed CSV files.
{% endhint %}
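
To make "easy to validate" concrete, here is a minimal sketch in plain Python (not a PDI step) that checks CSV rows against a predefined model. The `SCHEMA` mapping, the `validate_rows` helper, and the sample data are hypothetical names invented for this illustration.

```python
import csv
import io

# Hypothetical schema for illustration: column name -> converter.
SCHEMA = {"id": int, "name": str, "amount": float}


def validate_rows(csv_text, schema):
    """Parse CSV text and split rows into valid (typed) and rejected."""
    reader = csv.DictReader(io.StringIO(csv_text))
    # A predefined model means the header itself can be checked up front.
    if reader.fieldnames != list(schema):
        raise ValueError(f"unexpected header: {reader.fieldnames}")
    valid, rejected = [], []
    for row in reader:
        try:
            valid.append({col: cast(row[col]) for col, cast in schema.items()})
        except (TypeError, ValueError):
            # The row does not fit the model, so keep it aside for review.
            rejected.append(row)
    return valid, rejected


# Sample data, invented for this sketch.
SAMPLE = "id,name,amount\n1,Ada,10.5\nx,Bob,3\n"
valid, rejected = validate_rows(SAMPLE, SCHEMA)
```

Because every row must match the same columns and types, a bad row (here, the one whose `id` is `x`) can be caught before it reaches downstream steps.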
{% endtab %}

{% tab title="Unstructured" %}
{% hint style="info" %}

#### Unstructured

Unstructured data has no consistent schema. Examples include PDFs, images, video, and free-form text.

You typically need parsing, extraction, or ML to use it.
{% endhint %}
{% endtab %}

{% tab title="Semi-structured" %}
{% hint style="info" %}

#### Semi-structured

Semi-structured data has a loose schema. It uses tags or keys to describe fields and hierarchy.

Common formats are JSON and XML.
{% endhint %}
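
As a rough illustration of how keys describe fields and hierarchy, the sketch below flattens a nested JSON record into a single row. It is plain Python rather than PDI's own mechanism, and the `flatten` helper and the sample document are hypothetical.

```python
import json


def flatten(record, prefix=""):
    """Flatten nested dicts into one level, joining keys with dots."""
    flat = {}
    for key, value in record.items():
        path = f"{prefix}{key}"
        if isinstance(value, dict):
            # Recurse into the nested object, extending the key path.
            flat.update(flatten(value, prefix=f"{path}."))
        else:
            flat[path] = value
    return flat


# Sample document, invented for this sketch.
DOC = '{"id": 7, "customer": {"name": "Ada", "address": {"city": "Leeds"}}}'
row = flatten(json.loads(DOC))
```

The dotted key paths (for example `customer.address.city`) preserve the hierarchy while producing the flat field layout that tabular processing expects.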
{% endtab %}

{% tab title="Metadata" %}
{% hint style="info" %}

#### Metadata

Metadata is “data about data”. Examples include headers, schemas, and data dictionaries.
{% endhint %}
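
One way to see metadata in practice is to derive a small data dictionary from a file's header and values. The sketch below is plain Python under simple assumptions (only `int`, `float`, and `str` types are considered); `infer_type`, `describe`, and the sample data are hypothetical names for this illustration.

```python
import csv
import io


def infer_type(values):
    """Return the narrowest of int, float, str that fits every value."""
    for name, cast in (("int", int), ("float", float)):
        try:
            for v in values:
                cast(v)
            return name
        except ValueError:
            continue
    return "str"


def describe(csv_text):
    """Build a small data dictionary: column name -> inferred type."""
    rows = list(csv.reader(io.StringIO(csv_text)))
    header, data = rows[0], rows[1:]
    columns = zip(*data)  # transpose rows into columns
    return {name: infer_type(col) for name, col in zip(header, columns)}


# Sample data, invented for this sketch.
SAMPLE = "id,name,amount\n1,Ada,10.5\n2,Bob,3\n"
schema = describe(SAMPLE)
```

The resulting mapping describes the data without containing any of it, which is the sense in which headers and schemas are "data about data".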
{% endtab %}
{% endtabs %}
{% endtab %}

{% tab title="Databases" %}
{% hint style="info" %}

#### Databases

Pentaho connects to databases primarily through JDBC drivers, which handle the communication between PDI and the database.

Start here: [Databases](https://academy.pentaho.com/pentaho-data-integration/data-integration/data-sources/databases).
{% endhint %}

<figure><img src="https://3680356391-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2FZpCSy6Skj215f4oWypdc%2Fuploads%2Fc5ICaWIzr71oSSsj5XpQ%2Fimage.png?alt=media&#x26;token=9d47a74d-43d1-4221-8be0-b78df52e0151" alt=""><figcaption><p>Database Connection</p></figcaption></figure>
{% endtab %}

{% tab title="Storage" %}
{% hint style="info" %}

#### Storage

Storage sources are cloud or network repositories. Examples include Amazon S3, Azure Blob Storage, and Google Cloud Storage.

In PDI, you typically connect through VFS (Virtual File System), which lets you read and write files across on-premises and cloud environments.

Start here: [Storage](https://academy.pentaho.com/pentaho-data-integration/data-integration/data-sources/storage).
{% endhint %}

<figure><img src="https://3680356391-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2FZpCSy6Skj215f4oWypdc%2Fuploads%2FwndCXLBypoOvlRN2vZXV%2Fimage.png?alt=media&#x26;token=42a51335-a15c-4d07-87b5-ab0a8ea81312" alt=""><figcaption></figcaption></figure>
{% endtab %}

{% tab title="Big Data" %}
{% hint style="info" %}

#### Big Data

Big data sources typically run on distributed platforms. Common examples are Hadoop (HDFS, Hive, HBase), Spark, NoSQL databases, and Kafka.

PDI provides specialized steps and adapters for these platforms. This lets you transform data where it lives.

Start here: [Big Data](https://academy.pentaho.com/pentaho-data-integration/data-integration/data-sources/big-data).
{% endhint %}

<figure><img src="https://3680356391-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2FZpCSy6Skj215f4oWypdc%2Fuploads%2FhjNCONeNdBBjeFUWjITh%2Fimage.png?alt=media&#x26;token=e27af85f-dd97-4507-b3a9-dbcfe46f657c" alt=""><figcaption><p>Types of Big Data</p></figcaption></figure>
{% endtab %}

{% tab title="Jupyter Notebook" %}
{% hint style="info" %}

#### Jupyter Notebook

Jupyter is a web-based notebook for code, visuals, and narrative text. It works well for exploratory analysis and prototyping.

In a PDI workflow, notebooks often handle advanced analysis. PDI handles production orchestration and scheduled pipelines.

Start here: [Jupyter Notebook](https://academy.pentaho.com/pentaho-data-integration/data-integration/data-sources/jupyter-notebook).
{% endhint %}

<figure><img src="https://3680356391-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2FZpCSy6Skj215f4oWypdc%2Fuploads%2F8VP33zd3JmcID4WYN31n%2Fimage.png?alt=media&#x26;token=1a54c01a-cbb1-4961-a666-4e611193738f" alt=""><figcaption><p>Jupyter Notebook</p></figcaption></figure>
{% endtab %}
{% endtabs %}
