Flat files, databases, storage, big data, and notebooks.
Use this page to orient yourself. Then jump into the specific connector docs:
In practice, a data source is either:
A file format you parse (CSV, Excel, JSON, XML).
A service you connect to (a DB, object store, cluster, or API).
Use flat files when your data arrives as CSV, TXT, fixed-width, JSON, or XML.
Start here: Flat Files.
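Parsing a flat file usually means mapping each record into fields. As a minimal sketch (sample data and field names are hypothetical, and Python's standard library stands in for PDI's input steps), CSV and JSON parsing look like this:

```python
import csv
import io
import json

# Hypothetical sample data standing in for a flat file on disk.
csv_text = "id,name,amount\n1,widget,9.99\n2,gadget,4.50\n"

# csv.DictReader maps each row to a dict keyed by the header line.
rows = list(csv.DictReader(io.StringIO(csv_text)))
print(rows[0]["name"])

# JSON parses directly into native dicts and lists.
record = json.loads('{"id": 1, "tags": ["a", "b"]}')
print(record["tags"])
```

The same pattern applies to the other flat formats: fixed-width files are sliced by column offsets, and XML is walked as a tree.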
Structured data uses a predefined model. It is easy to validate and query.
Think tables, rows, and columns. Examples include SQL databases and well-formed CSV files.
Unstructured data has no consistent schema. Examples include PDFs, images, video, and free-form text.
You typically need parsing, extraction, or ML to use it.
Semi-structured data has a loose schema. It uses tags or keys to describe fields and hierarchy.
Common formats are JSON and XML.
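To make "tags or keys describe fields and hierarchy" concrete, here is a sketch of the same hypothetical record expressed both ways, parsed with Python's standard library:

```python
import json
import xml.etree.ElementTree as ET

# The same hypothetical record as JSON (keys) and XML (tags).
json_doc = '{"order": {"id": 7, "items": [{"sku": "A1"}, {"sku": "B2"}]}}'
xml_doc = "<order><id>7</id><items><item sku='A1'/><item sku='B2'/></items></order>"

# JSON: hierarchy is nested dicts and lists.
order = json.loads(json_doc)["order"]
skus_json = [item["sku"] for item in order["items"]]

# XML: hierarchy is nested elements; fields are tags and attributes.
root = ET.fromstring(xml_doc)
skus_xml = [item.get("sku") for item in root.find("items")]

print(skus_json, skus_xml)
```

Both documents carry the same data; the schema is loose because nothing outside the document enforces which keys or tags must appear.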
Metadata is “data about data”. Examples include headers, schemas, and data dictionaries.
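A CSV header row is a small example of metadata you meet every day: it describes the columns rather than holding values. A sketch, with hypothetical column names:

```python
import csv
import io

csv_text = "id,name,amount\n1,widget,9.99\n"

reader = csv.DictReader(io.StringIO(csv_text))
# fieldnames is metadata: it describes the data's columns, not the data itself.
print(reader.fieldnames)
```

A schema or data dictionary plays the same role at a larger scale, adding types, units, and descriptions for each field.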
Pentaho connects to databases primarily through JDBC drivers, which act as the interface between PDI and each database engine.
Start here: Databases.
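The driver pattern is the same everywhere: you open a connection with a URL, send SQL, and read results, while the driver translates those calls into the database's wire protocol. As an illustration only (PDI itself loads JDBC drivers in Java; Python's built-in sqlite3 is just a stand-in here), the connect-and-query flow looks like this:

```python
import sqlite3

# sqlite3 stands in for a JDBC driver: the code speaks generic
# connect/execute calls, and the driver handles the database specifics.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (id INTEGER, amount INTEGER)")
conn.executemany("INSERT INTO sales VALUES (?, ?)", [(1, 10), (2, 5)])

total = conn.execute("SELECT SUM(amount) FROM sales").fetchone()[0]
print(total)
conn.close()
```

Swapping databases means swapping the driver and connection URL; the transformation logic stays the same.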
Storage sources are cloud or network repositories. Examples include Amazon S3, Azure Blob Storage, and Google Cloud Storage.
In PDI, you typically connect through VFS (Virtual File System), so the same read and write steps work across hybrid environments.
Start here: Storage.
Big data sources require distributed compute. Common examples are Hadoop (HDFS, Hive, HBase), Spark, NoSQL, and Kafka.
PDI provides specialized steps and adapters for these platforms. This lets you transform data where it lives.
Start here: Big Data.
Jupyter is a web-based notebook for code, visuals, and narrative text. It works well for exploratory analysis and prototyping.
In a PDI workflow, notebooks often handle advanced analysis. PDI handles production orchestration and scheduled pipelines.
Start here: Jupyter Notebook.