Hierarchical Data Type

handling Hierarchical Data Types - JSON & YAML ..

A hierarchical data type is a data type that represents a hierarchical structure of data, where each data element has a parent-child relationship with other data elements. A hierarchical data type can be used to store and query data that is organized in a tree-like fashion, such as organizational charts, file systems, or taxonomies.

A hierarchical data type has some advantages, such as compactness, depth-first ordering, and support for arbitrary insertions and deletions. However, it also has some limitations, such as the need for application logic to maintain the tree structure, the difficulty of handling multiple parents or complex relationships, and the lack of standardization across different database systems.

A common example is employees and managers: employees and managers are both employees of a company. A manager can have employees they manage, and can also have a manager themselves.

Hierarchical Data Type (HDT) is a new datatype in PDI for handling structured/complex/nested datatype based on JSON / YAML (v10.1 release) format.

There are 7 new plugins/steps:

• Hierarchical JSON Input - is used to get data in HDT from file / previous steps and convert it into JSON formatted string.

• Hierarchical JSON Output -

• Hierarchical YAML Input - is used to get data in HDT from file / previous steps and convert it into ? formatted string.

• Hierarchical YAML Output -

• Extract to Rows -

• Modify values from a single row -

• Modify values from grouped rows -

As part of the Pentaho Data Integration & Analytics plugin release journey to decouple plugins from the core Pentaho Server, Pentaho EE 9.5 GA is releasing new plugins and enhancements to its existing plugin collection.

Log into the 'Pentaho Support Portal' and download the plugin.

Select the Pentaho version.

Download selected plugin(s).

Extract HDT plugin.

cd
cd ~/Downloads
unzip hierarchical-datatype-plugin-10.1.0.0-317-dist.zip .

Install HDT plugin.

cd
cd ~/Downloads/hierarchical-datatype-plugin-10.1.0.0-317-dist/hierarchical-datatype-plugin-10.1.0.0-317
./install.sh

Accept License Agreement -> Next

Browse to ../data-integration/plugins directory

Click 'Next' and accept overwrite warning.

Restart Pentaho Data Integration & check for Hierarchical folder.

The following Labs highlight some of the Use Cases

The Extract rows step is obvious .. Working in combination with the Hierarchical JSON Input step you are able to filter and extract specific row(s).

Open the following transformation:

~/Workshop--Pentaho-Data-Integration/Module 3

You can use the Hierarchical JSON input step to load JSON data into PDI from a file / previous step.

Filters to load only the desired data. The data can be split on a hierarchical data path using wildcards.

Source tab

Double-click on the Hierarchical JSON Input step to see how its configured.

Option/Field

Description

From file

Select to specify the file path and name of the JSON file you want to load into PDI.

File name

File path and name of the JSON file to load.

From field

Select to use an incoming field as the JSON file path.

Field with file name

The incoming field containing the JSON file path.

Output

The Split rows across path option is especially useful when loading JSON array objects within large JSON files.

When you use the Split rows across path field you must specify all filter paths rooted at the split path. If you do not use the Split rows across path field a normal HDT extraction path is used.

Click on the Output tab.

Field

Description

Output field

Specify the field name for output column.

Split rows across path

Specify the JSON path to be parsed.

In this example, suppose this JSON file contained other hierarchies based on business units, salary, managers, etc .. The split rows across path: $.employees[*] references all the employees fields, the syntax referencing the path to employees from the root.

Filters

Use the Path field (Optional) to specify the filters to apply while using the Split rows across path option to fetch the subset of a JSON file.