# Metadata Store

{% hint style="info" %}
**Metadata Store in Data Catalogs: A Foundation for Effective Data Management**

A **metadata store** within a data catalog serves as a critical repository for storing metadata from various sources, ensuring the preservation of their relationships, historical context, and vital details. Acting as the backbone of a data catalog's architecture, this specialized database is instrumental in streamlining the organization and handling of metadata. This, in turn, significantly aids in the efficient discovery, understanding, and governance of data.

The primary role of the metadata store is to facilitate easy access to the correct data assets for users by encompassing **technical**, **process**, and **business metadata**. This includes comprehensive information on the data's structure, its lineage, utilization patterns, and the business context it operates within. By making data assets readily identifiable, assessable, and actionable, the metadata store plays a pivotal role in bolstering data-driven decision-making processes.
{% endhint %}

{% tabs %}
{% tab title="Metadata Store" %}
{% hint style="info" %}
The diagram depicts a simplified representation of the key data model of the PDC Metadata Store (MDS). To understand the relationships between the 'Collections' , it's illustrated in a relational database format.&#x20;

The model illustrates the relationships between data sources and entities where the customer data assets and discovered metadata properties are persisted as rich JSON graphs.&#x20;

At the core of the model is the `mds.entities` collection.
{% endhint %}

<figure><img src="/files/RLDGk95E7jJXIkGYJhhN" alt=""><figcaption><p>Data Model of Pentaho Data Catalog - Metadata Store (MDS)</p></figcaption></figure>

#### Connect to Metadata Store

1. Run NoSQLBooster.

```bash
cd
cd ~/nosqlbooster4mongo*
./nosqlbooster4mongo*
```

{% hint style="info" %}
So what happens when you connect/ingest/profile a datasource?
{% endhint %}

{% tabs %}
{% tab title="Connect" %}
{% hint style="info" %}
When you create a data source connection, for example, to the Synthea dataset on PostgreSQL, the details are persisted in the **resourceconnections** collection located in the **fe** database.
{% endhint %}

<figure><img src="/files/FUeH9TretHXj4zeL7qlR" alt=""><figcaption><p>resourceconnections</p></figcaption></figure>

{% hint style="info" %}
If not not familiar with MQL, the easiest way to query the MongoDB database is to use SQL..!

For example, to display the connection name in a dashboard:

**resourceName** where **configMethod = 'credentials'**&#x20;

The following table shows how SQL components map to MongoDB document components.
{% endhint %}

| SQL          | MongoDB    |
| ------------ | ---------- |
| database     | database   |
| table        | collection |
| column       | field      |
| row (record) | document   |

1. Run NoSQLBooster.

```bash
cd
cd ~/nosqlbooster4mongo*
./nosqlbooster4mongo*
```

2. In the Query panel, enter the following SQL statement.

```bash
//Single quotes are for strings 'string'
mb.runSQLQuery(`
       SELECT resourceName FROM resourceconnections 
       WHERE configMethod = 'credentials'     
`).sort({_id:-1})
  .limit(100)
```

<figure><img src="/files/UHZPqxbZvq5Acq7HC68n" alt=""><figcaption><p>SQL Query</p></figcaption></figure>

As expected the result is: postgresql:synthea as there's currently only 1 connection defined.
{% endtab %}

{% tab title="Ingest" %}
{% hint style="info" %}
When you ingest the database schema, the connection details are persisted in the **datasources** collection located in the **mds** database.
{% endhint %}

<figure><img src="/files/R24Qp213j1Yo2gyy0ns7" alt=""><figcaption><p>mds.datasources</p></figcaption></figure>

{% hint style="info" %}
The database schema metadata details are persisted in the **entities** collection (synthea - 179 documents).

In the example below, the **zip** column properties, from the **providers** 'table'.
{% endhint %}

<figure><img src="/files/4RmK3fxHq7lPxh1CMo0Y" alt=""><figcaption><p>mds.entities</p></figcaption></figure>

{% hint style="info" %}
Taking a look at the MDS diagram above we can see that the recursive **mds.entities.type** refers to:

SCHEMA - synthea

&#x20;  TABLE - 'aggregated' documents

&#x20;     RESOURCE

&#x20;        COLUMN -&#x20;

This is useful for querying the dataset. The mds.entites collection consists of documents (records) that are referenced by knowing the 'type:'.
{% endhint %}

1. Edit the query to retrieve the fields on the 'patients' table (you could have ingested more than 1 schema).

```bash
db.entities.find({type:'TABLE',_id:'1/business_apps_db/synthea/patients'})
   .projection({})
   .sort({_id:-1})
   .limit(100)
```

{% hint style="success" %}
TIP: To find the TABLE \_id, under Actions select the option: Copy path.
{% endhint %}

<figure><img src="/files/WGdpTQJVPdrgZRIXWkul" alt=""><figcaption><p>Copy Path</p></figcaption></figure>

```bash
mb.runSQLQuery(`
       SELECT * FROM entities 
       WHERE type = 'TABLE' AND name = 'patients'       
`).sort({_id:-1})
  .limit(100)
```

<figure><img src="/files/Te2BcY2M71vyd8hnzijb" alt=""><figcaption><p>SQL Query - type: 'TABLE' name:'patients'</p></figcaption></figure>

***

{% hint style="info" %}
Notice the out-of-the-box predefined collections for:

* datapatterns
* dictionaries
* dictionaryTerms
  {% endhint %}
  {% endtab %}

{% tab title="Profile" %}
{% hint style="info" %}
The data profiling process generates statistical and intermediate data that is required by other data analytic processes. The intermediate data is consumed by downstream processes such as data flow and foreign key detection.

Referencing the MDS model, the resulting **mds.data\_profile**&#x20;
{% endhint %}

The intermediate data generated for each column of data includes:

<table><thead><tr><th width="212">Profile Action</th><th>Description</th></tr></thead><tbody><tr><td>Roaring Bitset</td><td>A bitmap of the hash values for all entries in the column.</td></tr><tr><td>HyperLogLog (HLL)</td><td>Provides an estimate of the cardinality of the data, with a roughly ~2% margin of error.</td></tr><tr><td>Data Pattern Analysis</td><td>Performs a rudimentary data pattern analysis using dimensional reduction, tracking the most frequently occurring patterns.</td></tr><tr><td>Data Quality Pre-Analysis</td><td>Using the Data Pattern Analysis results, Data Catalog performs a statistical estimation of the data quality is performed. This is summarized as an overall percentage as well as a heat map for each data pattern. Additionally, Data Catalog makes RegEx recommendations for the most probable matches.</td></tr><tr><td>Statistics</td><td><p>Data Catalog gathers the following statistics when examining all the data:</p><p>• Minimum and Maximum values (for numeric columns)</p><p>• Widest and Narrowest (non-null) string widths</p><p>• Null count</p><p>• Total row count</p></td></tr><tr><td>Data Sampling</td><td>Data Catalog takes a controlled sampling of the data so that the samples are consistently chosen across different columns.</td></tr></tbody></table>

{% hint style="info" %}
Taking a look at the MDS diagram above we can see that the recursive **mds.entities.type** refers to:

SCHEMA - synthea

&#x20;  TABLE&#x20;

&#x20;     RESOURCE

&#x20;        COLUMN

This is useful for querying the dataset. The mds.entites collection consists of documents (records) that are referenced by knowing the 'type:'.
{% endhint %}

1. Edit the query to retrieve the fields for the '**patients.passport**' column.

```bash
db.entities.find({type:'COLUMN',_id:'1/business_apps_db/synthea/patients/passport'})
   .projection({})
   .sort({_id:-1})
   .limit(100)
```

{% hint style="success" %}
TIP: To find the COLUMN \_id, under Actions select the option: Copy path.
{% endhint %}

```bash
mb.runSQLQuery(`
   SELECT * FROM entities
   WHERE type = 'COLUMN' AND name = 'passport'
`).sort({_id:-1})
.limit(100)
```

<figure><img src="/files/b5vdyE1SOCWRFR2SGUIB" alt=""><figcaption><p>Data Profiling fields for patients.passport column.</p></figcaption></figure>

{% hint style="info" %}
The Statistics, Data Patterns and Properties are displayed in the Data Canvas.
{% endhint %}

<figure><img src="/files/pOixscNNfAusP1NvwlZj" alt=""><figcaption><p>Statistics - patients.passport</p></figcaption></figure>

{% hint style="info" %}
Obviously .. a lot more data is captured, which is stored in **mds.data\_profile** collection.
{% endhint %}

2. Edit the query to retrieve the fields for the '**mds.data\_profile.patients.passport**' column.

```bash
db.data_profile.find({_id:'1/business_apps_db/synthea/patients/passport'})
   .projection({})
   .sort({_id:-1})
   .limit(100)
```

```bash
mb.runSQLQuery(`
       SELECT * FROM data_profile 
       WHERE _id = '1/business_apps_db/synthea/patients/passport'     
`).sort({_id:-1})
  .limit(100)
```

<figure><img src="/files/cEgqgG49oMWsh0aYeZ9P" alt=""><figcaption><p>Data Profile - data_profile.passports</p></figcaption></figure>

x

x
{% endtab %}

{% tab title="Reporting" %}
{% hint style="warning" %}
Ensure you have installed & configured [**DBVisualizer**](#dbvisualizer-and-presto) before commencing the following steps.
{% endhint %}

x

x

x

x

x
{% endtab %}
{% endtabs %}
{% endtab %}

{% tab title="NoSQLBooster" %}
{% hint style="info" %}
NoSQLBooster is a cross-platform GUI tool for MongoDB Server 3.6-7.0, which provides a build-in MongoDB script debugger, comprehensive server monitoring tools, chaining fluent query, SQL query, query code generator, task scheduling, ES2020 support, and advanced IntelliSense experience.
{% endhint %}

x

{% hint style="warning" %}
To watch the videos please copy and paste the website URL into your **host** Chrome browser.
{% endhint %}

x

{% tabs %}
{% tab title="Enable MongoDB Ports" %}
{% hint style="warning" %}
Be aware this exposes MongoDB port ..!
{% endhint %}

1. Navigate to the PDC depolyment folder.

```bash
cd
cd /opt/pentaho/pdc-docker-deployment/vendor/
```

2. Edit docker-compose.yml file.

```bash
nano docker-compose.yml
```

3. Uncomment ports under mongodb in docker-compose.yml

```
  mongodb:
    image: ${PDC_IMAGE_PREFIX}pdc-mongodb-ee:${PDC_MONGODB_TAG:-master}
    ports:
      - "27017:27017"
```

4. Save changes.

```
CTRL + O
Enter
CTRL + X
```

5. Restart PDC.

```bash
cd
cd /opt/pentaho/pdc-docker-deployment
sudo ./pdc.sh up
```

{% hint style="warning" %}
Ensure all the containers successfully restart ..
{% endhint %}
{% endtab %}

{% tab title="Installation" %}
{% hint style="info" %}
x
{% endhint %}

<figure><img src="/files/Y8IGg1xtkC0AP4fhNfy9" alt=""><figcaption><p>NoSQLBooster</p></figcaption></figure>

1. Download NoSQLBooster.

```bash
cd
cd ~/Downloads
wget https://s3.nosqlbooster.com/download/releasesv8/nosqlbooster4mongo-8.1.7.tar.gz
```

2. Untar the package.

```bash
cd
cd ~/Downloads
tar -xvf nosqlbooster4mongo-8.1.7.tar.gz
```

3. Move the NoSQLBooster folder to home.

```bash
cd
cd ~/Downloads
mv nosqlbooster4mongo* /home/pdc
```

4. Run NoSQLBooster.

```bash
cd
cd ~/nosqlbooster4mongo*
ls
./nosqlbooster4mongo*
```

<figure><img src="/files/mg6LiZLVkBYuUHm35vYy" alt=""><figcaption><p>NoSQLBooster</p></figcaption></figure>
{% endtab %}

{% tab title="Connect" %}
{% hint style="info" %}

{% endhint %}

1. Run NoSQLBooster.

```bash
cd
cd ~/nosqlbooster4mongo*
./nosqlbooster4mongo*
```

2. From the Connect option -> Connect From Connection String URI

![](/files/W67BJ0fgDEuEASk10RSF)

3. Enter the following details.

<table data-header-hidden><thead><tr><th width="196"></th><th></th></tr></thead><tbody><tr><td>Connection string</td><td>mongodb://root:broot@localhost:27017</td></tr></tbody></table>

<figure><img src="/files/Bvd9LL2WjWj8yFvw8JH4" alt=""><figcaption><p>Connection URI</p></figcaption></figure>

4. Finally ..  click 'Connect'

<figure><img src="/files/iQL6gS9RNggoqrvQQjKH" alt=""><figcaption><p>Connect to MongoDB</p></figcaption></figure>

***

#### **Troubleshooting** <a href="#troubleshooting" id="troubleshooting"></a>

When a VM is started up, sometimes not all Docker PDC docker containers start up and as a result some functionality may not work such as:

* Login page not showing login panel
* Register Page not creating root user
* Adding a data source and Test Connection button doesn't work

In this case, navigate to the PDC docker deployment location (`/opt/pentaho/pdc-docker-deployment`) and run `./pdc.sh up` to ensure all PDC containers are running.
{% endtab %}
{% endtabs %}
{% endtab %}

{% tab title="Compass" %}
{% hint style="info" %}
MongoDB Compass is a powerful GUI tool that simplifies the processes of querying, aggregating, and analyzing MongoDB data through a user-friendly visual interface.
{% endhint %}

{% embed url="<https://www.mongodb.com/products/tools/compass>" %}

{% hint style="warning" %}
Ensure you have enabled the MongoDB ports: [**Enable MongoDB ports**](#enable-mongodb-ports)
{% endhint %}

<figure><img src="/files/lGcyrSChLuKNUGNvyrW3" alt=""><figcaption><p>Compass</p></figcaption></figure>

{% tabs %}
{% tab title="Installation" %}
{% hint style="info" %}
Check for the latest package:

[**https://www.mongodb.com/try/download/compass**](https://www.mongodb.com/try/download/compass)
{% endhint %}

1. Update the Ubuntu system’s repositories.

```bash
sudo apt update
```

2. Download MongoDB compass.

{% hint style="warning" %}
You may need to edit the version in the request below.
{% endhint %}

```bash
cd
cd ~/Downloads
wget https://downloads.mongodb.com/compass/mongodb-compass_1.43.4_amd64.deb
```

3. Install MongoDB Compass.

```bash
cd
cd ~/Downloads
sudo apt install ./mongodb-compass_1.43.4_amd64.deb
```

{% endtab %}

{% tab title="Connect" %}
{% hint style="info" %}

{% endhint %}

1. Click on New Connection.

<figure><img src="/files/BUY55fzqgN5wIQFIT7G3" alt=""><figcaption><p>Compass - URI</p></figcaption></figure>

{% hint style="warning" %}
Ensure you check the option: Direct Connection
{% endhint %}

2. Click on the Authentication option.

<figure><img src="/files/OSwpBUdWhWOszmfYDWvy" alt=""><figcaption><p>Compass - Authentication</p></figcaption></figure>

3. Enter the following details:

<table data-header-hidden><thead><tr><th width="231"></th><th></th></tr></thead><tbody><tr><td>Username</td><td>root</td></tr><tr><td>Password</td><td>broot</td></tr></tbody></table>

4. Click Connect.

<figure><img src="/files/NwA4WEdqsVcuaie0RAQi" alt=""><figcaption><p>Compass - mds.entities</p></figcaption></figure>
{% endtab %}

{% tab title="Reporting" %}
{% hint style="info" %}
Through the MongoDB Atlas SQL Interface, Connectors, and Drivers, you can leverage your existing SQL knowledge to query and analyze live application data directly from your preferred SQL-based tool.
{% endhint %}

{% embed url="<https://www.mongodb.com/try/download/jdbc-driver>" %}
Link to connectors & drivers for reporting
{% endembed %}

x

x
{% endtab %}
{% endtabs %}
{% endtab %}

{% tab title="Altair" %}
{% hint style="info" %}
Altair is a feature-rich GraphQL client that enables developers to debug and test their GraphQL queries. It provides an intuitive interface for crafting queries, setting variables, and viewing responses.&#x20;

Altair supports various features like query autocompletion, response formatting, and even documentation exploration, making it an invaluable tool for working with GraphQL APIs.&#x20;
{% endhint %}

<figure><img src="/files/8iFlF84AtL2gzZjFzBg0" alt=""><figcaption><p>Altair</p></figcaption></figure>

1. To access Altair: [**http://localhost/graghql2**](http://localhost/graghql2)
2. Enter your query:

<table><thead><tr><th width="280">Query</th><th>Description</th></tr></thead><tbody><tr><td>query {userCount(filter:{})}</td><td>Number of Users</td></tr><tr><td><p>query {userMany(limit:100){ email </p><p>date </p><p>firstName </p><p>lastName </p><p>lastLogin </p><p>modified </p><p>modifiedBy </p><p>status </p><p>_id }}</p></td><td>Returns User information - Limit 100</td></tr><tr><td></td><td></td></tr><tr><td></td><td></td></tr></tbody></table>
{% endtab %}

{% tab title="DBVisualizer & Presto" %}
{% hint style="info" %}
The bidb database is structured to allow a JDBC MySQL connection. Useful for creating audit reports.&#x20;
{% endhint %}

<figure><img src="/files/UNBUbqDWsoex9W7XlWu5" alt=""><figcaption><p>bidb database</p></figcaption></figure>

{% tabs %}
{% tab title="Install DbVisualizer & Presto" %}
{% hint style="info" %}
DbVisualizer lets you visualize, manage and edit data with the power of SQL and simplicity of a spreadsheet.
{% endhint %}

#### Install Java

1. Ensure the `apt` libraries are updated.

```bash
sudo apt update && sudo apt upgrade -y
```

2. Check if Java has been installed.

```bash
java -version
```

3. To see all available editions of OpenJDK.

```bash
apt-cache search openjdk
```

4. Install OpenJDK 21 JDK.

```bash
sudo apt install openjdk-21-jdk
```

5. Install OpenJDK 21 JRE.

```bash
sudo apt install openjdk-21-jre-headless
```

5. Check installation.

```bash
java -version
```

```
pdc@pdc:~$ java -version
openjdk version "21.0.3" 2024-04-16
OpenJDK Runtime Environment (build 21.0.3+9-Ubuntu-1ubuntu1)
OpenJDK 64-Bit Server VM (build 21.0.3+9-Ubuntu-1ubuntu1, mixed mode, sharing)
```

***

#### Set the Java environment variables

{% hint style="info" %}
Although Java is already installed, further configuration is required. Setting a few environment variables makes OpenJDK easier to use and allows other applications to find it.&#x20;
{% endhint %}

1. Edit the `.bashrc` file in your home directory.

```bash
cd
sudo nano ~/.bashrc
```

2. Add the following lines to the bottom of the file.

```
export JAVA_HOME=$(dirname $(dirname $(readlink -f $(which java))))
export PATH=$PATH:$JAVA_HOME/bin
```

3. Save.

```
Ctrl + o
Enter
Ctrl + x
```

3. Source to apply changes.

```bash
source ~/.bashrc
```

4. Echo `JAVA_HOME` and `PATH` to verify they are set correctly.

```bash
echo $JAVA_HOME
echo $PATH
```

```
/usr/lib/jvm/java-21-openjdk-amd64
/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/games:/usr/local/games:/snap/bin:/snap/bin:/usr/lib/jvm/java-21-openjdk-amd64/bin
```

{% hint style="info" %}
To set these values for all system users, add the following changes to `/etc/environment` instead.
{% endhint %}

***

#### Install DbVisualizer

1. Download the Linux - without Java version.

{% embed url="<https://www.dbvis.com/download/>" %}
Download DBVisualizer
{% endembed %}

<figure><img src="/files/LUanzUamate1mmdZ4rHW" alt=""><figcaption><p>DBVisualizer - download</p></figcaption></figure>

2. Execute the script located in the Downloads directory.

{% hint style="warning" %}
Ensure the file has execute permission.

You will need to edit the version before executing the shell script.
{% endhint %}

```bash
cd
cd ~/Downloads
./dbvis_linux_24_1_3.sh
```

3. Accept the default path / settings.

<figure><img src="/files/bdWZiFJEfNszXjsKClV0" alt=""><figcaption><p>Keep default path</p></figcaption></figure>

4. Once 'Finished'. Accept Free license forever.

<figure><img src="/files/DWtWNRUNxdK9SDilHOFJ" alt=""><figcaption><p>DBVisualizer UI</p></figcaption></figure>

5. To start DbVisualizer.

```bash
cd
cd ~/DbVisualizer
./dbvis
```

{% endtab %}

{% tab title="Connect to bidb database" %}
{% hint style="info" %}
As we know MongoDB stores the data as a 'Collection' of 'Documents' in a flat hierarchical structure.&#x20;

The BIDB database contains aggregated documents, useful for reporting.
{% endhint %}

1. Start DBVisualizer.

```bash
cd
cd ~/DbVisualizer
./dbvis
```

2. Click on the blue cross to create a new connection.

<figure><img src="/files/8UZNi1ODUDGttAYIcI9Y" alt=""><figcaption><p>Create a new database connection</p></figcaption></figure>

3. From the drop-down list, select MySQL 5.

<figure><img src="/files/hC6vz0I6DIehF6Y8QTR5" alt=""><figcaption><p>MySQL 5</p></figcaption></figure>

4. Enter the following details:

<table><thead><tr><th width="267"></th><th></th></tr></thead><tbody><tr><td>Connection Name</td><td>BIDB</td></tr><tr><td>Database Server</td><td>localhost</td></tr><tr><td>Database Port</td><td>3307</td></tr><tr><td>Use SSL</td><td>Automatic</td></tr><tr><td>Database Userid</td><td>root</td></tr><tr><td>Database Password</td><td>groot</td></tr></tbody></table>

<figure><img src="/files/EcznkkGGE41oKyub6DJC" alt=""><figcaption><p>BIDB Connection</p></figcaption></figure>

5. Ping the server & Connect.
6. using SQL Commander, Query the database.

<figure><img src="/files/Mjl7wCd2MpaZ6744La9y" alt=""><figcaption><p>SQL Query</p></figcaption></figure>
{% endtab %}
{% endtabs %}
{% endtab %}
{% endtabs %}


---

# Agent Instructions: Querying This Documentation

If you need additional information that is not directly available in this page, you can query the documentation dynamically by asking a question.

Perform an HTTP GET request on the current page URL with the `ask` query parameter:

```
GET https://academy.pentaho.com/pentaho-data-catalog-en/setup/components/metadata-store.md?ask=<question>
```

The question should be specific, self-contained, and written in natural language.
The response will contain a direct answer to the question and relevant excerpts and sources from the documentation.

Use this mechanism when the answer is not explicitly present in the current page, you need clarification or additional context, or you want to retrieve related documentation sections.
