# Metadata Store {% hint style="info" %} **Metadata Store in Data Catalogs: A Foundation for Effective Data Management** A **metadata store** within a data catalog serves as a critical repository for storing metadata from various sources, ensuring the preservation of their relationships, historical context, and vital details. Acting as the backbone of a data catalog's architecture, this specialized database is instrumental in streamlining the organization and handling of metadata. This, in turn, significantly aids in the efficient discovery, understanding, and governance of data. The primary role of the metadata store is to facilitate easy access to the correct data assets for users by encompassing **technical**, **process**, and **business metadata**. This includes comprehensive information on the data's structure, its lineage, utilization patterns, and the business context it operates within. By making data assets readily identifiable, assessable, and actionable, the metadata store plays a pivotal role in bolstering data-driven decision-making processes. {% endhint %} {% tabs %} {% tab title="Metadata Store" %} {% hint style="info" %} The diagram depicts a simplified representation of the key data model of the PDC Metadata Store (MDS). To understand the relationships between the 'Collections' , it's illustrated in a relational database format. The model illustrates the relationships between data sources and entities where the customer data assets and discovered metadata properties are persisted as rich JSON graphs. At the core of the model is the `mds.entities` collection. {% endhint %}

Data Model of Pentaho Data Catalog - Metadata Store (MDS)

#### Connect to Metadata Store 1. Run NoSQLBooster. ```bash cd cd ~/nosqlbooster4mongo* ./nosqlbooster4mongo* ``` {% hint style="info" %} So what happens when you connect/ingest/profile a datasource? {% endhint %} {% tabs %} {% tab title="Connect" %} {% hint style="info" %} When you create a data source connection, for example, to the Synthea dataset on PostgreSQL, the details are persisted in the **resourceconnections** collection located in the **fe** database. {% endhint %}

{% hint style="info" %} If not not familiar with MQL, the easiest way to query the MongoDB database is to use SQL..! For example, to display the connection name in a dashboard: **resourceName** where **configMethod = 'credentials'** The following table shows how SQL components map to MongoDB document components. {% endhint %} | SQL | MongoDB | | ------------ | ---------- | | database | database | | table | collection | | column | field | | row (record) | document | 1. Run NoSQLBooster. ```bash cd cd ~/nosqlbooster4mongo* ./nosqlbooster4mongo* ``` 2. In the Query panel, enter the following SQL statement. ```bash //Single quotes are for strings 'string' mb.runSQLQuery(` SELECT resourceName FROM resourceconnections WHERE configMethod = 'credentials' `).sort({_id:-1}) .limit(100) ```

As expected the result is: postgresql:synthea as there's currently only 1 connection defined. {% endtab %} {% tab title="Ingest" %} {% hint style="info" %} When you ingest the database schema, the connection details are persisted in the **datasources** collection located in the **mds** database. {% endhint %}

{% hint style="info" %} The database schema metadata details are persisted in the **entities** collection (synthea - 179 documents). In the example below, the **zip** column properties, from the **providers** 'table'. {% endhint %}

{% hint style="info" %} Taking a look at the MDS diagram above we can see that the recursive **mds.entities.type** refers to: SCHEMA - synthea TABLE - 'aggregated' documents RESOURCE COLUMN - This is useful for querying the dataset. The mds.entites collection consists of documents (records) that are referenced by knowing the 'type:'. {% endhint %} 1. Edit the query to retrieve the fields on the 'patients' table (you could have ingested more than 1 schema). ```bash db.entities.find({type:'TABLE',_id:'1/business_apps_db/synthea/patients'}) .projection({}) .sort({_id:-1}) .limit(100) ``` {% hint style="success" %} TIP: To find the TABLE \_id, under Actions select the option: Copy path. {% endhint %}

```bash mb.runSQLQuery(` SELECT * FROM entities WHERE type = 'TABLE' AND name = 'patients' `).sort({_id:-1}) .limit(100) ```

SQL Query - type: 'TABLE' name:'patients'

*** {% hint style="info" %} Notice the out-of-the-box predefined collections for: * datapatterns * dictionaries * dictionaryTerms {% endhint %} {% endtab %} {% tab title="Profile" %} {% hint style="info" %} The data profiling process generates statistical and intermediate data that is required by other data analytic processes. The intermediate data is consumed by downstream processes such as data flow and foreign key detection. Referencing the MDS model, the resulting **mds.data\_profile** {% endhint %} The intermediate data generated for each column of data includes:

Profile Action	Description
Roaring Bitset	A bitmap of the hash values for all entries in the column.
HyperLogLog (HLL)	Provides an estimate of the cardinality of the data, with a roughly ~2% margin of error.
Data Pattern Analysis	Performs a rudimentary data pattern analysis using dimensional reduction, tracking the most frequently occurring patterns.
Data Quality Pre-Analysis	Using the Data Pattern Analysis results, Data Catalog performs a statistical estimation of the data quality is performed. This is summarized as an overall percentage as well as a heat map for each data pattern. Additionally, Data Catalog makes RegEx recommendations for the most probable matches.
Statistics	Data Catalog gathers the following statistics when examining all the data: • Minimum and Maximum values (for numeric columns) • Widest and Narrowest (non-null) string widths • Null count • Total row count
Data Sampling	Data Catalog takes a controlled sampling of the data so that the samples are consistently chosen across different columns.

{% hint style="info" %} Taking a look at the MDS diagram above we can see that the recursive **mds.entities.type** refers to: SCHEMA - synthea TABLE RESOURCE COLUMN This is useful for querying the dataset. The mds.entites collection consists of documents (records) that are referenced by knowing the 'type:'. {% endhint %} 1. Edit the query to retrieve the fields for the '**patients.passport**' column. ```bash db.entities.find({type:'COLUMN',_id:'1/business_apps_db/synthea/patients/passport'}) .projection({}) .sort({_id:-1}) .limit(100) ``` {% hint style="success" %} TIP: To find the COLUMN \_id, under Actions select the option: Copy path. {% endhint %} ```bash mb.runSQLQuery(` SELECT * FROM entities WHERE type = 'COLUMN' AND name = 'passport' `).sort({_id:-1}) .limit(100) ```

Data Profiling fields for patients.passport column.

{% hint style="info" %} The Statistics, Data Patterns and Properties are displayed in the Data Canvas. {% endhint %}

{% hint style="info" %} Obviously .. a lot more data is captured, which is stored in **mds.data\_profile** collection. {% endhint %} 2. Edit the query to retrieve the fields for the '**mds.data\_profile.patients.passport**' column. ```bash db.data_profile.find({_id:'1/business_apps_db/synthea/patients/passport'}) .projection({}) .sort({_id:-1}) .limit(100) ``` ```bash mb.runSQLQuery(` SELECT * FROM data_profile WHERE _id = '1/business_apps_db/synthea/patients/passport' `).sort({_id:-1}) .limit(100) ```

x x {% endtab %} {% tab title="Reporting" %} {% hint style="warning" %} Ensure you have installed & configured [**DBVisualizer**](#dbvisualizer-and-presto) before commencing the following steps. {% endhint %} x x x x x {% endtab %} {% endtabs %} {% endtab %} {% tab title="NoSQLBooster" %} {% hint style="info" %} NoSQLBooster is a cross-platform GUI tool for MongoDB Server 3.6-7.0, which provides a build-in MongoDB script debugger, comprehensive server monitoring tools, chaining fluent query, SQL query, query code generator, task scheduling, ES2020 support, and advanced IntelliSense experience. {% endhint %} x {% hint style="warning" %} To watch the videos please copy and paste the website URL into your **host** Chrome browser. {% endhint %} x {% tabs %} {% tab title="Enable MongoDB Ports" %} {% hint style="warning" %} Be aware this exposes MongoDB port ..! {% endhint %} 1. Navigate to the PDC depolyment folder. ```bash cd cd /opt/pentaho/pdc-docker-deployment/vendor/ ``` 2. Edit docker-compose.yml file. ```bash nano docker-compose.yml ``` 3. Uncomment ports under mongodb in docker-compose.yml ``` mongodb: image: ${PDC_IMAGE_PREFIX}pdc-mongodb-ee:${PDC_MONGODB_TAG:-master} ports: - "27017:27017" ``` 4. Save changes. ``` CTRL + O Enter CTRL + X ``` 5. Restart PDC. ```bash cd cd /opt/pentaho/pdc-docker-deployment sudo ./pdc.sh up ``` {% hint style="warning" %} Ensure all the containers successfully restart .. {% endhint %} {% endtab %} {% tab title="Installation" %} {% hint style="info" %} x {% endhint %}

1. Download NoSQLBooster. ```bash cd cd ~/Downloads wget https://s3.nosqlbooster.com/download/releasesv8/nosqlbooster4mongo-8.1.7.tar.gz ``` 2. Untar the package. ```bash cd cd ~/Downloads tar -xvf nosqlbooster4mongo-8.1.7.tar.gz ``` 3. Move the NoSQLBooster folder to home. ```bash cd cd ~/Downloads mv nosqlbooster4mongo* /home/pdc ``` 4. Run NoSQLBooster. ```bash cd cd ~/nosqlbooster4mongo* ls ./nosqlbooster4mongo* ```

{% endtab %} {% tab title="Connect" %} {% hint style="info" %} {% endhint %} 1. Run NoSQLBooster. ```bash cd cd ~/nosqlbooster4mongo* ./nosqlbooster4mongo* ``` 2. From the Connect option -> Connect From Connection String URI ![](/files/W67BJ0fgDEuEASk10RSF) 3. Enter the following details.


Connection string	mongodb://root:broot@localhost:27017

4. Finally .. click 'Connect'

*** #### **Troubleshooting** When a VM is started up, sometimes not all Docker PDC docker containers start up and as a result some functionality may not work such as: * Login page not showing login panel * Register Page not creating root user * Adding a data source and Test Connection button doesn't work In this case, navigate to the PDC docker deployment location (`/opt/pentaho/pdc-docker-deployment`) and run `./pdc.sh up` to ensure all PDC containers are running. {% endtab %} {% endtabs %} {% endtab %} {% tab title="Compass" %} {% hint style="info" %} MongoDB Compass is a powerful GUI tool that simplifies the processes of querying, aggregating, and analyzing MongoDB data through a user-friendly visual interface. {% endhint %} {% embed url="" %} {% hint style="warning" %} Ensure you have enabled the MongoDB ports: [**Enable MongoDB ports**](#enable-mongodb-ports) {% endhint %}

{% tabs %} {% tab title="Installation" %} {% hint style="info" %} Check for the latest package: [**https://www.mongodb.com/try/download/compass**](https://www.mongodb.com/try/download/compass) {% endhint %} 1. Update the Ubuntu system’s repositories. ```bash sudo apt update ``` 2. Download MongoDB compass. {% hint style="warning" %} You may need to edit the version in the request below. {% endhint %} ```bash cd cd ~/Downloads wget https://downloads.mongodb.com/compass/mongodb-compass_1.43.4_amd64.deb ``` 3. Install MongoDB Compass. ```bash cd cd ~/Downloads sudo apt install ./mongodb-compass_1.43.4_amd64.deb ``` {% endtab %} {% tab title="Connect" %} {% hint style="info" %} {% endhint %} 1. Click on New Connection.

{% hint style="warning" %} Ensure you check the option: Direct Connection {% endhint %} 2. Click on the Authentication option.

3. Enter the following details:


Username	root
Password	broot

4. Click Connect.

{% endtab %} {% tab title="Reporting" %} {% hint style="info" %} Through the MongoDB Atlas SQL Interface, Connectors, and Drivers, you can leverage your existing SQL knowledge to query and analyze live application data directly from your preferred SQL-based tool. {% endhint %} {% embed url="" %} Link to connectors & drivers for reporting {% endembed %} x x {% endtab %} {% endtabs %} {% endtab %} {% tab title="Altair" %} {% hint style="info" %} Altair is a feature-rich GraphQL client that enables developers to debug and test their GraphQL queries. It provides an intuitive interface for crafting queries, setting variables, and viewing responses. Altair supports various features like query autocompletion, response formatting, and even documentation exploration, making it an invaluable tool for working with GraphQL APIs. {% endhint %}

1. To access Altair: [**http://localhost/graghql2**](http://localhost/graghql2) 2. Enter your query:

Query	Description
query {userCount(filter:{})}	Number of Users
query {userMany(limit:100){ email date firstName lastName lastLogin modified modifiedBy status _id }}	Returns User information - Limit 100

{% endtab %} {% tab title="DBVisualizer & Presto" %} {% hint style="info" %} The bidb database is structured to allow a JDBC MySQL connection. Useful for creating audit reports. {% endhint %}

{% tabs %} {% tab title="Install DbVisualizer & Presto" %} {% hint style="info" %} DbVisualizer lets you visualize, manage and edit data with the power of SQL and simplicity of a spreadsheet. {% endhint %} #### Install Java 1. Ensure the `apt` libraries are updated. ```bash sudo apt update && sudo apt upgrade -y ``` 2. Check if Java has been installed. ```bash java -version ``` 3. To see all available editions of OpenJDK. ```bash apt-cache search openjdk ``` 4. Install OpenJDK 21 JDK. ```bash sudo apt install openjdk-21-jdk ``` 5. Install OpenJDK 21 JRE. ```bash sudo apt install openjdk-21-jre-headless ``` 5. Check installation. ```bash java -version ``` ``` pdc@pdc:~$ java -version openjdk version "21.0.3" 2024-04-16 OpenJDK Runtime Environment (build 21.0.3+9-Ubuntu-1ubuntu1) OpenJDK 64-Bit Server VM (build 21.0.3+9-Ubuntu-1ubuntu1, mixed mode, sharing) ``` *** #### Set the Java environment variables {% hint style="info" %} Although Java is already installed, further configuration is required. Setting a few environment variables makes OpenJDK easier to use and allows other applications to find it. {% endhint %} 1. Edit the `.bashrc` file in your home directory. ```bash cd sudo nano ~/.bashrc ``` 2. Add the following lines to the bottom of the file. ``` export JAVA_HOME=$(dirname $(dirname $(readlink -f $(which java)))) export PATH=$PATH:$JAVA_HOME/bin ``` 3. Save. ``` Ctrl + o Enter Ctrl + x ``` 3. Source to apply changes. ```bash source ~/.bashrc ``` 4. Echo `JAVA_HOME` and `PATH` to verify they are set correctly. ```bash echo $JAVA_HOME echo $PATH ``` ``` /usr/lib/jvm/java-21-openjdk-amd64 /usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/games:/usr/local/games:/snap/bin:/snap/bin:/usr/lib/jvm/java-21-openjdk-amd64/bin ``` {% hint style="info" %} To set these values for all system users, add the following changes to `/etc/environment` instead. {% endhint %} *** #### Install DbVisualizer 1. Download the Linux - without Java version. {% embed url="" %} Download DBVisualizer {% endembed %}

2. Execute the script located in the Downloads directory. {% hint style="warning" %} Ensure the file has execute permission. You will need to edit the version before executing the shell script. {% endhint %} ```bash cd cd ~/Downloads ./dbvis_linux_24_1_3.sh ``` 3. Accept the default path / settings.

4. Once 'Finished'. Accept Free license forever.

5. To start DbVisualizer. ```bash cd cd ~/DbVisualizer ./dbvis ``` {% endtab %} {% tab title="Connect to bidb database" %} {% hint style="info" %} As we know MongoDB stores the data as a 'Collection' of 'Documents' in a flat hierarchical structure. The BIDB database contains aggregated documents, useful for reporting. {% endhint %} 1. Start DBVisualizer. ```bash cd cd ~/DbVisualizer ./dbvis ``` 2. Click on the blue cross to create a new connection.

3. From the drop-down list, select MySQL 5.

4. Enter the following details:


Connection Name	BIDB
Database Server	localhost
Database Port	3307
Use SSL	Automatic
Database Userid	root
Database Password	groot

5. Ping the server & Connect. 6. using SQL Commander, Query the database.

{% endtab %} {% endtabs %} {% endtab %} {% endtabs %} --- # Agent Instructions: Querying This Documentation If you need additional information that is not directly available in this page, you can query the documentation dynamically by asking a question. Perform an HTTP GET request on the current page URL with the `ask` query parameter: ``` GET https://academy.pentaho.com/pentaho-data-catalog-en/setup/components/metadata-store.md?ask= ``` The question should be specific, self-contained, and written in natural language. The response will contain a direct answer to the question and relevant excerpts and sources from the documentation. Use this mechanism when the answer is not explicitly present in the current page, you need clarification or additional context, or you want to retrieve related documentation sections.