# Glossaries, Terms & Tags

{% hint style="info" %}

### The Fundamental Challenge

**Business Terms: The Foundation of Data Understanding**

Imagine walking into a massive library where every book is written in a different language, with no translations, no common indexing system, and where the same business concept - "customer," "revenue," or "conversion rate" - has 50 different names and definitions depending on who wrote about it. Marketing calls it "lead conversion," Sales tracks "deal closure rate," and Finance measures "revenue recognition efficiency," but they're all referring to similar business processes. That's essentially what modern enterprise data environments look like without a standardized business glossary.

**Pentaho Data Catalog (PDC)** is designed to solve the enterprise data discovery and governance challenge, but without business glossaries that define terms like "Annual Recurring Revenue," "Customer Lifetime Value," or "Net Promoter Score," it's like having a powerful search engine that can only search for cryptic technical codes (like "CUST\_LTV\_CALC\_FLD" or "REV\_REC\_AMT") rather than the business meanings that stakeholders actually understand and use in their daily decision-making.

Business terms serve as the critical bridge between technical data assets and business value, enabling everyone from C-suite executives to business analysts to speak the same data language when discussing key performance indicators, business rules, and strategic metrics.
{% endhint %}

Here's the problem:

<figure><img src="https://1051758685-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2Fw1qJj4OGmdcvowiklB9W%2Fuploads%2FX19JbkBgm18SAtxdvpIe%2Fimage.png?alt=media&#x26;token=90638468-2f7b-422b-acdc-e2b78f5246bc" alt=""><figcaption></figcaption></figure>

{% tabs %}
{% tab title="Glossary & Business Terms" %}
{% hint style="info" %}

#### Glossary & Business Terms

{% endhint %}

How a glossary and business terms solve this:

{% tabs %}
{% tab title="Translation Layer" %}
{% hint style="info" %}

#### Translation Layer

PDC's powerful search and discovery features are only useful if users can find what they're looking for. Glossaries provide the business vocabulary that makes technical assets discoverable by non-technical users.
{% endhint %}

**Without Glossary:**

* User searches for "client type" → No results
* User searches for "customer classification" → No results
* User gives up, emails IT, waits 3 days
* IT explains it's called CUST\_TYPE\_CD
* User forgets by next month, cycle repeats

**With Glossary:**

* All variations point to single business term: "Customer Type"
* Definition: "Classification of customer as Individual (I) or Store (S)"
* User finds it immediately, understands it, uses it correctly
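
The translation layer described above can be sketched as a synonym index: every variation a user might type resolves to one governed term. This is a minimal sketch, not PDC's search implementation. The `GLOSSARY` dictionary, the `synonyms` and `columns` fields, and the function names are assumptions made for illustration.

```python
# Hypothetical in-memory glossary; PDC keeps terms in its own catalog store.
GLOSSARY = {
    "Customer Type": {
        "definition": "Classification of customer as Individual (I) or Store (S)",
        "synonyms": ["client type", "customer classification"],
        "columns": ["CUST_TYPE_CD"],
    },
}


def build_index(glossary):
    """Map every lower-cased term name and synonym to its canonical term name."""
    index = {}
    for name, entry in glossary.items():
        index[name.lower()] = name
        for synonym in entry["synonyms"]:
            index[synonym.lower()] = name
    return index


def search(query, glossary=GLOSSARY):
    """Return (term name, definition, columns) for a query, or None if unknown."""
    name = build_index(glossary).get(query.strip().lower())
    if name is None:
        return None
    entry = glossary[name]
    return name, entry["definition"], entry["columns"]


print(search("client type"))
print(search("vendor tier"))  # no matching term, so None
```

Searching for "client type" or "customer classification" now lands on the same "Customer Type" entry and its technical column, instead of returning nothing.
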
  {% endtab %}

{% tab title="Semantic Understanding" %}
{% hint style="info" %}

#### The Problem:

Consider the term "Status" in AdventureWorks:

* Order Status (1=In Process, 2=Approved, 5=Shipped)
* Employee Status (A=Active, T=Terminated, L=Leave)
* Payment Status (P=Pending, C=Completed, F=Failed)
* Inventory Status (1=Available, 2=Reserved, 3=Backordered)

Without context, "Status = 1" is meaningless.
{% endhint %}

**How a Glossary provides context:**

```json
{
  "name": "Order Status",
  "definition": "Current state of customer order in fulfillment process",
  "valid_values": {
    "1": "In Process - Order received, not yet shipped",
    "2": "Approved - Payment verified, ready to ship",
    "3": "Backordered - Awaiting inventory",
    "4": "Rejected - Payment failed or cancelled",
    "5": "Shipped - Left warehouse",
    "6": "Cancelled - Customer cancelled"
  },
  "business_rules": "Orders move from 1→2→5 normally. Status 3 triggers reorder."
}
```
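
As a rough illustration of how such a definition gives raw codes their meaning, the sketch below decodes a status value and looks up the code behind a business label. The Python dictionary mirrors the `valid_values` idea from the JSON above with shortened labels. The helper names `describe` and `codes_for` are hypothetical and not part of PDC.

```python
# Shortened copy of the Order Status term above; labels trimmed for readability.
ORDER_STATUS = {
    "name": "Order Status",
    "valid_values": {
        "1": "In Process",
        "2": "Approved",
        "3": "Backordered",
        "4": "Rejected",
        "5": "Shipped",
        "6": "Cancelled",
    },
}


def describe(term, raw_value):
    """Turn a raw code into a readable string using the term's valid values."""
    label = term["valid_values"].get(str(raw_value))
    if label is None:
        return f"{term['name']} = {raw_value} (unknown code)"
    return f"{term['name']} = {raw_value} ({label})"


def codes_for(term, label):
    """Return the codes whose label matches, ignoring case."""
    return [
        code
        for code, text in term["valid_values"].items()
        if text.lower() == label.lower()
    ]


print(describe(ORDER_STATUS, 5))
print(codes_for(ORDER_STATUS, "shipped"))
```

A report filter can then be written against the label ("Shipped") and translated to the stored code, rather than hard-coding `Status = 5`.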

{% hint style="info" %}
PDC can now:

* Show users what values mean, not just what they are
* Enable accurate filtering (find all "Shipped" orders, not "Status=5")
* Prevent misinterpretation in reports
* Apply appropriate data quality rules per status type
  {% endhint %}
  {% endtab %}

{% tab title="Lineage & Impact Analysis" %}
{% hint style="info" %}

#### The Scenario:

The finance team wants to change how "Revenue" is calculated.
{% endhint %}

**Without Glossary:**

* Which of these 47 fields containing "revenue," "sales," or "amount" are affected?
* What reports will break?
* Which dashboards need updating?
* Who needs to be notified?

**With Glossary:**

The glossary term "Total Revenue" shows:

* **Calculation:** SubTotal + Tax + Freight
* **Source fields:** Sales.SalesOrderHeader.SubTotal, TaxAmt, Freight
* **Used in:** 23 reports, 8 dashboards, 3 ML models
* **Stakeholders:** Finance, Sales, Operations teams
* **Downstream impacts:** Commission calculations, quarterly forecasts
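
One way to picture the impact analysis is a walk over "used by" relationships starting at the term being changed. The sketch below assumes those relationships are available as a simple adjacency map. The `USED_BY` structure and the "Sales payroll report" node are hypothetical, added only to show a second hop, and this is not how PDC stores lineage internally.

```python
from collections import deque

# Hypothetical dependency edges: each key is used by every item in its list.
USED_BY = {
    "Sales.SalesOrderHeader.SubTotal": ["Total Revenue"],
    "Sales.SalesOrderHeader.TaxAmt": ["Total Revenue"],
    "Sales.SalesOrderHeader.Freight": ["Total Revenue"],
    "Total Revenue": ["Commission calculations", "Quarterly forecasts"],
    "Commission calculations": ["Sales payroll report"],  # illustrative only
}


def downstream(node, edges=USED_BY):
    """Breadth-first walk: everything that depends on node, in discovery order."""
    seen = {node}
    order = []
    queue = deque([node])
    while queue:
        current = queue.popleft()
        for consumer in edges.get(current, []):
            if consumer not in seen:
                seen.add(consumer)
                order.append(consumer)
                queue.append(consumer)
    return order


print(downstream("Total Revenue"))
print(downstream("Sales.SalesOrderHeader.TaxAmt"))
```

Starting from a source column instead of the term answers the reverse question: which terms and reports are touched if that column changes.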

{% hint style="info" %}
PDC uses glossary relationships to:

* Track data lineage from source to consumption
* Perform impact analysis before changes
* Identify all stakeholders affected by modifications
* Ensure changes don't break critical processes
  {% endhint %}
  {% endtab %}

{% tab title="Compliance & Governance" %}
{% hint style="info" %}

#### The Challenge:

A GDPR auditor asks: "Show me all personal data you collect about EU citizens and prove it's protected."
{% endhint %}

**Without Glossary:**

* Manual database scan (500+ tables)
* Guessing which fields contain PII
* No documentation of protection measures
* Weeks of preparation for audit
* High risk of missing critical data

**With Compliance Glossary:**

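```sql
-- Illustrative query: assumes glossary terms are exposed in a table named
-- glossary_terms with a free-text tags column (not a documented PDC schema).
SELECT *
FROM glossary_terms
WHERE LOWER(tags) LIKE '%pii%'
  AND LOWER(tags) LIKE '%gdpr%';
```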

Results instantly show:

* All 47 PII fields
* Protection status (encrypted, masked, etc.)
* Retention periods
* Legal basis for processing
* Access controls in place
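
The same tag filter can be sketched outside SQL. The version below is a minimal sketch over a hypothetical glossary export in which each term carries a set of tags. The sample terms, the `protection` field, and the function name are assumptions for illustration.

```python
# Hypothetical glossary export; names, tags, and fields are illustrative only.
TERMS = [
    {"name": "Email address", "tags": {"PII", "GDPR", "maskable"}, "protection": "encrypted"},
    {"name": "Order Status", "tags": {"sales"}, "protection": "none"},
    {"name": "Marketing Segment", "tags": {"non-pii", "GDPR"}, "protection": "none"},
]


def with_all_tags(terms, required):
    """Return names of terms carrying every required tag (case-insensitive, exact match)."""
    wanted = {tag.lower() for tag in required}
    return [
        term["name"]
        for term in terms
        if wanted <= {tag.lower() for tag in term["tags"]}
    ]


print(with_all_tags(TERMS, ["pii", "gdpr"]))
```

Matching whole tags rather than substrings is a deliberate choice here: a pattern such as `'%pii%'` would also match a tag like `non-pii`, which is the opposite of what an auditor wants to see.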

{% hint style="info" %}
PDC becomes your compliance command center:

* Automatically identifies sensitive data
* Enforces governance policies
* Generates audit reports
* Tracks consent and retention
* Proves regulatory compliance
  {% endhint %}
  {% endtab %}

{% tab title="Data Quality" %}
{% hint style="info" %}

#### The Data Quality Challenge:

How do you know if "Customer Type = X" is an error?
{% endhint %}

**Glossary enables Data Quality Rules:**

```json
{
  "name": "Customer Type",
  "valid_values": ["I", "S"],
  "quality_rules": {
    "completeness": "NOT NULL",
    "validity": "IN ('I','S')",
    "consistency": "If Store_Name EXISTS, then Customer_Type = 'S'"
  }
}
```
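
To make the three rules concrete, here is a minimal sketch that applies them by hand to individual records. It assumes each record is a plain dictionary with `Customer_Type` and `Store_Name` keys. The function name and the message wording are hypothetical, and a real deployment would rely on PDC's own rule engine rather than this code.

```python
VALID_CUSTOMER_TYPES = {"I", "S"}  # from the term's valid_values


def check_customer(row):
    """Return a list of rule violations for one customer record (a dict)."""
    problems = []
    ctype = row.get("Customer_Type")

    # Completeness: the value must be present.
    if ctype is None:
        problems.append("completeness: Customer_Type is NULL")
    # Validity: the value must be one of the allowed codes.
    elif ctype not in VALID_CUSTOMER_TYPES:
        problems.append(f"validity: {ctype!r} not in {sorted(VALID_CUSTOMER_TYPES)}")

    # Consistency: a record with a store name must be typed as a store.
    if row.get("Store_Name") and ctype != "S":
        problems.append("consistency: Store_Name present but Customer_Type is not 'S'")
    return problems


print(check_customer({"Customer_Type": "X"}))
print(check_customer({"Customer_Type": "I", "Store_Name": "Bike World"}))
```

This answers the question posed above: "Customer Type = X" is an error because the glossary term lists only `I` and `S` as valid values.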

{% hint style="info" %}
PDC uses glossary definitions to:

* Automatically generate data quality rules
* Identify invalid values
* Flag inconsistencies
* Track quality metrics over time
* Prevent bad data from entering systems
  {% endhint %}
  {% endtab %}
  {% endtabs %}
  {% endtab %}

{% tab title="Tags" %}
{% hint style="info" %}

#### Tags

When building a business glossary in your data catalog, one of the most common questions is: "Should my tags match my business terms?"

The short answer is no—and understanding why will save you from a maintenance nightmare.

Think of tags and business terms like two different tools in your toolkit. Business terms are your formal, governed vocabulary with official definitions. They require approval, have clear ownership, and represent the "official" way your organization talks about data. They're the dictionary.

Tags, on the other hand, are lightweight labels that help people find things quickly. They're informal, flexible, and can be created on the fly. Think of them as sticky notes rather than encyclopedia entries.

The tags provide **quick discovery**, while the business term provides **authoritative meaning**.
{% endhint %}

**Good Pattern:**

<table><thead><tr><th width="217">Action</th><th>Value</th></tr></thead><tbody><tr><td>Assign Table Tag:</td><td>Contains_Personal_Data (inheritance)</td></tr><tr><td>Assign Business Term:</td><td>"Personal Identifier" (formal definition)</td></tr><tr><td>Assign Tags:</td><td>PII, GDPR_Personal_Data, Sensitive (quick filters)</td></tr></tbody></table>

{% hint style="info" %}
This pattern uses business terms for formal definitions and tags for practical classification. For your GDPR compliance glossary, this might look like assigning the business term "Personal Data - Contact Information" (with a formal definition linking to GDPR Article 4) while using tags like `PII`, `GDPR_Personal_Data`, and `contact_info`.

The business term tells users what the data officially means. The tags help them find it through multiple paths—by sensitivity level, regulation, or functional category.
{% endhint %}

***

{% hint style="info" %}

#### Tagging Strategy

The tagging strategy creates a **flexible classification system** that works alongside your formal business glossary. While your glossary terms define *what something is* ("Social Security Number"), tags describe *what it does, requires, or relates to* (`PII`, `encryption-required`, `reg-hipaa`, `risk-critical`).

The strategy organizes tags into controlled categories—regulatory, security, lifecycle, functional, and technical—each serving a specific governance purpose. This multi-dimensional approach allows one piece of data to be discovered and managed through multiple lenses simultaneously.
{% endhint %}

| Tag Type         | Owner              | Approval Required | Can Users Add? |
| ---------------- | ------------------ | ----------------- | -------------- |
| Regulatory Tags  | Compliance Team    | Yes               | No             |
| Sensitivity Tags | Security Team      | Yes               | No             |
| Domain Tags      | Data Stewards      | Yes               | No             |
| Functional Tags  | Data Owners        | No                | Yes (curated)  |
| Technical Tags   | Data Engineers     | No                | Yes            |
| Discovery Tags   | Data Catalog Admin | No                | Yes            |
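
The governance table can also be read as a small policy lookup. The sketch below transcribes the table and adds one possible decision rule. The rule itself (owners go through approval where required, other users are denied where "Can Users Add?" is "No") is an assumption about how the columns combine, and the function name and return values are hypothetical.

```python
# Table transcribed from the text; the decision logic below is an assumption.
TAG_POLICY = {
    "regulatory": {"owner": "Compliance Team", "approval": True, "users_can_add": False},
    "sensitivity": {"owner": "Security Team", "approval": True, "users_can_add": False},
    "domain": {"owner": "Data Stewards", "approval": True, "users_can_add": False},
    "functional": {"owner": "Data Owners", "approval": False, "users_can_add": True},
    "technical": {"owner": "Data Engineers", "approval": False, "users_can_add": True},
    "discovery": {"owner": "Data Catalog Admin", "approval": False, "users_can_add": True},
}


def decide(tag_type, requester_team):
    """Return 'denied', 'needs-approval', or 'allowed' for a request to add a tag."""
    policy = TAG_POLICY[tag_type]
    is_owner = requester_team == policy["owner"]
    if not is_owner and not policy["users_can_add"]:
        return "denied"
    if policy["approval"]:
        return "needs-approval"
    return "allowed"


print(decide("regulatory", "Marketing"))
print(decide("regulatory", "Compliance Team"))
print(decide("technical", "Marketing"))
```

The "curated" qualifier on functional tags is not modeled here. In practice it would likely mean users pick from a pre-approved list rather than typing free text.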

{% hint style="info" %}

#### Tagging Strategy in Practice

**Example - SSN**

Consider how "SSN" gets enriched with tags. It inherits `PII`, `personal-data`, and `compliance` from its parent category, then receives specific tags such as `sensitive-PII`, `government-ID`, `reg-hipaa`, `risk-critical`, `encryption-required`, `dlp-monitor`, `mfa-required`, and `retention-indefinite`. Each tag triggers specific actions: `encryption-required` tells your database team to enforce encryption; `dlp-monitor` activates data loss prevention scanning; `mfa-required` enforces access controls; and `retention-indefinite` prevents automated deletion. When an auditor asks "show me all data requiring HIPAA compliance," they filter by `reg-hipaa` and immediately see SSN, Date of Birth, and related audit trail terms, even though these live in different glossary categories.

**Why this matters**

This approach solves the **"one term, multiple perspectives"** problem. Your CFO cares that SSN is `domain-finance` for tax reporting. Your security team cares it's `risk-critical` and `encryption-required`. Your compliance team cares it's `reg-gdpr` and `reg-hipaa`. Your data engineers care it's `tech-tokenization-recommended` and `dlp-monitor`.

Without tags, you'd need to duplicate the SSN term across multiple glossaries or categories. With tags, one term serves all stakeholders, and each can filter, search, and automate based on their specific needs—making your data catalog truly operational, not just documentary.
{% endhint %}
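
The SSN example can be sketched as two small steps: merge the tags inherited from the parent category with the term's own tags, then look up which tags carry an action. This is a minimal sketch under stated assumptions. The category name "PII Data", the dictionary layout, and the function names are hypothetical, and the action strings paraphrase the example above.

```python
# Tags inherited from the parent category (category name is an assumption).
CATEGORY_TAGS = {"PII Data": {"PII", "personal-data", "compliance"}}

SSN = {
    "name": "SSN",
    "category": "PII Data",
    "tags": {
        "sensitive-PII", "government-ID", "reg-hipaa",
        "risk-critical", "encryption-required", "retention-indefinite",
    },
}

# Tag-to-action mapping paraphrased from the example.
TAG_ACTIONS = {
    "encryption-required": "enforce encryption",
    "dlp-monitor": "activate data loss prevention scanning",
    "mfa-required": "enforce access controls",
    "retention-indefinite": "prevent automated deletion",
}


def effective_tags(term):
    """Union of the tags inherited from the category and the term's own tags."""
    return CATEGORY_TAGS.get(term["category"], set()) | term["tags"]


def actions_for(term):
    """Sorted list of actions triggered by any of the term's effective tags."""
    return sorted(TAG_ACTIONS[tag] for tag in effective_tags(term) if tag in TAG_ACTIONS)


print(sorted(effective_tags(SSN)))
print(actions_for(SSN))
```

Because tags are resolved per term, an auditor's filter on `reg-hipaa` and a security team's filter on `risk-critical` both reach the same single SSN entry.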


{% tabs %}
{% tab title="Tag Matrix" %}
{% hint style="info" %}

#### Tag Matrix

{% endhint %}

<table data-full-width="true"><thead><tr><th width="151" valign="middle">Hierarchy Level</th><th width="164">Name</th><th width="123">Purpose</th><th width="238" valign="middle">AW Tags</th></tr></thead><tbody><tr><td valign="middle"><mark style="color:blue;">Domain</mark></td><td><mark style="color:blue;">AW Compliance &#x26; Data Governance</mark></td><td><mark style="color:blue;">Strategic</mark></td><td valign="middle"><p><mark style="color:blue;"><code>compliance</code></mark></p><p><mark style="color:blue;"><code>data-governance</code></mark></p><p><mark style="color:blue;"><code>enterprise-wide</code></mark></p><p><mark style="color:blue;"><code>regulatory</code></mark></p><p><mark style="color:blue;"><code>privacy</code></mark></p><p><mark style="color:blue;"><code>security</code></mark></p><p><mark style="color:blue;"><code>adventureworks</code></mark></p><p><mark style="color:blue;"><code>data-stewardship</code></mark></p></td></tr><tr><td valign="middle"><mark style="color:orange;">Category</mark></td><td><mark style="color:orange;">Audit &#x26; Monitoring</mark></td><td><mark style="color:orange;">Tactical / Functional</mark></td><td valign="middle"><p><mark style="color:orange;"><code>audit</code></mark></p><p><mark style="color:orange;"><code>monitoring</code></mark></p><p><mark style="color:orange;"><code>traceability</code></mark></p><p><mark style="color:orange;"><code>logging</code></mark></p><p><mark style="color:orange;"><code>compliance-tracking</code></mark></p><p><mark style="color:orange;"><code>security-monitoring</code></mark></p><p><mark style="color:orange;"><code>data-lineage</code></mark></p></td></tr><tr><td valign="middle">Term</td><td>Access Logs</td><td>Operational</td><td valign="middle"><p><code>access-log</code> </p><p><code>security-monitoring</code></p><p><code>user-activity</code> </p><p><code>authentication</code> </p><p><code>authorization</code> </p><p><code>SIEM-integration</code></p></td></tr><tr><td 
valign="middle">Term</td><td>Audit Trail</td><td>Operational</td><td valign="middle"><p><code>audit-trail</code> <code>compliance-log</code> </p><p><code>immutable</code> <code>forensic-evidence</code></p><p><code>regulatory-requirement</code></p><p><code>retention-7-years</code></p></td></tr><tr><td valign="middle">Term</td><td>Change Log</td><td>Operational</td><td valign="middle"><p><code>change-log</code> </p><p><code>version-control</code> </p><p><code>data-modifications</code> </p><p><code>accountability</code> </p><p><code>compliance-evidence</code> </p><p><code>SOX-relevant</code></p></td></tr><tr><td valign="middle">Term</td><td>Data Lineage</td><td>Operational</td><td valign="middle"><p><code>data-lineage</code></p><p><code>data-flow</code></p><p><code>impact-analysis</code> </p><p><code>transformation-tracking</code> </p><p><code>source-to-target</code> </p><p><code>GDPR-Article-30</code></p></td></tr><tr><td valign="middle"><mark style="color:orange;">Category</mark></td><td><mark style="color:orange;">Data Classification</mark></td><td><mark style="color:orange;">Tactical / Functional</mark></td><td valign="middle"><p><mark style="color:orange;"><code>data-classification</code></mark></p><p><mark style="color:orange;"><code>sensitivity-levels</code></mark></p><p><mark style="color:orange;"><code>access-control</code></mark></p><p><mark style="color:orange;"><code>information-security</code></mark></p><p><mark style="color:orange;"><code>data-handling</code></mark></p></td></tr><tr><td valign="middle">Term</td><td>Confidential Data</td><td>Operational</td><td valign="middle"><p><code>confidential</code> </p><p><code>restricted-access</code></p><p><code>encryption-required</code> </p><p><code>need-to-know</code> </p><p><code>high-risk</code> </p><p><code>breach-notification</code></p></td></tr><tr><td valign="middle">Term</td><td>Internal Data</td><td>Operational</td><td valign="middle"><p><code>internal-use-only</code> 
</p><p><code>employee-access</code> </p><p><code>standard-encryption</code></p><p><code>no-external-sharing</code> </p><p><code>medium-risk</code></p></td></tr><tr><td valign="middle">Term</td><td>Public Data</td><td>Operational</td><td valign="middle"><p><code>public</code> </p><p><code>unrestricted</code> </p><p><code>no-encryption-required</code> <code>external-sharing-allowed</code></p><p><code>low-risk</code></p></td></tr><tr><td valign="middle">Term</td><td>Restricted Data</td><td>Operational</td><td valign="middle"><p><code>restricted</code> </p><p><code>highly-sensitive</code> <code>executive-approval-required</code> <code>strict-encryption</code> </p><p><code>audit-trail-required</code> </p><p><code>critical-risk</code></p></td></tr><tr><td valign="middle"><mark style="color:orange;">Category</mark></td><td><mark style="color:orange;">PII Data</mark></td><td><mark style="color:orange;">Tactical / Functional</mark></td><td valign="middle"><p><mark style="color:orange;"><code>PII</code></mark></p><p><mark style="color:orange;"><code>personal-data</code></mark></p><p><mark style="color:orange;"><code>privacy</code></mark></p><p><mark style="color:orange;"><code>data-subject-rights</code></mark></p><p><mark style="color:orange;"><code>sensitive-data</code></mark></p><p><mark style="color:orange;"><code>GDPR-relevant</code></mark></p><p><mark style="color:orange;"><code>protected-information</code></mark></p></td></tr><tr><td valign="middle">Term</td><td>Email address</td><td>Operational</td><td valign="middle"><p><code>PII</code> <code>contact-information</code> </p><p><code>identifier</code> </p><p><code>GDPR-Article-4</code> </p><p><code>maskable</code> </p><p><code>encrypted-at-rest</code> </p><p><code>retention-policy</code></p></td></tr><tr><td valign="middle"><mark style="color:orange;">Category</mark></td><td><mark style="color:orange;">Regulatory Compliance</mark></td><td><mark style="color:orange;">Tactical / Functional</mark></td><td 
valign="middle"><p><mark style="color:orange;"><code>regulatory</code></mark></p><p><mark style="color:orange;"><code>compliance-frameworks</code></mark></p><p><mark style="color:orange;"><code>legal-requirements</code></mark></p><p><mark style="color:orange;"><code>industry-standards</code></mark></p><p><mark style="color:orange;"><code>mandatory-compliance</code></mark></p><p><mark style="color:orange;"><code>audit-requirements</code></mark></p></td></tr><tr><td valign="middle">Term</td><td>GDPR</td><td>Operational</td><td valign="middle"><p><code>GDPR</code> </p><p><code>EU-regulation</code> </p><p><code>data-protection</code> </p><p><code>privacy-law</code> </p><p><code>right-to-erasure</code> </p><p><code>data-portability</code> </p><p><code>consent-required</code></p></td></tr><tr><td valign="middle">Term</td><td>SOX</td><td>Operational</td><td valign="middle"><p><code>SOX</code> </p><p><code>financial-compliance</code></p><p><code>US-regulation</code></p><p><code>internal-controls</code></p><p><code>financial-reporting</code></p><p><code>audit-required</code></p></td></tr><tr><td valign="middle">Term</td><td>PCI-DSS</td><td>Operational</td><td valign="middle"><p><code>PCI-DSS</code> </p><p><code>payment-security</code> </p><p><code>cardholder-data</code> </p><p><code>industry-standard</code> </p><p><code>encryption-required</code> </p><p><code>network-security</code></p></td></tr><tr><td valign="middle">Term</td><td>HIPAA</td><td>Operational</td><td valign="middle"><p><code>HIPAA</code> </p><p><code>healthcare</code> </p><p><code>PHI</code> </p><p><code>US-regulation</code> </p><p><code>medical-privacy</code> </p><p><code>security-rule</code> </p><p><code>privacy-rule</code></p></td></tr></tbody></table>
{% endtab %}
{% endtabs %}
{% endtab %}
{% endtabs %}
