Glossaries, Terms & Tags
In the beginning ..
The Fundamental Challenge
Business Terms: The Foundation of Data Understanding
Imagine walking into a massive library where every book is written in a different language, with no translations, no common indexing system, and where the same business concept - "customer," "revenue," or "conversion rate" - has 50 different names and definitions depending on who wrote about it. Marketing calls it "lead conversion," Sales tracks "deal closure rate," and Finance measures "revenue recognition efficiency," but they're all referring to similar business processes. That's essentially what modern enterprise data environments look like without a standardized business glossary.
Pentaho Data Catalog (PDC) is designed to solve the enterprise data discovery and governance challenge, but without business glossaries that define terms like "Annual Recurring Revenue," "Customer Lifetime Value," or "Net Promoter Score," it's like having a powerful search engine that can only search for cryptic technical codes (like "CUST_LTV_CALC_FLD" or "REV_REC_AMT") rather than the business meanings that stakeholders actually understand and use in their daily decision-making.
Business terms serve as the critical bridge between technical data assets and business value, enabling everyone from C-suite executives to business analysts to speak the same data language when discussing key performance indicators, business rules, and strategic metrics.
Here's the problem:

How a Glossary and Business Terms solve it:
Translation Layer
PDC's powerful search and discovery features are only useful if users can find what they're looking for. Glossaries provide the business vocabulary that makes technical assets discoverable by non-technical users.
Without Glossary:
User searches for "client type" → No results
User searches for "customer classification" → No results
User gives up, emails IT, waits 3 days
IT explains it's called CUST_TYPE_CD
User forgets by next month, cycle repeats
With Glossary:
All variations point to a single business term: "Customer Type"
Definition: "Classification of customer as Individual (I) or Store (S)"
User finds it immediately, understands it, uses it correctly
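The translation layer described above can be sketched as a synonym index. This is a minimal illustration, not PDC's API or storage format: the `GLOSSARY` structure, the `synonyms` and `technical_name` keys, and the `find_term` helper are all hypothetical names chosen for the example.

```python
# Hypothetical in-memory glossary: one business term with its synonyms
# and the technical column it maps to. Not PDC's API or storage format.
GLOSSARY = {
    "Customer Type": {
        "definition": "Classification of customer as Individual (I) or Store (S)",
        "synonyms": ["client type", "customer classification"],
        "technical_name": "CUST_TYPE_CD",
    }
}


def build_index(glossary):
    """Map every lowercased term name and synonym to its canonical term."""
    index = {}
    for term, entry in glossary.items():
        index[term.lower()] = term
        for synonym in entry["synonyms"]:
            index[synonym.lower()] = term
    return index


def find_term(query, glossary=GLOSSARY):
    """Return (term, entry) for a search phrase, or None if nothing matches."""
    term = build_index(glossary).get(query.strip().lower())
    return (term, glossary[term]) if term else None


# Every phrasing a user might try resolves to the same business term.
for phrase in ("client type", "Customer Classification", "customer type"):
    term, entry = find_term(phrase)
    print(f"{phrase!r} -> {term} ({entry['technical_name']})")
```

Matching here is exact after lowercasing; a real search feature would likely add fuzzy or partial matching on top of the same term-to-synonym mapping.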
The Problem:
Consider the term "Status" in AdventureWorks:
Order Status (1=In Process, 2=Approved, 5=Shipped)
Employee Status (A=Active, T=Terminated, L=Leave)
Payment Status (P=Pending, C=Completed, F=Failed)
Inventory Status (1=Available, 2=Reserved, 3=Backordered)
Without context, "Status = 1" is meaningless.
How a Glossary provides context:
{
"name": "Order Status",
"definition": "Current state of customer order in fulfillment process",
"valid_values": {
"1": "In Process - Order received, not yet shipped",
"2": "Approved - Payment verified, ready to ship",
"3": "Backordered - Awaiting inventory",
"4": "Rejected - Payment failed or cancelled",
"5": "Shipped - Left warehouse",
"6": "Cancelled - Customer cancelled"
},
"business_rules": "Orders move from 1→2→5 normally. Status 3 triggers reorder."
}

Without Glossary:
Which of these 47 fields containing "revenue," "sales," or "amount" are affected?
What reports will break?
Which dashboards need updating?
Who needs to be notified?
With Glossary:
The glossary term "Total Revenue" shows:
Calculation: SubTotal + Tax + Freight
Source fields: Sales.SalesOrderHeader.SubTotal, TaxAmt, Freight
Used in: 23 reports, 8 dashboards, 3 ML models
Stakeholders: Finance, Sales, Operations teams
Downstream impacts: Commission calculations, quarterly forecasts
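The impact lookup above can be sketched as a reverse search from a changed field to the terms that depend on it. This is a rough illustration under stated assumptions: the `TERMS` list, its key names, and the `impact_of` helper are hypothetical, and only the values come from the "Total Revenue" example.

```python
# Hypothetical glossary record for "Total Revenue", using the values from
# the example above. The key names are assumptions, not a PDC schema.
TERMS = [
    {
        "name": "Total Revenue",
        "calculation": "SubTotal + Tax + Freight",
        "source_fields": [
            "Sales.SalesOrderHeader.SubTotal",
            "Sales.SalesOrderHeader.TaxAmt",
            "Sales.SalesOrderHeader.Freight",
        ],
        "used_in": {"reports": 23, "dashboards": 8, "ml_models": 3},
        "stakeholders": ["Finance", "Sales", "Operations"],
        "downstream": ["Commission calculations", "Quarterly forecasts"],
    }
]


def impact_of(field, terms=TERMS):
    """Return the glossary terms whose calculation reads the given field."""
    return [term for term in terms if field in term["source_fields"]]


# Ask what is affected if the TaxAmt column changes.
for term in impact_of("Sales.SalesOrderHeader.TaxAmt"):
    print(f"{term['name']}: notify {', '.join(term['stakeholders'])}")
    print(f"  used in: {term['used_in']}")
    print(f"  downstream: {', '.join(term['downstream'])}")
```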
The Challenge:
A GDPR auditor asks: "Show me all personal data you collect about EU citizens and prove it's protected."
Without Glossary:
Manual database scan (500+ tables)
Guessing which fields contain PII
No documentation of protection measures
Weeks of preparation for audit
High risk of missing critical data
With Compliance Glossary:
SELECT * FROM glossary_terms
WHERE tags LIKE '%pii%'
AND tags LIKE '%gdpr%'

Results instantly show:
All 47 PII fields
Protection status (encrypted, masked, etc.)
Retention periods
Legal basis for processing
Access controls in place
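To make the tag query concrete, here is a small sketch that runs it against an in-memory SQLite table. The `glossary_terms` table name and `tags` column follow the query above; the `name` and `protection` columns, the comma-separated tag format, and the three sample rows are assumptions made up for the example, not real catalog content.

```python
import sqlite3


def pii_gdpr_terms(conn):
    """Return (name, protection) for terms tagged with both pii and gdpr."""
    rows = conn.execute(
        "SELECT name, protection FROM glossary_terms "
        "WHERE tags LIKE '%pii%' AND tags LIKE '%gdpr%' "
        "ORDER BY name"
    )
    return rows.fetchall()


# Hypothetical schema and sample rows, for illustration only.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE glossary_terms (name TEXT, tags TEXT, protection TEXT)")
conn.executemany(
    "INSERT INTO glossary_terms VALUES (?, ?, ?)",
    [
        ("Email Address", "pii,gdpr", "masked"),
        ("Customer Type", "reference", "none"),
        ("Credit Card Number", "pii,pci", "encrypted"),
    ],
)

for name, protection in pii_gdpr_terms(conn):
    print(f"{name}: {protection}")
```

Substring matching with LIKE is simple but loose: a tag such as "non-pii" would also match '%pii%'. A separate term-to-tag table with exact matches would avoid that at the cost of a join.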
The Glossary enables Data Quality Rules:
{
"name": "Customer Type",
"valid_values": ["I", "S"],
"quality_rules": {
"completeness": "NOT NULL",
"validity": "IN ('I','S')",
"consistency": "If Store_Name EXISTS, then Customer_Type = 'S'"
}
}
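The three quality rules for "Customer Type" can be turned into executable checks. The sketch below hand-translates them into Python rather than parsing the rule strings; the `check_customer_type` helper and the row layout (a dict with `Customer_Type` and `Store_Name` keys) are assumptions for illustration.

```python
# Term definition taken from the example above.
TERM = {"name": "Customer Type", "valid_values": ["I", "S"]}


def check_customer_type(row, term=TERM):
    """Return the names of the quality rules that the row violates."""
    value = row.get("Customer_Type")
    failures = []
    if value is None:
        # completeness: NOT NULL
        failures.append("completeness")
    elif value not in term["valid_values"]:
        # validity: IN ('I','S')
        failures.append("validity")
    if row.get("Store_Name") and value != "S":
        # consistency: a row with a store name must be of type 'S'
        failures.append("consistency")
    return failures


# Hypothetical sample rows, one per outcome.
rows = [
    {"Customer_Type": "S", "Store_Name": "Example Store"},
    {"Customer_Type": "I", "Store_Name": "Example Store"},
    {"Customer_Type": "X", "Store_Name": None},
    {"Customer_Type": None, "Store_Name": None},
]
for row in rows:
    print(row, "->", check_customer_type(row) or "ok")
```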
