Glossaries, Terms & Tags
In the beginning ..
The Fundamental Challenge
Business Terms: The Foundation of Data Understanding
Imagine walking into a massive library where every book is written in a different language, with no translations, no common indexing system, and where the same business concept - "customer," "revenue," or "conversion rate" - has 50 different names and definitions depending on who wrote about it. Marketing calls it "lead conversion," Sales tracks "deal closure rate," and Finance measures "revenue recognition efficiency," but they're all referring to similar business processes. That's essentially what modern enterprise data environments look like without a standardized business glossary.
Pentaho Data Catalog (PDC) is designed to solve the enterprise data discovery and governance challenge. Without business glossaries that define terms like "Annual Recurring Revenue," "Customer Lifetime Value," or "Net Promoter Score," however, it is like a powerful search engine that can only search for cryptic technical codes (such as "CUST_LTV_CALC_FLD" or "REV_REC_AMT") rather than the business meanings that stakeholders actually understand and use in their daily decision-making.
Business terms serve as the critical bridge between technical data assets and business value, enabling everyone from C-suite executives to business analysts to speak the same data language when discussing key performance indicators, business rules, and strategic metrics.
Here's the problem:

How a Glossary and Business Terms are the solution:
Translation Layer
PDC's powerful search and discovery features are only useful if users can find what they're looking for. Glossaries provide the business vocabulary that makes technical assets discoverable by non-technical users.
Without Glossary:
User searches for "client type" → No results
User searches for "customer classification" → No results
User gives up, emails IT, waits 3 days
IT explains it's called CUST_TYPE_CD
User forgets by next month, cycle repeats
With Glossary:
All variations point to single business term: "Customer Type"
Definition: "Classification of customer as Individual (I) or Store (S)"
User finds it immediately, understands it, uses it correctly
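The lookup behaviour described above can be sketched in a few lines of Python. This is an illustrative in-memory model only, not PDC's API: the `BusinessTerm` class, the `build_index` helper, and the synonym list are assumptions made for the example.

```python
from dataclasses import dataclass, field


@dataclass
class BusinessTerm:
    """A glossary entry linking business vocabulary to a physical column."""
    name: str
    definition: str
    physical_column: str
    synonyms: set[str] = field(default_factory=set)


def build_index(terms: list[BusinessTerm]) -> dict[str, BusinessTerm]:
    """Map every lowercase name and synonym to the term that governs it."""
    index: dict[str, BusinessTerm] = {}
    for term in terms:
        for label in {term.name, *term.synonyms}:
            index[label.lower()] = term
    return index


customer_type = BusinessTerm(
    name="Customer Type",
    definition="Classification of customer as Individual (I) or Store (S)",
    physical_column="CUST_TYPE_CD",
    synonyms={"client type", "customer classification"},  # assumed variations
)
index = build_index([customer_type])

for query in ("client type", "Customer Classification", "customer type"):
    hit = index.get(query.lower())
    print(query, "->", hit.physical_column if hit else "no results")
```

Every variation resolves to the same term, so the user never needs to know the column is called CUST_TYPE_CD.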
The Problem:
Consider the term "Status" in AdventureWorks:
Order Status (1=Processing, 2=Shipped, 3=Delivered)
Employee Status (A=Active, T=Terminated, L=Leave)
Payment Status (P=Pending, C=Completed, F=Failed)
Inventory Status (1=Available, 2=Reserved, 3=Backordered)
Without context, "Status = 1" is meaningless.
How a Glossary provides context:
PDC can now:
Show users what values mean, not just what they are
Enable accurate filtering (find all "Shipped" orders, not "Status=2")
Prevent misinterpretation in reports
Apply appropriate data quality rules per status type
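As a rough sketch of how a glossary attaches meaning to raw codes, the snippet below keeps one code set per status term, using the values from the AdventureWorks example above. The `decode` and `codes_for` helpers are hypothetical names for illustration and are not part of PDC.

```python
# Code sets taken from the "Status" example above; one mapping per glossary term.
STATUS_CODES = {
    "Order Status": {"1": "Processing", "2": "Shipped", "3": "Delivered"},
    "Employee Status": {"A": "Active", "T": "Terminated", "L": "Leave"},
    "Payment Status": {"P": "Pending", "C": "Completed", "F": "Failed"},
    "Inventory Status": {"1": "Available", "2": "Reserved", "3": "Backordered"},
}


def decode(term: str, code) -> str:
    """Translate a raw code into its business meaning for one glossary term."""
    return STATUS_CODES[term].get(str(code), "Unknown code")


def codes_for(term: str, meaning: str) -> list[str]:
    """Reverse lookup: which raw codes carry a given business meaning?"""
    return [code for code, label in STATUS_CODES[term].items() if label == meaning]


print(decode("Order Status", 1))             # Processing
print(decode("Inventory Status", 1))         # Available
print(codes_for("Order Status", "Shipped"))  # ['2']
```

The same raw value `1` decodes differently depending on which term governs the column, which is exactly the context a bare "Status = 1" lacks.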
Without Glossary:
Which of these 47 fields containing "revenue," "sales," or "amount" are affected?
What reports will break?
Which dashboards need updating?
Who needs to be notified?
With Glossary:
The glossary term "Total Revenue" shows:
Calculation: SubTotal + Tax + Freight
Source fields: Sales.SalesOrderHeader.SubTotal, TaxAmt, Freight
Used in: 23 reports, 8 dashboards, 3 ML models
Stakeholders: Finance, Sales, Operations teams
Downstream impacts: Commission calculations, quarterly forecasts
PDC uses glossary relationships to:
Track data lineage from source to consumption
Perform impact analysis before changes
Identify all stakeholders affected by modifications
Ensure changes don't break critical processes
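A minimal sketch of the "Total Revenue" term and a simple impact lookup follows. It assumes the glossary term is held as a plain dictionary; the structure, the function names, and the sample order values are illustrative and do not reflect how PDC stores lineage.

```python
from decimal import Decimal

# Glossary metadata for the term, using the details listed above.
TOTAL_REVENUE = {
    "name": "Total Revenue",
    "source_fields": [
        "Sales.SalesOrderHeader.SubTotal",
        "Sales.SalesOrderHeader.TaxAmt",
        "Sales.SalesOrderHeader.Freight",
    ],
    "stakeholders": ["Finance", "Sales", "Operations"],
    "downstream": ["Commission calculations", "Quarterly forecasts"],
}


def total_revenue(sub_total: Decimal, tax_amt: Decimal, freight: Decimal) -> Decimal:
    """Apply the glossary calculation: SubTotal + Tax + Freight."""
    return sub_total + tax_amt + freight


def impact_of_change(field_name: str, terms: list[dict]) -> list[dict]:
    """Return every glossary term that lists the changed field as a source."""
    return [t for t in terms if field_name in t["source_fields"]]


# Made-up order values, not real AdventureWorks data.
print(total_revenue(Decimal("100.00"), Decimal("8.00"), Decimal("2.50")))  # 110.50

for term in impact_of_change("Sales.SalesOrderHeader.TaxAmt", [TOTAL_REVENUE]):
    print(term["name"], "->", term["stakeholders"], term["downstream"])
```

Because the term records its source fields, a proposed change to `TaxAmt` can be traced to the affected stakeholders before anything is modified.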
The Challenge:
GDPR auditor asks: "Show me all personal data you collect about EU citizens and prove it's protected."
Without Glossary:
Manual database scan (500+ tables)
Guessing which fields contain PII
No documentation of protection measures
Weeks of preparation for audit
High risk of missing critical data
With Compliance Glossary:
Results instantly show:
All 47 PII fields
Protection status (encrypted, masked, etc.)
Retention periods
Legal basis for processing
Access controls in place
PDC becomes your compliance command center:
Automatically identifies sensitive data
Enforces governance policies
Generates audit reports
Tracks consent and retention
Proves regulatory compliance
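To make the audit scenario concrete, here is a small sketch of a PII inventory with a protection-gap check. The `FieldRecord` structure, the column names, and the retention and legal-basis values are all hypothetical; a real catalog would supply this metadata from its own glossary and scans.

```python
from dataclasses import dataclass


@dataclass
class FieldRecord:
    """Compliance metadata recorded for one column (illustrative fields only)."""
    column: str
    is_pii: bool
    protection: str   # "encrypted", "masked", or "" when nothing is documented
    retention: str
    legal_basis: str


def pii_inventory(records: list[FieldRecord]) -> list[FieldRecord]:
    """All fields flagged as personal data."""
    return [r for r in records if r.is_pii]


def protection_gaps(records: list[FieldRecord]) -> list[str]:
    """PII columns with no documented protection measure."""
    return [r.column for r in pii_inventory(records) if not r.protection]


# Hypothetical sample records.
records = [
    FieldRecord("customer.email", True, "masked", "3 years", "Contract"),
    FieldRecord("customer.phone", True, "", "3 years", "Consent"),
    FieldRecord("product.list_price", False, "", "n/a", "n/a"),
]

for r in pii_inventory(records):
    print(r.column, r.protection or "UNPROTECTED", r.retention, r.legal_basis)
print("Gaps:", protection_gaps(records))
```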
Glossary enables Data Quality Rules:
PDC uses glossary definitions to:
Automatically generate data quality rules
Identify invalid values
Flag inconsistencies
Track quality metrics over time
Prevent bad data from entering systems
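One way to picture rule generation is to derive an allowed-values check directly from a term's definition. The sketch below assumes each glossary term exposes its valid codes as a set; the dictionary, the helper names, and the sample column values are invented for illustration and are not PDC's rule engine.

```python
# Hypothetical glossary entries: each term lists the values its definition allows.
GLOSSARY_ALLOWED_VALUES = {
    "Customer Type": {"I", "S"},      # Individual or Store
    "Order Status": {"1", "2", "3"},  # Processing, Shipped, Delivered
}


def invalid_values(term: str, values: list) -> list:
    """Return the values that fall outside the term's allowed set."""
    allowed = GLOSSARY_ALLOWED_VALUES[term]
    return [v for v in values if str(v) not in allowed]


def validity_ratio(term: str, values: list) -> float:
    """Share of values that conform to the glossary definition (1.0 if empty)."""
    if not values:
        return 1.0
    return 1 - len(invalid_values(term, values)) / len(values)


sample = ["I", "S", "X", "I"]  # made-up column values; "X" is not a defined code
print(invalid_values("Customer Type", sample))  # ['X']
print(validity_ratio("Customer Type", sample))  # 0.75
```

Tracking `validity_ratio` per term over successive loads gives a simple quality metric tied to the business definition rather than to the column name.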
Tags
When building a business glossary in your data catalog, one of the most common questions is: "Should my tags match my business terms?"
The short answer is no—and understanding why will save you from a maintenance nightmare.
Think of tags and business terms like two different tools in your toolkit. Business terms are your formal, governed vocabulary with official definitions. They require approval, have clear ownership, and represent the "official" way your organization talks about data. They're the dictionary.
Tags, on the other hand, are lightweight labels that help people find things quickly. They're informal, flexible, and can be created on the fly. Think of them as sticky notes rather than encyclopedia entries.
The tags provide quick discovery, while the business term provides authoritative meaning.
Good Pattern:
Assign Table Tag:
Contains_Personal_Data (inheritance)
Assign Business Term:
"Personal Identifier" (formal definition)
Assign Tags:
PII, GDPR_Personal_Data, Sensitive (quick filters)
A better pattern uses business terms for formal definitions and tags for practical classification. For your GDPR compliance glossary, this might look like assigning the business term "Personal Data - Contact Information" (with a formal definition linking to GDPR Article 4) while using tags like PII, GDPR_Personal_Data, and contact_info.
The business term tells users what the data officially means. The tags help them find it through multiple paths—by sensitivity level, regulation, or functional category.
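The split between one governed business term and many lightweight tags can be modelled roughly as follows. The `Column` and `Table` classes, the table name, and the second column are assumptions for the example; only the tag and term names come from the pattern above.

```python
from dataclasses import dataclass, field


@dataclass
class Column:
    name: str
    business_term: str                           # formal, governed meaning
    tags: set[str] = field(default_factory=set)  # informal quick filters


@dataclass
class Table:
    name: str
    tags: set[str] = field(default_factory=set)
    columns: list[Column] = field(default_factory=list)

    def effective_tags(self, column: Column) -> set[str]:
        """Column tags plus anything inherited from the table."""
        return self.tags | column.tags

    def find(self, tag: str) -> list[str]:
        """Names of columns reachable through a given tag."""
        return [c.name for c in self.columns if tag in self.effective_tags(c)]


contacts = Table(
    name="customer_contact",  # hypothetical table
    tags={"Contains_Personal_Data"},
    columns=[
        Column("email", "Personal Identifier", {"PII", "GDPR_Personal_Data", "Sensitive"}),
        Column("preferred_language", "Contact Preference"),  # hypothetical column
    ],
)

print(contacts.find("PII"))                     # ['email']
print(contacts.find("Contains_Personal_Data"))  # ['email', 'preferred_language']
```

Each column carries exactly one business term but can be found through any number of tags, including those inherited from its table.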
Tagging Strategy
The tagging strategy creates a flexible classification system that works alongside your formal business glossary. While your glossary terms define what something is ("Social Security Number"), tags describe what it does, requires, or relates to (PII, encryption-required, reg-hipaa, risk-critical).
The strategy organizes tags into controlled categories—regulatory, security, lifecycle, functional, and technical—each serving a specific governance purpose. This multi-dimensional approach allows one piece of data to be discovered and managed through multiple lenses simultaneously.
Regulatory Tags | Compliance Team | Yes | No
Sensitivity Tags | Security Team | Yes | No
Domain Tags | Data Stewards | Yes | No
Functional Tags | Data Owners | No | Yes (curated)
Technical Tags | Data Engineers | No | Yes
Discovery Tags | Data Catalog Admin | No | Yes
Example - SSN
Consider how "SSN" gets enriched with tags: It inherits PII, personal-data, and compliance from its parent category, then receives specific tags like sensitive-PII, government-ID, reg-hipaa, risk-critical, encryption-required, and retention-indefinite. Each tag triggers specific actions: encryption-required tells your database team to enforce encryption; dlp-monitor activates data loss prevention scanning; mfa-required enforces access controls; and retention-indefinite prevents automated deletion. When an auditor asks "show me all data requiring HIPAA compliance," they filter by reg-hipaa and immediately see SSN, Date of Birth, and related audit trail terms—even though these live in different glossary categories.
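The SSN walkthrough can be sketched as tag inheritance plus a tag-to-action lookup. Everything here is an illustrative assumption: the category assignments, the tags given to "Date of Birth", "Audit Trail", and "Change Log", and the helper names are invented so the example runs, and none of it represents PDC behaviour.

```python
# Tags that every term in a category inherits (category placement is assumed).
CATEGORY_TAGS = {
    "PII Data": {"PII", "personal-data", "compliance"},
    "Audit & Monitoring": {"audit", "monitoring"},
}

# Term-specific tags; only the SSN tags come from the example above.
TERMS = {
    "SSN": {"category": "PII Data", "tags": {
        "sensitive-PII", "government-ID", "reg-hipaa", "risk-critical",
        "encryption-required", "retention-indefinite", "dlp-monitor", "mfa-required"}},
    "Date of Birth": {"category": "PII Data", "tags": {"reg-hipaa"}},
    "Audit Trail": {"category": "Audit & Monitoring", "tags": {"reg-hipaa"}},
    "Change Log": {"category": "Audit & Monitoring", "tags": {"SOX-relevant"}},
}

# Operational action each tag is meant to trigger.
TAG_ACTIONS = {
    "encryption-required": "enforce encryption",
    "dlp-monitor": "activate data loss prevention scanning",
    "mfa-required": "enforce access controls",
    "retention-indefinite": "exclude from automated deletion",
}


def effective_tags(term: str) -> set[str]:
    """Inherited category tags combined with the term's own tags."""
    entry = TERMS[term]
    return CATEGORY_TAGS[entry["category"]] | entry["tags"]


def terms_with_tag(tag: str) -> list[str]:
    """All terms carrying a tag, regardless of glossary category."""
    return sorted(name for name in TERMS if tag in effective_tags(name))


def actions_for(term: str) -> list[str]:
    """Actions triggered by the term's tags, ordered by tag name."""
    return [TAG_ACTIONS[t] for t in sorted(effective_tags(term)) if t in TAG_ACTIONS]


print(terms_with_tag("reg-hipaa"))  # ['Audit Trail', 'Date of Birth', 'SSN']
print(actions_for("SSN"))
```

Filtering by `reg-hipaa` returns terms from two different categories in one pass, which is the cross-category view the auditor question calls for.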
Why this matters
This approach solves the "one term, multiple perspectives" problem. Your CFO cares that SSN is domain-finance for tax reporting. Your security team cares it's risk-critical and encryption-required. Your compliance team cares it's reg-gdpr and reg-hipaa. Your data engineers care it's tech-tokenization-recommended and dlp-monitor.
Without tags, you'd need to duplicate the SSN term across multiple glossaries or categories. With tags, one term serves all stakeholders, and each can filter, search, and automate based on their specific needs—making your data catalog truly operational, not just documentary.
Domain | AW Compliance & Data Governance | Strategic | compliance, data-governance, enterprise-wide, regulatory, privacy, security, adventureworks, data-stewardship
Category | Audit & Monitoring | Tactical / Functional | audit, monitoring, traceability, logging, compliance-tracking, security-monitoring, data-lineage
Term | Access Logs | Operational | access-log, security-monitoring, user-activity, authentication, authorization, SIEM-integration
Term | Audit Trail | Operational | audit-trail, compliance-log, immutable, forensic-evidence, regulatory-requirement, retention-7-years
Term | Change Log | Operational | change-log, version-control, data-modifications, accountability, compliance-evidence, SOX-relevant
Term | Data Lineage | Operational | data-lineage, data-flow, impact-analysis, transformation-tracking, source-to-target, GDPR-Article-30
Category | Data Classification | Tactical / Functional | data-classification, sensitivity-levels, access-control, information-security, data-handling
Term | Confidential Data | Operational | confidential, restricted-access, encryption-required, need-to-know, high-risk, breach-notification
Term | Internal Data | Operational | internal-use-only, employee-access, standard-encryption, no-external-sharing, medium-risk
Term | Public Data | Operational | public, unrestricted, no-encryption-required, external-sharing-allowed, low-risk
Term | Restricted Data | Operational | restricted, highly-sensitive, executive-approval-required, strict-encryption, audit-trail-required, critical-risk
Category | PII Data | Tactical / Functional | PII, personal-data, privacy, data-subject-rights, sensitive-data, GDPR-relevant, protected-information
Term | Email address | Operational | PII, contact-information, identifier, GDPR-Article-4, maskable, encrypted-at-rest, retention-policy
Category | Regulatory Compliance | Tactical / Functional | regulatory, compliance-frameworks, legal-requirements, industry-standards, mandatory-compliance, audit-requirements
Term | GDPR | Operational | GDPR, EU-regulation, data-protection, privacy-law, right-to-erasure, data-portability, consent-required
Term | SOX | Operational | SOX, financial-compliance, US-regulation, internal-controls, financial-reporting, audit-required
Term | PCI-DSS | Operational | PCI-DSS, payment-security, cardholder-data, industry-standard, encryption-required, network-security
Term | HIPAA | Operational | HIPAA, healthcare, PHI, US-regulation, medical-privacy, security-rule, privacy-rule
