Business Rules
Data Quality Rules ..
Business Rules
Business rules serve as critical data quality controls within a data catalog by defining and enforcing standards for how data should be structured, validated, and maintained across an organization. These rules act as executable policies that automatically check data against predefined criteria—such as format requirements, value ranges, referential integrity, and business logic constraints—flagging or preventing quality issues before they propagate through downstream systems.
By embedding business rules directly into the data catalog, organizations create a centralized governance framework that ensures consistency, accuracy, and compliance while providing data stewards and users with clear visibility into quality thresholds and validation status. This proactive approach transforms the data catalog from a passive documentation tool into an active quality management system that continuously monitors and maintains the trustworthiness of enterprise data assets.

The ROI of Business Rules in PDC
1. Proactive Detection vs. Reactive Response
Without Rules: Problems discovered weeks/months later during reporting or by customers
With Rules: Issues detected within hours of data entry
Value: Reduced cost of fixing errors by 10-100x (earlier detection = cheaper fixes)
2. Automation of Manual Checks
Without Rules: Data analysts manually query databases weekly to check quality
With Rules: Automated daily/hourly checks with instant alerts
Value: Free up 20-40 hours per analyst per month for strategic work
3. Trust and Confidence in Data
Without Rules: Business users question every report
With Rules: Certified data with quality scores builds trust

Value: Faster decision-making, reduced meeting time debating data accuracy
4. Compliance and Auditability
Without Rules: Difficult to prove data governance practices
With Rules: Documented quality standards with execution history
Value: Pass audits, avoid regulatory fines, demonstrate due diligence
5. Cross-Team Alignment
Without Rules: Each team has different quality standards
With Rules: Centralized, agreed-upon quality definitions
Value: Reduced conflicts, consistent metrics across organization
The Cost of Poor Data Quality
Organizations typically experience:
Financial Impact: 15-25% of revenue lost due to poor data quality (Gartner)
Operational Inefficiency: Teams spend 30-40% of time fixing data issues
Missed Opportunities: Incorrect customer data leads to lost sales and poor customer experience
Compliance Risks: Regulatory violations due to inaccurate reporting
Decision-Making Errors: Bad data leads to wrong strategic decisions
Data Quality Dimensions
Data quality is multi-faceted. A single "is it good?" question isn't enough. Each dimension addresses a different question:

Anatomy of a Business Rule
Metadata (Who, What, Why)
Name: Identifies the rule
Description: Explains purpose
Dimension: Categorizes the quality aspect
Tags: Enables searching and grouping

Scope (Where to Check)
Target table(s) and column(s)
SQL query defining what data to examine
Filters to narrow focus (e.g., only active customers)

Logic (How to Measure)
SQL query returning three critical numbers:
total_count: All records in scopescopeCount: Records that meet quality standardsnonCompliant: Records that fail quality standards

Thresholds (When to Act)
Percentage or count that triggers PASS/WARNING/FAIL
Different actions for different severity levels
Actions (What Happens Next)
Set status for visibility
Apply tags for filtering
Trigger webhooks for integration (JIRA ticket, email notifications, etc)
Single Table Target Selection
Pentaho Data Catalog has a key architectural constraint you must understand before creating rules:
You can only select ONE table or column as the target resource for a business rule.
This means:
When configuring the rule scope, you select a single table from the catalog
Quality scores and tags are applied ONLY to that selected table
The rule result appears in the catalog only for that one table/column
Your SQL query has NO limitations:
You can JOIN as many tables as needed
You can reference any tables in your database
Full SQL capabilities are available
Complex multi-table logic is supported
Practical Impact
Scenario: You want to validate that SalesOrderHeader.SubTotal equals the sum of SalesOrderDetail.LineTotal
What happens:
Your SQL Query:
- References: SalesOrderHeader AND SalesOrderDetail (2 tables)
Your Target Resource Selection in UI:
- Can only select: SalesOrderHeader OR SalesOrderDetail (1 table)
- NOT both
Result Applied:
- Quality tags, status, and scores only show on the selected table
- The other table won't show this rule's results in the catalogWhy This Matters
Catalog Visibility: Users browsing the catalog only see quality info on the target table
Tag Application: Only the target table gets tagged with quality status
Design Decisions: You must choose which table is "primary" for each rule
Workaround Planning: Multi-table scenarios require strategic rule design
Decision Framework: Choosing the Target Table
When your rule involves multiple tables, which one should you select as the target?
Decision Criteria
Question 1: Which table do users primarily interact with?
If users typically query
SalesOrderHeaderand rarely touchSalesOrderDetailSelect:
SalesOrderHeader(the user-facing table)Why: Quality visibility where users need it most
Question 2: Where does the data quality problem originate?
If calculation errors come from wrong
SubTotalin header (not line items)Select:
SalesOrderHeader(the source of the problem)Why: Tags and status flag the problematic table
Question 3: Which table is the "parent" in the relationship?
In
SalesOrderHeader(1) →SalesOrderDetail(many) relationshipSelect:
SalesOrderHeader(the parent)Why: Logical hierarchy - order owns its details
Question 4: Which table failure has bigger business impact?
If wrong subtotal in
SalesOrderHeaderaffects invoicing (critical)But wrong line item in
SalesOrderDetailonly affects reporting (less critical)Select:
SalesOrderHeader(higher business impact)Why: Prioritize visibility for highest-risk data
Question 5: Which table is in the data lineage flow first?
If data flows: Source →
SalesOrderHeader→SalesOrderDetail→ WarehouseSelect:
SalesOrderHeader(earlier in flow)Why: Catch problems upstream before they propagate
Decision Matrix
This matrix shows real-world scenarios where your SQL query references multiple tables, but you can only select ONE as the target in Data Catalog.
Order subtotal = sum of line items
SalesOrderHeader
Parent table, user-facing, higher impact
Customer has valid address
Sales.Customer
Parent table, users browse customers not addresses
Product price consistency over time
Production.Product
Parent table, users search products not price history
Employee in valid department
HumanResources.Employee
Parent table, employee record is what users view
Inventory quantity valid
Production.ProductInventory
This is the table users query for stock levels
Business Problem
You need to validate that the SubTotal field in the order header actually equals the sum of all the line item prices for that order.
Tables Involved:
Sales.SalesOrderHeader(has the SubTotal field)Sales.SalesOrderDetail(has individual line item prices)
SQL Query:
SELECT COUNT(*) AS total_count, ...
FROM Sales.SalesOrderHeader h
LEFT JOIN (
SELECT SalesOrderID, SUM(LineTotal) AS Total
FROM Sales.SalesOrderDetail
GROUP BY SalesOrderID
) d ON h.SalesOrderID = d.SalesOrderIDThe Choice:
✅ Select: SalesOrderHeader as target
❌ Don't Select: SalesOrderDetail as target
Why SalesOrderHeader?
Parent table - It's the "one" in a one-to-many relationship (one order has many line items)
User-facing - When someone searches for "order 12345", they look at the header, not individual line items
Higher business impact - The SubTotal is what appears on the invoice sent to customers
Field location - The field you're validating (SubTotal) is IN the SalesOrderHeader table
What Happens in PDC:
When the rule runs and finds problems:
✅ Users browsing
SalesOrderHeadertable see: "⚠️ Warning: Calculation accuracy 95%"✅ Quality tags appear on SalesOrderHeader: "calculation_error", "needs_review"
❌ Users browsing
SalesOrderDetailtable see: Nothing (no quality indicator)
Practical Example:
Imagine a business analyst searching the data catalog:
Searches for "sales orders"
Finds
SalesOrderHeadertableSees quality badge: "Last validated: Today, Status: WARNING"
Reads: "10 orders have incorrect subtotals"
Now knows to investigate before using this data
If you had selected SalesOrderDetail instead:
That same analyst looking at orders wouldn't see any quality warning
They'd use bad data without knowing there's a problem
The warning would only show on the detail table (which they're not browsing)
Tables Involved:
Sales.Customer(the customer record)Person.BusinessEntityAddress(links customers to addresses)Person.Address(the actual address data)
SQL Query:
SELECT COUNT(*) AS total_count, ...
FROM Sales.Customer c
LEFT JOIN Person.BusinessEntityAddress ba ON c.PersonID = ba.BusinessEntityID
LEFT JOIN Person.Address a ON ba.AddressID = a.AddressID
WHERE ba.AddressTypeID = 3 -- Shipping addressChoice:
✅ Select: Sales.Customer as target
❌ Don't Select: Person.Address as target
Why Sales.Customer?
Parent table - Customer is the primary entity you're evaluating
Users browse customers - Marketing, sales, and customer service teams search for customers, not addresses
Business question - "Do our customers have valid data?" not "Are our addresses valid?"
Actionable insight - You need to know WHICH customers are missing addresses, not which addresses are missing customers
What Happens in PDC:
Scenario: Rule finds 50 customers without shipping addresses
If you selected Sales.Customer:
✅ Customer table shows: "Customer data quality: 95% complete"
✅ Marketing manager searches customers, sees quality issue
✅ Tag on Customer table: "incomplete_shipping_data"
✅ Manager knows to fix customer records before campaign
If you selected Person.Address instead:
❌ Address table shows quality status (but nobody browses address table directly)
❌ Marketing manager searches customers, sees NO warning
❌ Runs campaign, discovers 50 customers can't receive shipments
❌ Too late - campaign already sent
Real-World Impact:
A customer service rep is looking up customer "John Smith":
Opens customer record in the catalog
Sees quality indicator: "⚠️ No shipping address"
Contacts customer BEFORE processing order
Prevents delivery failure
Business Problem
You want to flag products where the price changed by more than 50% (possible data entry error).
Tables Involved:
Production.Product(current product info)Production.ProductListPriceHistory(historical prices)
SQL Query:
SELECT COUNT(*) AS total_count, ...
FROM Production.ProductListPriceHistory h
WHERE ABS(h.ListPrice - LAG(h.ListPrice) OVER (...)) / LAG(h.ListPrice) OVER (...) > 0.5Choice:
✅ Select: Production.Product as target
❌ Don't Select: Production.ProductListPriceHistory as target
Why Production.Product?
Parent table - Product is the entity being evaluated
Users search products - People look up "Product ABC", not "price history for product ABC"
Current relevance - Price issues affect the PRODUCT, not the historical record
Actionable - Product manager needs to know "this product has pricing issues", not "this historical record has issues"
What Happens in PDC:
Scenario: Product "Mountain Bike" price jumped from $500 to $5,000 (likely typo - should be $550)
If you selected Production.Product:
✅ Product manager searches "Mountain Bike"
✅ Sees: "⚠️ Recent price change flagged for review"
✅ Investigates, finds $5,000 is error, corrects to $550
✅ Fixes before any customer sees wrong price
If you selected Production.ProductListPriceHistory instead:
❌ Price history table shows quality issue (but who browses that?)
❌ Product manager looks at current product, sees no warning
❌ Wrong price goes live on website
❌ Customers see $5,000 price, don't buy
❌ Or worse: system uses $5,000 in quotes/invoices
Practical Scenario:
E-commerce team is updating the website:
Pulls product list from catalog
Sees quality badge on "Mountain Bike": "Price validation failed"
Holds off on publishing that product
Escalates to pricing team to verify
Prevents embarrassing/costly error
Tables Involved:
HumanResources.Employee(employee records)HumanResources.Department(department master list)
SQL Query:
SELECT COUNT(*) AS total_count, ...
FROM HumanResources.Employee e
LEFT JOIN HumanResources.Department d ON e.DepartmentID = d.DepartmentID
WHERE d.DepartmentID IS NULL -- Orphaned employeesChoice:
✅ Select: HumanResources.Employee as target
❌ Don't Select: HumanResources.Department as target
Why HumanResources.Employee?
Dependent table - Employee depends on Department (has the foreign key)
Employee records are what users view - HR looks up employees, not departments
Problem location - The employee record has the broken reference
Remediation target - Need to fix employee records (reassign to valid dept)
What Happens in PDC:
Scenario: 3 employees are assigned to "Department 99" which no longer exists (was merged into another dept)
If you selected HumanResources.Employee:
✅ HR manager searches employee "Jane Doe"
✅ Sees: "⚠️ Invalid department assignment"
✅ Reassigns Jane to correct department
✅ Payroll processes correctly (tied to department)
If you selected HumanResources.Department instead:
❌ Department table shows...nothing useful (the department doesn't exist!)
❌ HR manager looks at employee, sees no warning
❌ Payroll runs, fails for those 3 employees
❌ Paychecks delayed
Business Impact:
Quarterly headcount report is generated:
Pulls from Employee table
Catalog shows quality issue: "3 employees with invalid dept"
Report team fixes data BEFORE generating report
Executive presentation has accurate numbers
No awkward "our numbers don't add up" moment
Business Problem
Ensure inventory quantities are within reasonable ranges (not negative, not impossibly high).
Tables Involved:
Production.ProductInventory(only this table!)
SQL Query:
SELECT COUNT(*) AS total_count, ...
FROM Production.ProductInventory
WHERE Quantity < 0 OR Quantity > 10000Choice:
✅ Select: Production.ProductInventory as target (This is actually a single-table rule, so no alternative!)
Why ProductInventory?
Only table involved - Rule only queries this one table
Direct data source - Warehouse team queries this table for stock levels
Operational data - This is the "working" table that drives decisions
What Happens in PDC:
Scenario: Someone accidentally entered "1000000" instead of "100" for product quantity
If you selected ProductInventory:
✅ Warehouse manager checks stock levels
✅ Sees quality indicator: "⚠️ Invalid quantities detected"
✅ Reviews flagged records before planning reorders
✅ Corrects data before ordering product
x
x
x
Last updated
Was this helpful?
