Business Rules

Data Quality Rules ..

Business Rules

Business rules serve as critical data quality controls within a data catalog by defining and enforcing standards for how data should be structured, validated, and maintained across an organization. These rules act as executable policies that automatically check data against predefined criteria—such as format requirements, value ranges, referential integrity, and business logic constraints—flagging or preventing quality issues before they propagate through downstream systems.

By embedding business rules directly into the data catalog, organizations create a centralized governance framework that ensures consistency, accuracy, and compliance while providing data stewards and users with clear visibility into quality thresholds and validation status. This proactive approach transforms the data catalog from a passive documentation tool into an active quality management system that continuously monitors and maintains the trustworthiness of enterprise data assets.

The ROI of Business Rules in PDC

1. Proactive Detection vs. Reactive Response

Without Rules: Problems discovered weeks/months later during reporting or by customers
With Rules: Issues detected within hours of data entry
Value: Reduced cost of fixing errors by 10-100x (earlier detection = cheaper fixes)

2. Automation of Manual Checks

Without Rules: Data analysts manually query databases weekly to check quality
With Rules: Automated daily/hourly checks with instant alerts
Value: Free up 20-40 hours per analyst per month for strategic work

3. Trust and Confidence in Data

Without Rules: Business users question every report
With Rules: Certified data with quality scores builds trust
Value: Faster decision-making, reduced meeting time debating data accuracy

4. Compliance and Auditability

Without Rules: Difficult to prove data governance practices
With Rules: Documented quality standards with execution history
Value: Pass audits, avoid regulatory fines, demonstrate due diligence

5. Cross-Team Alignment

Without Rules: Each team has different quality standards
With Rules: Centralized, agreed-upon quality definitions
Value: Reduced conflicts, consistent metrics across organization

The Cost of Poor Data Quality

Organizations typically experience:

Financial Impact: 15-25% of revenue lost due to poor data quality (Gartner)
Operational Inefficiency: Teams spend 30-40% of time fixing data issues
Missed Opportunities: Incorrect customer data leads to lost sales and poor customer experience
Compliance Risks: Regulatory violations due to inaccurate reporting
Decision-Making Errors: Bad data leads to wrong strategic decisions

Data Quality Dimensions

Data quality is multi-faceted. A single "is it good?" question isn't enough. Each dimension addresses a different question:

Dimension

Question It Answers

Business Impact

Completeness

Is all required data present?

Missing data prevents processes from completing

Accuracy

Does data reflect reality correctly?

Wrong data leads to wrong decisions

Consistency

Is data uniform across systems?

Inconsistency causes confusion and errors

Validity

Does data conform to rules/formats?

Invalid data breaks systems and processes

Uniqueness

Are records properly distinct?

Duplicates inflate counts and waste resources

Timeliness

Is data current enough for use?

Stale data makes decisions irrelevant

Conformity

Does data follow standards?

Non-standard data is hard to integrate

Single Table Target Selection

Pentaho Data Catalog has a key architectural constraint you must understand before creating rules:

You can only select ONE table or column as the target resource for a business rule.

This means:

When configuring the rule scope, you select a single table from the catalog
Quality scores and tags are applied ONLY to that selected table
The rule result appears in the catalog only for that one table/column

Your SQL query has NO limitations:

You can JOIN as many tables as needed
You can reference any tables in your database
Full SQL capabilities are available
Complex multi-table logic is supported

Practical Impact

Scenario: You want to validate that SalesOrderHeader.SubTotal equals the sum of SalesOrderDetail.LineTotal

What happens:

Your SQL Query:
- References: SalesOrderHeader AND SalesOrderDetail (2 tables)

Your Target Resource Selection in UI:
- Can only select: SalesOrderHeader OR SalesOrderDetail (1 table)
- NOT both

Result Applied:
- Quality tags, status, and scores only show on the selected table
- The other table won't show this rule's results in the catalog

Why This Matters

Catalog Visibility: Users browsing the catalog only see quality info on the target table
Tag Application: Only the target table gets tagged with quality status
Design Decisions: You must choose which table is "primary" for each rule
Workaround Planning: Multi-table scenarios require strategic rule design

Decision Framework: Choosing the Target Table

When your rule involves multiple tables, which one should you select as the target?

Decision Criteria

Question 1: Which table do users primarily interact with?

If users typically query SalesOrderHeader and rarely touch SalesOrderDetail
Select: SalesOrderHeader (the user-facing table)
Why: Quality visibility where users need it most

Question 2: Where does the data quality problem originate?

If calculation errors come from wrong SubTotal in header (not line items)
Select: SalesOrderHeader (the source of the problem)
Why: Tags and status flag the problematic table

Question 3: Which table is the "parent" in the relationship?

In SalesOrderHeader (1) → SalesOrderDetail (many) relationship
Select: SalesOrderHeader (the parent)
Why: Logical hierarchy - order owns its details

Question 4: Which table failure has bigger business impact?

If wrong subtotal in SalesOrderHeader affects invoicing (critical)
But wrong line item in SalesOrderDetail only affects reporting (less critical)
Select: SalesOrderHeader (higher business impact)
Why: Prioritize visibility for highest-risk data

Question 5: Which table is in the data lineage flow first?

If data flows: Source → SalesOrderHeader → SalesOrderDetail → Warehouse
Select: SalesOrderHeader (earlier in flow)
Why: Catch problems upstream before they propagate

Decision Matrix

This matrix shows real-world scenarios where your SQL query references multiple tables, but you can only select ONE as the target in Data Catalog.

Scenario

Target Table Selection

Rationale

Order subtotal = sum of line items

SalesOrderHeader

Parent table, user-facing, higher impact

Customer has valid address

Sales.Customer

Parent table, users browse customers not addresses

Product price consistency over time

Production.Product

Parent table, users search products not price history

Employee in valid department

HumanResources.Employee

Parent table, employee record is what users view

Inventory quantity valid

Production.ProductInventory

This is the table users query for stock levels

Business Problem

You need to validate that the SubTotal field in the order header actually equals the sum of all the line item prices for that order.

Tables Involved:

Sales.SalesOrderHeader (has the SubTotal field)
Sales.SalesOrderDetail (has individual line item prices)

SQL Query:

SELECT COUNT(*) AS total_count, ...
FROM Sales.SalesOrderHeader h
LEFT JOIN (
    SELECT SalesOrderID, SUM(LineTotal) AS Total
    FROM Sales.SalesOrderDetail
    GROUP BY SalesOrderID
) d ON h.SalesOrderID = d.SalesOrderID

The Choice:

✅ Select: SalesOrderHeader as target

❌ Don't Select: SalesOrderDetail as target

Why SalesOrderHeader?

Parent table - It's the "one" in a one-to-many relationship (one order has many line items)
User-facing - When someone searches for "order 12345", they look at the header, not individual line items
Higher business impact - The SubTotal is what appears on the invoice sent to customers
Field location - The field you're validating (SubTotal) is IN the SalesOrderHeader table

What Happens in PDC:

When the rule runs and finds problems:

✅ Users browsing SalesOrderHeader table see: "⚠️ Warning: Calculation accuracy 95%"
✅ Quality tags appear on SalesOrderHeader: "calculation_error", "needs_review"
❌ Users browsing SalesOrderDetail table see: Nothing (no quality indicator)

Practical Example:

Imagine a business analyst searching the data catalog:

Searches for "sales orders"
Finds SalesOrderHeader table
Sees quality badge: "Last validated: Today, Status: WARNING"
Reads: "10 orders have incorrect subtotals"
Now knows to investigate before using this data

If you had selected SalesOrderDetail instead:

That same analyst looking at orders wouldn't see any quality warning
They'd use bad data without knowing there's a problem
The warning would only show on the detail table (which they're not browsing)

Business Problem

You need to verify that every customer has a valid shipping address on file.

Tables Involved:

Sales.Customer (the customer record)
Person.BusinessEntityAddress (links customers to addresses)
Person.Address (the actual address data)

SQL Query:

SELECT COUNT(*) AS total_count, ...
FROM Sales.Customer c
LEFT JOIN Person.BusinessEntityAddress ba ON c.PersonID = ba.BusinessEntityID
LEFT JOIN Person.Address a ON ba.AddressID = a.AddressID
WHERE ba.AddressTypeID = 3  -- Shipping address

Choice:

✅ Select: Sales.Customer as target

❌ Don't Select: Person.Address as target

Why Sales.Customer?

Parent table - Customer is the primary entity you're evaluating
Users browse customers - Marketing, sales, and customer service teams search for customers, not addresses
Business question - "Do our customers have valid data?" not "Are our addresses valid?"
Actionable insight - You need to know WHICH customers are missing addresses, not which addresses are missing customers

What Happens in PDC:

Scenario: Rule finds 50 customers without shipping addresses

If you selected Sales.Customer:

✅ Customer table shows: "Customer data quality: 95% complete"
✅ Marketing manager searches customers, sees quality issue
✅ Tag on Customer table: "incomplete_shipping_data"
✅ Manager knows to fix customer records before campaign

If you selected Person.Address instead:

❌ Address table shows quality status (but nobody browses address table directly)
❌ Marketing manager searches customers, sees NO warning
❌ Runs campaign, discovers 50 customers can't receive shipments
❌ Too late - campaign already sent

Real-World Impact:

A customer service rep is looking up customer "John Smith":

Opens customer record in the catalog
Sees quality indicator: "⚠️ No shipping address"
Contacts customer BEFORE processing order
Prevents delivery failure

Business Problem

You want to flag products where the price changed by more than 50% (possible data entry error).

Tables Involved:

Production.Product (current product info)
Production.ProductListPriceHistory (historical prices)

SQL Query:

SELECT COUNT(*) AS total_count, ...
FROM Production.ProductListPriceHistory h
WHERE ABS(h.ListPrice - LAG(h.ListPrice) OVER (...)) / LAG(h.ListPrice) OVER (...) > 0.5

Choice:

✅ Select: Production.Product as target

❌ Don't Select: Production.ProductListPriceHistory as target

Why Production.Product?

Parent table - Product is the entity being evaluated
Users search products - People look up "Product ABC", not "price history for product ABC"
Current relevance - Price issues affect the PRODUCT, not the historical record
Actionable - Product manager needs to know "this product has pricing issues", not "this historical record has issues"

What Happens in PDC:

Scenario: Product "Mountain Bike" price jumped from $500 to $5,000 (likely typo - should be $550)

If you selected Production.Product:

✅ Product manager searches "Mountain Bike"
✅ Sees: "⚠️ Recent price change flagged for review"
✅ Investigates, finds $5,000 is error, corrects to $550
✅ Fixes before any customer sees wrong price

If you selected Production.ProductListPriceHistory instead:

❌ Price history table shows quality issue (but who browses that?)
❌ Product manager looks at current product, sees no warning
❌ Wrong price goes live on website
❌ Customers see $5,000 price, don't buy
❌ Or worse: system uses $5,000 in quotes/invoices

Practical Scenario:

E-commerce team is updating the website:

Pulls product list from catalog
Sees quality badge on "Mountain Bike": "Price validation failed"
Holds off on publishing that product
Escalates to pricing team to verify
Prevents embarrassing/costly error

Business Problem

Verify every employee is assigned to a department that exists in the system.

Tables Involved:

HumanResources.Employee (employee records)
HumanResources.Department (department master list)

SQL Query:

SELECT COUNT(*) AS total_count, ...
FROM HumanResources.Employee e
LEFT JOIN HumanResources.Department d ON e.DepartmentID = d.DepartmentID
WHERE d.DepartmentID IS NULL  -- Orphaned employees

Choice:

✅ Select: HumanResources.Employee as target

❌ Don't Select: HumanResources.Department as target

Why HumanResources.Employee?

Dependent table - Employee depends on Department (has the foreign key)
Employee records are what users view - HR looks up employees, not departments
Problem location - The employee record has the broken reference
Remediation target - Need to fix employee records (reassign to valid dept)

What Happens in PDC:

Scenario: 3 employees are assigned to "Department 99" which no longer exists (was merged into another dept)

If you selected HumanResources.Employee:

✅ HR manager searches employee "Jane Doe"
✅ Sees: "⚠️ Invalid department assignment"
✅ Reassigns Jane to correct department
✅ Payroll processes correctly (tied to department)

If you selected HumanResources.Department instead:

❌ Department table shows...nothing useful (the department doesn't exist!)
❌ HR manager looks at employee, sees no warning
❌ Payroll runs, fails for those 3 employees
❌ Paychecks delayed

Business Impact:

Quarterly headcount report is generated:

Pulls from Employee table
Catalog shows quality issue: "3 employees with invalid dept"
Report team fixes data BEFORE generating report
Executive presentation has accurate numbers
No awkward "our numbers don't add up" moment

Business Problem

Ensure inventory quantities are within reasonable ranges (not negative, not impossibly high).

Tables Involved:

Production.ProductInventory (only this table!)

SQL Query:

SELECT COUNT(*) AS total_count, ...
FROM Production.ProductInventory
WHERE Quantity < 0 OR Quantity > 10000

Choice:

✅ Select: Production.ProductInventory as target (This is actually a single-table rule, so no alternative!)

Why ProductInventory?

Only table involved - Rule only queries this one table
Direct data source - Warehouse team queries this table for stock levels
Operational data - This is the "working" table that drives decisions

What Happens in PDC:

Scenario: Someone accidentally entered "1000000" instead of "100" for product quantity

If you selected ProductInventory:

✅ Warehouse manager checks stock levels
✅ Sees quality indicator: "⚠️ Invalid quantities detected"
✅ Reviews flagged records before planning reorders
✅ Corrects data before ordering product

PreviousCanvas & Collections NextCompleteness Rule

Last updated 2 months ago

Was this helpful?

Good evening

Business Rules

Business Rules

Data Quality Dimensions

Anatomy of a Business Rule

Single Table Target Selection

Practical Impact

Why This Matters

Decision Framework: Choosing the Target Table

Decision Criteria

Decision Matrix

Business Problem

Business Problem

Business Problem

Business Problem

Business Problem