# Business Rules

{% hint style="info" %}

#### Business Rules

Business rules serve as critical data quality controls within a data catalog by defining and enforcing standards for how data should be structured, validated, and maintained across an organization. These rules act as executable policies that automatically check data against predefined criteria—such as format requirements, value ranges, referential integrity, and business logic constraints—flagging or preventing quality issues before they propagate through downstream systems.&#x20;

By embedding business rules directly into the data catalog, organizations create a centralized governance framework that ensures consistency, accuracy, and compliance while providing data stewards and users with clear visibility into quality thresholds and validation status. This proactive approach transforms the data catalog from a passive documentation tool into an active quality management system that continuously monitors and maintains the trustworthiness of enterprise data assets.
{% endhint %}

<figure><img src="https://1051758685-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2Fw1qJj4OGmdcvowiklB9W%2Fuploads%2F1k64f95otkg9oZeSthag%2Fimage.png?alt=media&#x26;token=9510c743-3cca-4f93-91b6-2decae7e33a7" alt=""><figcaption><p>Data Operations - Business Rule</p></figcaption></figure>

<details>

<summary>The ROI of Business Rules in PDC</summary>

**1. Proactive Detection vs. Reactive Response**

* **Without Rules:** Problems discovered weeks/months later during reporting or by customers
* **With Rules:** Issues detected within hours of data entry
* **Value:** Reduced cost of fixing errors by 10-100x (earlier detection = cheaper fixes)

**2. Automation of Manual Checks**

* **Without Rules:** Data analysts manually query databases weekly to check quality
* **With Rules:** Automated daily/hourly checks with instant alerts
* **Value:** Free up 20-40 hours per analyst per month for strategic work

**3. Trust and Confidence in Data**

* **Without Rules:** Business users question every report
* **With Rules:** Certified data with quality scores builds trust
*

```
<figure><img src="https://1051758685-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2Fw1qJj4OGmdcvowiklB9W%2Fuploads%2FqNskr6Fx7UXXb2yPMtS7%2Fimage.png?alt=media&#x26;token=0de6e026-b897-47fe-9e2f-80ce09b478a1" alt=""><figcaption></figcaption></figure>
```

* **Value:** Faster decision-making, reduced meeting time debating data accuracy

**4. Compliance and Auditability**

* **Without Rules:** Difficult to prove data governance practices
* **With Rules:** Documented quality standards with execution history
* **Value:** Pass audits, avoid regulatory fines, demonstrate due diligence

**5. Cross-Team Alignment**

* **Without Rules:** Each team has different quality standards
* **With Rules:** Centralized, agreed-upon quality definitions
* **Value:** Reduced conflicts, consistent metrics across organization

</details>

{% hint style="info" %}
**The Cost of Poor Data Quality**

Organizations typically experience:

* **Financial Impact:** 15-25% of revenue lost due to poor data quality (Gartner)
* **Operational Inefficiency:** Teams spend 30-40% of time fixing data issues
* **Missed Opportunities:** Incorrect customer data leads to lost sales and poor customer experience
* **Compliance Risks:** Regulatory violations due to inaccurate reporting
* **Decision-Making Errors:** Bad data leads to wrong strategic decisions
  {% endhint %}

***

{% tabs %}
{% tab title="7 Data Quality Dimensions" %}
{% hint style="info" %}

#### Data Quality Dimensions

Data quality is multi-faceted. A single "is it good?" question isn't enough. Each dimension addresses a different question:
{% endhint %}

<figure><img src="https://1051758685-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2Fw1qJj4OGmdcvowiklB9W%2Fuploads%2FOJGqo03wUISghxGzxlUV%2Fimage.png?alt=media&#x26;token=ba554729-f292-4d6f-b4ca-3df3cfb40a62" alt=""><figcaption><p>Data Quality Dimensions</p></figcaption></figure>

<table><thead><tr><th width="139">Dimension</th><th>Question It Answers</th><th>Business Impact</th></tr></thead><tbody><tr><td><a href="business-rules/completeness-rule">Completeness</a></td><td>Is all required data present?</td><td>Missing data prevents processes from completing</td></tr><tr><td><a href="business-rules/accuracy-rules">Accuracy</a></td><td>Does data reflect reality correctly?</td><td>Wrong data leads to wrong decisions</td></tr><tr><td><a href="business-rules/consistency-rules">Consistency</a></td><td>Is data uniform across systems?</td><td>Inconsistency causes confusion and errors</td></tr><tr><td><a href="business-rules/validity-rules">Validity</a></td><td>Does data conform to rules/formats?</td><td>Invalid data breaks systems and processes</td></tr><tr><td><a href="business-rules/uniqueness-rules">Uniqueness</a></td><td>Are records properly distinct?</td><td>Duplicates inflate counts and waste resources</td></tr><tr><td><a href="business-rules/timeliness-and-conformity-rules">Timeliness</a></td><td>Is data current enough for use?</td><td>Stale data makes decisions irrelevant</td></tr><tr><td><a href="business-rules/timeliness-and-conformity-rules">Conformity</a></td><td>Does data follow standards?</td><td>Non-standard data is hard to integrate</td></tr></tbody></table>
{% endtab %}

{% tab title="Anatomy of a Business Rule" %}
{% hint style="info" %}

#### Anatomy of a Business Rule

**Metadata** (Who, What, Why)

* Name: Identifies the rule
* Description: Explains purpose
* Dimension: Categorizes the quality aspect
* Tags: Enables searching and grouping
  {% endhint %}

<figure><img src="https://1051758685-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2Fw1qJj4OGmdcvowiklB9W%2Fuploads%2FqNskr6Fx7UXXb2yPMtS7%2Fimage.png?alt=media&#x26;token=0de6e026-b897-47fe-9e2f-80ce09b478a1" alt=""><figcaption><p>Data Quality Dimension</p></figcaption></figure>

{% hint style="info" %}
**Scope** (Where to Check)

* Target table(s) and column(s)
* SQL query defining what data to examine
* Filters to narrow focus (e.g., only active customers)
  {% endhint %}

<figure><img src="https://1051758685-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2Fw1qJj4OGmdcvowiklB9W%2Fuploads%2FO8zilgIxd6t11anTyIJ2%2Fimage.png?alt=media&#x26;token=222c5a75-ddd6-469e-9f62-ecb07d29d0f4" alt=""><figcaption><p>Set Scope - SQL Query</p></figcaption></figure>

{% hint style="info" %}
**Logic** (How to Measure)

* SQL query returning three critical numbers:
  * `total_count`: All records in scope
  * `scopeCount`: Records that meet quality standards
  * `nonCompliant`: Records that fail quality standards
    {% endhint %}

<figure><img src="https://1051758685-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2Fw1qJj4OGmdcvowiklB9W%2Fuploads%2FSg5J8xlprmD9hv1Gj5sK%2Fimage.png?alt=media&#x26;token=184e4995-8d8e-4dde-ac14-ad451ebf9c32" alt=""><figcaption></figcaption></figure>

{% hint style="info" %}
**Thresholds** (When to Act)

* Percentage or count that triggers PASS/WARNING/FAIL
* Different actions for different severity levels

**Actions** (What Happens Next)

* Set status for visibility
* Apply tags for filtering
* Trigger webhooks for integration (JIRA ticket, email notifications, etc)
  {% endhint %}
  {% endtab %}

{% tab title="Contraints" %}
{% hint style="danger" %}

#### Single Table Target Selection

Pentaho Data Catalog has a key architectural constraint you must understand before creating rules:

You can only select <mark style="color:red;">**ONE**</mark> table or column as the target resource for a business rule.

This means:

* When configuring the rule scope, you select a single table from the catalog
* Quality scores and tags are applied <mark style="color:red;">**ONLY**</mark> to that selected table
* The rule result appears in the catalog only for that one table/column

Your SQL query has <mark style="color:green;">**NO**</mark> limitation&#x73;**:**

* You can JOIN as many tables as needed
* You can reference any tables in your database
* Full SQL capabilities are available
* Complex multi-table logic is supported
  {% endhint %}

{% hint style="info" %}

#### **Practical Impact**

**Scenario:** You want to validate that `SalesOrderHeader.SubTotal` equals the sum of `SalesOrderDetail.LineTotal`
{% endhint %}

**What happens:**

```
Your SQL Query:
- References: SalesOrderHeader AND SalesOrderDetail (2 tables)

Your Target Resource Selection in UI:
- Can only select: SalesOrderHeader OR SalesOrderDetail (1 table)
- NOT both

Result Applied:
- Quality tags, status, and scores only show on the selected table
- The other table won't show this rule's results in the catalog
```

{% hint style="info" %}

#### **Why This Matters**

* **Catalog Visibility:** Users browsing the catalog only see quality info on the target table
* **Tag Application:** Only the target table gets tagged with quality status
* **Design Decisions:** You must choose which table is "primary" for each rule
* **Workaround Planning:** Multi-table scenarios require strategic rule design
  {% endhint %}

***

{% hint style="info" %}

#### Decision Framework: Choosing the Target Table

When your rule involves multiple tables, which one should you select as the target?
{% endhint %}

{% tabs %}
{% tab title="What questions to ask?" %}
{% hint style="info" %}

#### Decision Criteria

**Question 1: Which table do users primarily interact with?**

* If users typically query `SalesOrderHeader` and rarely touch `SalesOrderDetail`
* **Select:** `SalesOrderHeader` (the user-facing table)
* **Why:** Quality visibility where users need it most

**Question 2: Where does the data quality problem originate?**

* If calculation errors come from wrong `SubTotal` in header (not line items)
* **Select:** `SalesOrderHeader` (the source of the problem)
* **Why:** Tags and status flag the problematic table

**Question 3: Which table is the "parent" in the relationship?**

* In `SalesOrderHeader` (1) → `SalesOrderDetail` (many) relationship
* **Select:** `SalesOrderHeader` (the parent)
* **Why:** Logical hierarchy - order owns its details

**Question 4: Which table failure has bigger business impact?**

* If wrong subtotal in `SalesOrderHeader` affects invoicing (critical)
* But wrong line item in `SalesOrderDetail` only affects reporting (less critical)
* **Select:** `SalesOrderHeader` (higher business impact)
* **Why:** Prioritize visibility for highest-risk data

**Question 5: Which table is in the data lineage flow first?**

* If data flows: Source → `SalesOrderHeader` → `SalesOrderDetail` → Warehouse
* **Select:** `SalesOrderHeader` (earlier in flow)
* **Why:** Catch problems upstream before they propagate
  {% endhint %}

***

{% hint style="info" %}

#### Decision Matrix

This matrix shows real-world scenarios where your SQL query references multiple tables, but you can only select ONE as the target in Data Catalog.
{% endhint %}

<table data-full-width="true"><thead><tr><th width="289">Scenario</th><th width="254">Target Table Selection</th><th width="224">Rationale</th></tr></thead><tbody><tr><td>Order subtotal = sum of line items</td><td><code>SalesOrderHeader</code></td><td>Parent table, user-facing, higher impact</td></tr><tr><td>Customer has valid address</td><td><code>Sales.Customer</code></td><td>Parent table, users browse customers not addresses</td></tr><tr><td>Product price consistency over time</td><td><code>Production.Product</code></td><td>Parent table, users search products not price history</td></tr><tr><td>Employee in valid department</td><td><code>HumanResources.Employee</code></td><td>Parent table, employee record is what users view</td></tr><tr><td>Inventory quantity valid</td><td><code>Production.ProductInventory</code></td><td>This is the table users query for stock levels</td></tr></tbody></table>
{% endtab %}

{% tab title="Strategies" %}

{% endtab %}
{% endtabs %}

{% tabs %}
{% tab title="Order Subtotal" %}
{% hint style="info" %}

#### **Business Problem**&#x20;

You need to validate that the `SubTotal` field in the order header actually equals the sum of all the line item prices for that order.
{% endhint %}

**Tables Involved:**

* `Sales.SalesOrderHeader` (has the SubTotal field)
* `Sales.SalesOrderDetail` (has individual line item prices)

**SQL Query:**

```sql
SELECT COUNT(*) AS total_count, ...
FROM Sales.SalesOrderHeader h
LEFT JOIN (
    SELECT SalesOrderID, SUM(LineTotal) AS Total
    FROM Sales.SalesOrderDetail
    GROUP BY SalesOrderID
) d ON h.SalesOrderID = d.SalesOrderID
```

**The Choice:**&#x20;

✅ **Select:** `SalesOrderHeader` as target&#x20;

❌ **Don't Select:** `SalesOrderDetail` as target

**Why SalesOrderHeader?**

1. **Parent table** - It's the "one" in a one-to-many relationship (one order has many line items)
2. **User-facing** - When someone searches for "order 12345", they look at the header, not individual line items
3. **Higher business impact** - The SubTotal is what appears on the invoice sent to customers
4. **Field location** - The field you're validating (SubTotal) is IN the SalesOrderHeader table

**What Happens in PDC:**

When the rule runs and finds problems:

* ✅ Users browsing `SalesOrderHeader` table see: "⚠️ Warning: Calculation accuracy 95%"
* ✅ Quality tags appear on SalesOrderHeader: "calculation\_error", "needs\_review"
* ❌ Users browsing `SalesOrderDetail` table see: Nothing (no quality indicator)

**Practical Example:**

Imagine a business analyst searching the data catalog:

* Searches for "sales orders"
* Finds `SalesOrderHeader` table
* Sees quality badge: "Last validated: Today, Status: WARNING"
* Reads: "10 orders have incorrect subtotals"
* Now knows to investigate before using this data

If you had selected `SalesOrderDetail` instead:

* That same analyst looking at orders wouldn't see any quality warning
* They'd use bad data without knowing there's a problem
* The warning would only show on the detail table (which they're not browsing)
  {% endtab %}

{% tab title="Valid Address" %}
{% hint style="info" %}

#### **Business Problem**&#x20;

You need to verify that every customer has a valid shipping address on file.
{% endhint %}

**Tables Involved:**

* `Sales.Customer` (the customer record)
* `Person.BusinessEntityAddress` (links customers to addresses)
* `Person.Address` (the actual address data)

**SQL Query:**

```sql
SELECT COUNT(*) AS total_count, ...
FROM Sales.Customer c
LEFT JOIN Person.BusinessEntityAddress ba ON c.PersonID = ba.BusinessEntityID
LEFT JOIN Person.Address a ON ba.AddressID = a.AddressID
WHERE ba.AddressTypeID = 3  -- Shipping address
```

**Choice:**&#x20;

✅ **Select:** `Sales.Customer` as target&#x20;

❌ **Don't Select:** `Person.Address` as target

**Why Sales.Customer?**

1. **Parent table** - Customer is the primary entity you're evaluating
2. **Users browse customers** - Marketing, sales, and customer service teams search for customers, not addresses
3. **Business question** - "Do our customers have valid data?" not "Are our addresses valid?"
4. **Actionable insight** - You need to know WHICH customers are missing addresses, not which addresses are missing customers

**What Happens in PDC:**

Scenario: Rule finds 50 customers without shipping addresses

If you selected `Sales.Customer`:

* ✅ Customer table shows: "Customer data quality: 95% complete"
* ✅ Marketing manager searches customers, sees quality issue
* ✅ Tag on Customer table: "incomplete\_shipping\_data"
* ✅ Manager knows to fix customer records before campaign

If you selected `Person.Address` instead:

* ❌ Address table shows quality status (but nobody browses address table directly)
* ❌ Marketing manager searches customers, sees NO warning
* ❌ Runs campaign, discovers 50 customers can't receive shipments
* ❌ Too late - campaign already sent

**Real-World Impact:**

A customer service rep is looking up customer "John Smith":

* Opens customer record in the catalog
* Sees quality indicator: "⚠️ No shipping address"
* Contacts customer BEFORE processing order
* Prevents delivery failure
  {% endtab %}

{% tab title="Product Price" %}
{% hint style="info" %}

#### **Business Problem**&#x20;

You want to flag products where the price changed by more than 50% (possible data entry error).
{% endhint %}

**Tables Involved:**

* `Production.Product` (current product info)
* `Production.ProductListPriceHistory` (historical prices)

**SQL Query:**

```sql
SELECT COUNT(*) AS total_count, ...
FROM Production.ProductListPriceHistory h
WHERE ABS(h.ListPrice - LAG(h.ListPrice) OVER (...)) / LAG(h.ListPrice) OVER (...) > 0.5
```

**Choice:**&#x20;

✅ **Select:** `Production.Product` as target&#x20;

❌ **Don't Select:** `Production.ProductListPriceHistory` as target

**Why Production.Product?**

1. **Parent table** - Product is the entity being evaluated
2. **Users search products** - People look up "Product ABC", not "price history for product ABC"
3. **Current relevance** - Price issues affect the PRODUCT, not the historical record
4. **Actionable** - Product manager needs to know "this product has pricing issues", not "this historical record has issues"

**What Happens in PDC:**

Scenario: Product "Mountain Bike" price jumped from $500 to $5,000 (likely typo - should be $550)

If you selected `Production.Product`:

* ✅ Product manager searches "Mountain Bike"
* ✅ Sees: "⚠️ Recent price change flagged for review"
* ✅ Investigates, finds $5,000 is error, corrects to $550
* ✅ Fixes before any customer sees wrong price

If you selected `Production.ProductListPriceHistory` instead:

* ❌ Price history table shows quality issue (but who browses that?)
* ❌ Product manager looks at current product, sees no warning
* ❌ Wrong price goes live on website
* ❌ Customers see $5,000 price, don't buy
* ❌ Or worse: system uses $5,000 in quotes/invoices

**Practical Scenario:**

E-commerce team is updating the website:

* Pulls product list from catalog
* Sees quality badge on "Mountain Bike": "Price validation failed"
* Holds off on publishing that product
* Escalates to pricing team to verify
* Prevents embarrassing/costly error
  {% endtab %}

{% tab title="Employee Validation" %}
{% hint style="info" %}

#### **Business Problem**&#x20;

Verify every employee is assigned to a department that exists in the system.
{% endhint %}

**Tables Involved:**

* `HumanResources.Employee` (employee records)
* `HumanResources.Department` (department master list)

**SQL Query:**

```sql
SELECT COUNT(*) AS total_count, ...
FROM HumanResources.Employee e
LEFT JOIN HumanResources.Department d ON e.DepartmentID = d.DepartmentID
WHERE d.DepartmentID IS NULL  -- Orphaned employees
```

**Choice:**&#x20;

✅ **Select:** `HumanResources.Employee` as target&#x20;

❌ **Don't Select:** `HumanResources.Department` as target

**Why HumanResources.Employee?**

1. **Dependent table** - Employee depends on Department (has the foreign key)
2. **Employee records are what users view** - HR looks up employees, not departments
3. **Problem location** - The employee record has the broken reference
4. **Remediation target** - Need to fix employee records (reassign to valid dept)

**What Happens in PDC:**

Scenario: 3 employees are assigned to "Department 99" which no longer exists (was merged into another dept)

If you selected `HumanResources.Employee`:

* ✅ HR manager searches employee "Jane Doe"
* ✅ Sees: "⚠️ Invalid department assignment"
* ✅ Reassigns Jane to correct department
* ✅ Payroll processes correctly (tied to department)

If you selected `HumanResources.Department` instead:

* ❌ Department table shows...nothing useful (the department doesn't exist!)
* ❌ HR manager looks at employee, sees no warning
* ❌ Payroll runs, fails for those 3 employees
* ❌ Paychecks delayed

**Business Impact:**

Quarterly headcount report is generated:

* Pulls from Employee table
* Catalog shows quality issue: "3 employees with invalid dept"
* Report team fixes data BEFORE generating report
* Executive presentation has accurate numbers
* No awkward "our numbers don't add up" moment
  {% endtab %}

{% tab title="Inventory Validation" %}
{% hint style="info" %}

#### **Business Problem**

Ensure inventory quantities are within reasonable ranges (not negative, not impossibly high).
{% endhint %}

**Tables Involved:**

* `Production.ProductInventory` (only this table!)

**SQL Query:**

```sql
SELECT COUNT(*) AS total_count, ...
FROM Production.ProductInventory
WHERE Quantity < 0 OR Quantity > 10000
```

**Choice:**&#x20;

✅ **Select:** `Production.ProductInventory` as target (This is actually a single-table rule, so no alternative!)

**Why ProductInventory?**

1. **Only table involved** - Rule only queries this one table
2. **Direct data source** - Warehouse team queries this table for stock levels
3. **Operational data** - This is the "working" table that drives decisions

**What Happens in PDC:**

Scenario: Someone accidentally entered "1000000" instead of "100" for product quantity

If you selected `ProductInventory`:

* ✅ Warehouse manager checks stock levels
* ✅ Sees quality indicator: "⚠️ Invalid quantities detected"
* ✅ Reviews flagged records before planning reorders
* ✅ Corrects data before ordering product
  {% endtab %}
  {% endtabs %}
  {% endtab %}

{% tab title="Second Tab" %}

{% endtab %}
{% endtabs %}

x

{% tabs %}
{% tab title="First Tab" %}
x
{% endtab %}

{% tab title="Second Tab" %}

{% endtab %}
{% endtabs %}

x


---

# Agent Instructions: Querying This Documentation

If you need additional information that is not directly available in this page, you can query the documentation dynamically by asking a question.

Perform an HTTP GET request on the current page URL with the `ask` query parameter:

```
GET https://academy.pentaho.com/pentaho-data-catalog-en/data-catalog/business-rules.md?ask=<question>
```

The question should be specific, self-contained, and written in natural language.
The response will contain a direct answer to the question and relevant excerpts and sources from the documentation.

Use this mechanism when the answer is not explicitly present in the current page, you need clarification or additional context, or you want to retrieve related documentation sections.
