Key Concepts & Terminology
Online Analytical Processing
What is OLAP?
Online Analytical Processing (OLAP) systems are specialized databases designed for analyzing large volumes of data quickly and efficiently. Unlike OLTP (Online Transaction Processing) systems that handle day-to-day transactions like order processing or inventory updates, OLAP systems focus exclusively on reading and analyzing data. This read-only approach, combined with pre-aggregated data and multidimensional structures, allows OLAP to provide consistently fast query performance for business intelligence and decision-making.

Key Distinguishing Features
OLAP systems differ from traditional relational databases in four fundamental ways:
First, they use multidimensional data structures (often called "cubes") that organize data across multiple business dimensions like time, geography, products, and customers.
Second, they ensure consistently fast data access through pre-aggregation and optimized architectures.
Third, they provide intuitive interfaces that enable both technical analysts and business users to explore data without IT assistance.
Finally, they support complex cross-dimensional calculations, such as comparing current sales as a percentage of total sales across different time periods and regions.
Multidimensional
Business users naturally think and communicate using business terms like "sales by region," "quarterly costs," and "customer segments." They don't think in terms of database tables, foreign keys, or SQL joins. Traditional relational databases force users to understand complex table relationships and translate their business questions into technical queries, creating a significant barrier to data access.
OLAP removes this friction by aligning data structures with business language. Users can simply select "Products" and "Revenue," filter by "Region = Northeast" and "Time = Q3 2024," and get instant answers. The underlying complexity is completely hidden. This means business users can explore data independently without waiting days for IT to write SQL queries.
The traditional approach creates bottlenecks: a sales director asks a simple question, submits a request to IT, waits 2-3 days for a response, receives a static report, realizes they need additional information, and must start the process over. By the time they get answers, business opportunities have passed. OLAP transforms this into a self-service process.

Data Pipeline
The data pipeline begins with ETL processes that extract operational data from source systems, transform it into dimensional structures, and load it into the data warehouse. Tools like Pentaho Data Integration orchestrate this flow by connecting to disparate sources (ERP, CRM, transactional databases), applying business rules and data quality transformations, managing slowly changing dimensions, and conforming dimension attributes across sources to ensure consistency.
ETL typically employs staging areas as intermediate storage zones where raw extracted data lands before transformation, enabling delta load processing that identifies only changed records since the last load rather than full reloads. This staging approach is critical for scalability—incremental loads compare staging data against existing data mart records using timestamps, change data capture, or hash comparisons to determine inserts, updates, and deletes, then apply only these deltas to the dimensional model, minimizing processing time and database load.
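As a sketch, a timestamp-based delta load might compare the staging area against the existing fact table like this (table and column names such as stg_sales and sales_fact are hypothetical, and the UPDATE ... FROM form shown is PostgreSQL-style):

```sql
-- 1. Apply updates for rows that already exist in the mart but have
--    changed since the last load, detected via a last_modified timestamp.
UPDATE sales_fact f
SET    quantity = s.quantity,
       revenue  = s.revenue,
       last_modified = s.last_modified
FROM   stg_sales s
WHERE  f.order_id = s.order_id
  AND  s.last_modified > f.last_modified;

-- 2. Insert rows present in staging but not yet in the mart.
INSERT INTO sales_fact (order_id, product_key, date_key,
                        quantity, revenue, last_modified)
SELECT s.order_id, s.product_key, s.date_key,
       s.quantity, s.revenue, s.last_modified
FROM   stg_sales s
LEFT JOIN sales_fact f ON f.order_id = s.order_id
WHERE  f.order_id IS NULL;
```

Change data capture or hash comparisons follow the same pattern; only the predicate that detects "changed since last load" differs.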
Data marts serve as the foundational physical layer in ROLAP architecture, representing subject-specific dimensional models (star or snowflake schemas) optimized for particular business domains like Sales, Finance, or Inventory. Each data mart contains fact tables with measures and foreign keys, surrounded by denormalized dimension tables that support efficient querying.
The architecture is designed around data marts because they provide focused analytical contexts with manageable complexity, enable parallel development where different teams can build domain-specific marts independently, support incremental deployment rather than requiring a complete enterprise warehouse upfront, and allow performance optimization tailored to specific query patterns. Conformed dimensions (like Time, Geography, Customer) shared across multiple marts enable cross-functional analysis while maintaining independent mart structures.

Mondrian sits on top of this data mart foundation as the ROLAP engine, with XML schemas that map each data mart's physical tables to logical business concepts—Cubes, Dimensions, Hierarchies, and Measures. When users query through OLAP client tools, Mondrian translates MDX into SQL that executes directly against the data mart's relational tables.
This architecture leverages standard RDBMS capabilities (indexing, partitioning, query optimization) while providing multidimensional abstraction. It therefore depends on ETL processes delivering clean, well-structured dimensional models, and on data marts designed with ROLAP query patterns in mind: denormalized dimensions for fewer joins, appropriate indexing on foreign keys and filter columns, and aggregate tables for common summarizations that Mondrian can transparently use for performance optimization.
Normalized vs. Star Schemas
In normalized databases designed for OLTP, a simple question like "What were total sales by product category and region last quarter?" requires joining 10+ tables. You'd need to connect Products, Subcategories, Categories, LineItems, Orders, Customers, Addresses, Cities, States, Countries, and Regions tables. The resulting SQL query becomes complex and difficult to write, understand, and maintain.

A star schema dramatically simplifies this structure. It organizes data into a central fact table (like OrderFacts) surrounded by dimension tables (DimProduct, DimGeography, DimDate, DimCustomer)—resembling a star shape. The fact table contains numeric measures (quantity, revenue, cost, profit) and foreign keys linking to dimensions. Each dimension table is denormalized, storing all related attributes in a single table.

In the star schema, dimension tables are intentionally denormalized. Instead of separating products into Products, Subcategories, and Categories tables, DimProduct contains ProductName, Subcategory, and Category all in one table. Similarly, DimGeography includes City, State, Country, and Region together. This redundancy trades storage efficiency for query performance and simplicity.
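A minimal DDL sketch of these denormalized tables might look as follows (column names and types are illustrative assumptions, not taken from a particular system):

```sql
-- Denormalized product dimension: no separate Subcategory/Category tables.
CREATE TABLE DimProduct (
    product_key  INTEGER PRIMARY KEY,
    product_name VARCHAR(100),
    subcategory  VARCHAR(50),
    category     VARCHAR(50)
);

-- Denormalized geography dimension: City through Region in one table.
CREATE TABLE DimGeography (
    geography_key INTEGER PRIMARY KEY,
    city    VARCHAR(50),
    state   VARCHAR(50),
    country VARCHAR(50),
    region  VARCHAR(50)
);

-- Central fact table: numeric measures plus foreign keys to dimensions.
CREATE TABLE OrderFacts (
    order_key     INTEGER PRIMARY KEY,
    product_key   INTEGER REFERENCES DimProduct (product_key),
    geography_key INTEGER REFERENCES DimGeography (geography_key),
    date_key      INTEGER,          -- would reference a DimDate table
    quantity      INTEGER,
    revenue       DECIMAL(12, 2),
    cost          DECIMAL(12, 2),
    profit        DECIMAL(12, 2)
);
```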
The same query that required joining 11 tables in a normalized schema needs only 3-4 joins in a star schema. This means faster queries, clearer structure, and easier understanding even for non-technical users. Database optimizers are specifically designed to handle star joins efficiently, further improving performance.
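Assuming star schema tables along the lines of OrderFacts, DimProduct, and DimGeography described above, plus a DimDate table with year and quarter columns (all names illustrative), the quarterly question reduces to a short query:

```sql
-- "Total sales by product category and region last quarter":
-- three joins in the star schema instead of ten or more.
SELECT   p.category,
         g.region,
         SUM(f.revenue) AS total_sales
FROM     OrderFacts   f
JOIN     DimProduct   p ON p.product_key   = f.product_key
JOIN     DimGeography g ON g.geography_key = f.geography_key
JOIN     DimDate      d ON d.date_key      = f.date_key
WHERE    d.year = 2024
  AND    d.quarter = 3
GROUP BY p.category, g.region;
```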
Star schemas also make it easy to create and use pre-aggregated summary tables. You can maintain fact tables at different levels of detail—daily sales, weekly sales, monthly sales—and the OLAP engine automatically queries the appropriate level. This dramatically reduces query time for high-level reports.
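For example, a monthly summary table could be rolled up from daily-grain facts with a single aggregation (table names here are illustrative):

```sql
-- Hypothetical pre-aggregated summary table at monthly grain.
-- High-level reports can read this instead of the daily fact table.
CREATE TABLE agg_sales_monthly AS
SELECT   f.product_key,
         f.geography_key,
         d.year,
         d.month,
         SUM(f.quantity) AS quantity,
         SUM(f.revenue)  AS revenue
FROM     OrderFacts f
JOIN     DimDate d ON d.date_key = f.date_key
GROUP BY f.product_key, f.geography_key, d.year, d.month;
```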
ROLAP Mondrian
Mondrian is an open source ROLAP (Relational Online Analytical Processing) engine that provides access to data in a way that’s intuitive to users. As an engine, Mondrian can be run in a web container, such as Tomcat or JBoss (WildFly), or be embedded as part of an application.
Mondrian requires only a schema defining the logical structure of the data and a database populated with that data; a configuration file is optional. Mondrian works with most databases that support JDBC (Java Database Connectivity).

The schema file is Mondrian's most critical component—it defines the logical business structure of your data. Written in XML format, the schema maps relational database tables to multidimensional concepts that users understand. This is where you define dimensions, hierarchies, measures, and calculated metrics. The schema essentially teaches Mondrian how to translate business language into database queries.
A schema file defines cubes, which are multidimensional representations of your data. For example, a Sales cube might include dimensions for Time, Geography, Product, and Customer, with measures like Revenue, Quantity, and Profit. Each dimension is defined with its hierarchies: the Geography dimension might have a hierarchy of Region → Country → State → City. The schema specifies which database tables and columns contain this information.
The schema also defines how dimensions relate to the fact table. For a Sales cube, you'd specify that the fact table is "sales_fact" and that it joins to "dim_product" using the product_key column, to "dim_geography" using geography_key, and so on. Mondrian uses these definitions to automatically construct SQL joins when users query the cube. Advanced schema features include calculated members (like "Profit Margin = Profit / Revenue"), custom aggregation functions, and security rules.
Here's a simplified example of what a schema looks like:
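The sketch below uses illustrative table and column names (sales_fact, dim_product, dim_date) rather than ones from a real deployment:

```xml
<Schema name="SalesSchema">
  <Cube name="Sales">
    <!-- The fact table the cube is based on -->
    <Table name="sales_fact"/>

    <Dimension name="Product" foreignKey="product_key">
      <Hierarchy hasAll="true" primaryKey="product_key">
        <Table name="dim_product"/>
        <Level name="Category" column="category"/>
        <Level name="Subcategory" column="subcategory"/>
        <Level name="Product" column="product_name"/>
      </Hierarchy>
    </Dimension>

    <Dimension name="Time" foreignKey="date_key">
      <Hierarchy hasAll="true" primaryKey="date_key">
        <Table name="dim_date"/>
        <Level name="Year" column="year"/>
        <Level name="Quarter" column="quarter"/>
        <Level name="Month" column="month"/>
      </Hierarchy>
    </Dimension>

    <!-- Measures aggregated with SUM over the fact table -->
    <Measure name="Revenue" column="revenue" aggregator="sum"/>
    <Measure name="Quantity" column="quantity" aggregator="sum"/>
  </Cube>
</Schema>
```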
This schema tells Mondrian that there's a Sales cube based on the sales_fact table, with Product and Time dimensions that have specific hierarchies, and Revenue and Quantity measures that should be summed. When a user asks for "Revenue by Product Category and Year," Mondrian knows exactly which tables to query and how to aggregate the data.
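In MDX, that request might look like the following sketch (the member names assume Product and Time hierarchies with Category and Year levels, as described above):

```mdx
SELECT
  {[Measures].[Revenue]} ON COLUMNS,
  CrossJoin([Product].[Category].Members,
            [Time].[Year].Members) ON ROWS
FROM [Sales]
```

Mondrian translates this into SQL joins between the fact table and the product and time dimension tables, grouped by category and year.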
