Data Canvas & Collections
Before we start processing and profiling the data, let's cover the main features of the Data Canvas, the primary view for browsing the data and its metadata.
The catalog consists of categorized data assets, such as files, tables, and schemas. Related assets can be grouped into a Collection, defined by a shared purpose or business context.
Once a Collection has been enriched with, for example, quality scores and sensitivity levels, it can be published as a Data Product. This publishing action changes its lifecycle state and makes it available in global search results and for broader consumption. For example, a Customer Insights Dataset under the Marketing Category can be published as a Data Product once it includes profiling results, sensitivity tagging, and trust scoring.
Log into Data Catalog:
Username: [email protected]
Password: Welcome123!
Security Advisory: Handling Login Credentials
For enhanced security, it is strongly recommended that users avoid saving their login details directly in web browsers. Browsers may inadvertently autofill these credentials in unrelated fields, posing a security risk.
Best Practice
• Disable Autofill: To mitigate potential risks, users should disable the autofill functionality for login credentials in their browser settings. This preventive measure ensures that sensitive information is not unintentionally exposed or misused.
Highlight the connection: mssql:adventureworks2022.
Click on the Details tab:

To view schema details, highlight a schema and click View:


At this stage we have only completed Stage 1: Metadata Ingestion.

Collections
A Collection is a way to logically group data assets, such as schemas, tables, and files, so that you can work with them more efficiently. Whether you are analyzing similar datasets or combining diverse data sources, Collections allow you to organize and manage your data entities based on structure or business use case.
Key Characteristics
Collections are designed for business users, not database administrators. They can pull tables from multiple database schemas and present them with business-friendly names and descriptions. Different user roles see different Collections based on their needs - sales teams see sales data, HR sees employee data.
Data Catalog supports two types of Collections:
Dataset: A Dataset is a group of homogeneous data assets, such as tables or files that share the same schema.
Data Collection: A Data Collection is a group of heterogeneous data assets, such as files, tables, or schemas, with different structures.
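The distinction between the two types can be sketched in a few lines of Python. This is an illustration only, not Data Catalog's API: the `Asset` class, the `suggest_collection_type` function, and the file and column names are hypothetical, and "same schema" is assumed to mean an identical set of column names and types.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Asset:
    """Hypothetical stand-in for a catalogued table or file."""
    name: str
    columns: tuple  # (column_name, data_type) pairs


def suggest_collection_type(assets):
    """Return "Dataset" if all assets share one structure, else "Data Collection"."""
    if not assets:
        raise ValueError("a Collection needs at least one asset")
    structures = {frozenset(a.columns) for a in assets}
    return "Dataset" if len(structures) == 1 else "Data Collection"


# Two monthly extracts with identical columns: homogeneous.
jan = Asset("orders_2024_01.csv", (("order_id", "int"), ("total", "decimal")))
feb = Asset("orders_2024_02.csv", (("order_id", "int"), ("total", "decimal")))

# A table with a different structure makes the group heterogeneous.
customer = Asset("Sales.Customer", (("CustomerID", "int"), ("PersonID", "int")))

print(suggest_collection_type([jan, feb]))            # Dataset
print(suggest_collection_type([jan, feb, customer]))  # Data Collection
```

In practice the type is chosen in the user interface when the Collection is saved; the sketch only shows the rule of thumb behind that choice.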
How is this different?
Instead of writing complex SQL joins across multiple schemas, users browse pre-organized Collections and drag-and-drop the data they need. The system automatically handles the technical complexity behind the scenes.
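For comparison, the kind of hand-written query that a Collection spares business users might look like the sketch below. The join keys follow the standard AdventureWorks sample schema; the selected columns and the `CUSTOMER_ORDERS_SQL` name are illustrative, and this is not a query that Data Catalog itself generates.

```python
# Illustrative only: a manual join across the Person and Sales schemas.
CUSTOMER_ORDERS_SQL = """
SELECT p.FirstName,
       p.LastName,
       a.City,
       soh.OrderDate,
       soh.TotalDue
FROM Sales.Customer AS c
JOIN Person.Person AS p
  ON p.BusinessEntityID = c.PersonID
JOIN Sales.SalesOrderHeader AS soh
  ON soh.CustomerID = c.CustomerID
JOIN Person.Address AS a
  ON a.AddressID = soh.ShipToAddressID
"""

# List the tables the query touches, in join order.
tables = [line.split()[1] for line in CUSTOMER_ORDERS_SQL.splitlines()
          if line.startswith(("FROM", "JOIN"))]
print(tables)
```

Writing this requires knowing which key links each pair of tables, which is exactly the knowledge a pre-organized Collection lets users do without.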
Business Benefits
Collections enable self-service analytics by making data accessible to business users without technical expertise. They ensure consistent data definitions across teams, improve data governance, and dramatically reduce the time from question to insight. Teams can focus on analysis rather than data preparation.
Customer Analytics Collection
A Customer Analytics Collection might include:
• Person.Person - Basic personal information
• Person.Address - Customer addresses
• Sales.Customer - Customer business data
• Sales.SalesOrderHeader - Purchase history summary
Business Value: This gives marketing teams everything they need to analyze customer behavior without understanding table relationships.
Select Collections & Create a New Category.

Select: Create Category & Create.

Repeat to Create a Group - Customer Analytics

Select: Data Canvas
Expand Person & Sales schemas & select:
Person
Person.Person
Person.Address
Sales
Sales.Customer
Sales.SalesOrderHeader
Click: 'Add to Cart'

Save as Collection.

Select: Customer - Parent Group & Create.

Remember to select: Collection - the tables span several schemas.
View the Details of your Collection.

Click on: Actions.
Customer 360
A Customer 360 Collection might include:
Tables Included:
• Person.Person - Basic personal information
• Sales.Customer - Customer business data
• Person.Address - Customer addresses
• Person.EmailAddress - Contact information
• Person.PersonPhone - Phone numbers
• Sales.CustomerAddress - Address relationships
• Sales.SalesOrderHeader - Purchase history summary
• Sales.SalesTerritory - Geographic context
Business Value: This gives marketing teams everything they need to analyze customer behavior without understanding table relationships.
Financial Performance
For finance teams analyzing revenue, costs, and profitability:
Tables Included:
• Sales.SalesOrderHeader - Order totals and dates
• Sales.SalesOrderDetail - Line item details
• Production.Product - Product costs and pricing
• Sales.SalesTerritory - Regional performance
• Sales.SalesPerson - Salesperson commissions
• Purchasing.PurchaseOrderHeader - Cost data
• Production.TransactionHistory - Inventory costs
Business Value: Finance can analyze profit margins, sales trends, and cost analysis across products and regions.
Supply Chain & Inventory
For operations teams managing inventory and procurement:
Tables Included:
• Production.Product - Product specifications
• Production.ProductInventory - Current stock levels
• Purchasing.PurchaseOrderHeader - Purchase orders
• Purchasing.PurchaseOrderDetail - Purchase details
• Purchasing.Vendor - Supplier information
• Production.Location - Warehouse locations
• Production.TransactionHistory - Inventory movements
• Production.WorkOrder - Manufacturing orders
Business Value: Operations teams can monitor stock levels, supplier performance, and production scheduling.
Data Product
A Data Product is the published, production-ready version of a Collection or Dataset in Pentaho Data Catalog. It transforms working data assets into verified, discoverable resources available for broader organizational use through a formal publishing process.
From Collections to Data Product
Collections start as working drafts where data stewards organize related tables and add metadata. Once enriched with quality scores, sensitivity tags, business terms, and trust ratings, they can be published as Data Products. This publishing changes their lifecycle state and makes them globally searchable.
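The effect of publishing on search visibility can be modelled with a small sketch. The `LifecycleState` values, the `Collection` class, and the `global_search` function are hypothetical names used for illustration; the sketch assumes that only published Collections appear in global search, as the description above implies.

```python
from dataclasses import dataclass
from enum import Enum


class LifecycleState(Enum):
    DRAFT = "draft"
    PUBLISHED = "published"


@dataclass
class Collection:
    name: str
    state: LifecycleState = LifecycleState.DRAFT

    def publish(self):
        """Move the Collection to the published state (it is now a Data Product)."""
        self.state = LifecycleState.PUBLISHED


def global_search(collections, term):
    """Return names of published Collections whose name contains the term."""
    return [c.name for c in collections
            if c.state is LifecycleState.PUBLISHED and term.lower() in c.name.lower()]


catalog = [Collection("Customer Analytics"), Collection("Customer 360")]
print(global_search(catalog, "customer"))  # []

catalog[0].publish()
print(global_search(catalog, "customer"))  # ['Customer Analytics']
```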
Customer Analytics Collection
The Customer Analytics Collection contains these tables:
• Person.Person - Basic personal information
• Person.Address - Customer addresses
• Sales.Customer - Customer business data
• Sales.SalesOrderHeader - Purchase history summary
It would undergo enrichment, receiving a 75% quality score, PII sensitivity tagging, high trust ratings, and business term mapping. Once these standards are met, it becomes the "Customer Analytics Data Product."
Publishing Criteria
Organizations typically set minimum thresholds like 60% data quality, assigned sensitivity levels, and acceptable trust scores. However, Collections can still be published even if they don't meet all criteria, giving organizations flexibility in their governance approach.
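A readiness check along these lines could be sketched as follows. The function name, the parameter names, and the trust threshold of 50 are assumptions made for illustration; only the 60% quality threshold comes from the text above. Because publishing is not blocked, the function reports gaps rather than refusing to publish.

```python
def unmet_criteria(quality_score, sensitivity_level, trust_score,
                   min_quality=60, min_trust=50):
    """List the publishing criteria a Collection does not yet meet."""
    gaps = []
    if quality_score < min_quality:
        gaps.append(f"quality {quality_score}% is below {min_quality}%")
    if sensitivity_level is None:
        gaps.append("no sensitivity level assigned")
    if trust_score < min_trust:
        gaps.append(f"trust score {trust_score} is below {min_trust}")
    return gaps


# A Collection with 75% quality, PII tagging, and a hypothetical trust score of 80.
print(unmet_criteria(75, "PII", 80))  # []

# A Collection that falls short on quality and has no sensitivity level.
print(unmet_criteria(55, None, 80))
# ['quality 55% is below 60%', 'no sensitivity level assigned']
```

An empty list means every threshold is met; a non-empty list is advisory and can be reviewed by a data steward before deciding whether to publish anyway.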
Key Benefits
Published Data Products become globally searchable, carry verified status indicating business readiness, and include enhanced metadata for filtering and discovery. This ensures only validated, well-documented data becomes widely available while maintaining proper access controls based on sensitivity levels.
Business Impact
The publishing process creates a clear distinction between work-in-progress Collections and trusted business assets. This helps users identify reliable data for decision-making while ensuring sensitive information like customer PII is properly governed and protected.
