Sensitivity Level & Trust Scores
This hands-on workshop teaches you how to implement an automated solution for bulk updating Trust Scores and Sensitivity levels across your entire data catalog. You'll learn to extract entity data, calculate metrics, join data using Pentaho, and perform bulk updates efficiently.
By the end of this workshop, you will be able to:
Extract entity data from your data catalog with hierarchical names
Join calculated Trust Score and Sensitivity values using Pentaho Data Integration
Perform bulk updates across all schemas, tables, and columns
Validate and monitor the update process
Troubleshoot common issues

Entity Extraction
The extraction process retrieves all entities from your data catalog with their hierarchical relationships intact.
What Gets Extracted:
Entity unique identifiers (UUIDs)
Entity types (SCHEMA/TABLE/COLUMN)
Hierarchical names for joining
Current Trust Score and Sensitivity values
Fully qualified domain names (FQDNs)
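Under the hood, the entities live in an OpenSearch index (the troubleshooting section later in this page queries pdc_entities), and extract-entities pages through them in batches. As a rough illustration only, you can peek at a few raw entity documents yourself; the exact query body and field names the tool uses may differ:
# Illustrative manual query against the entity index (_search is a standard OpenSearch endpoint)
curl -s -H 'Content-Type: application/json' \
  "http://localhost:9200/pdc_entities/_search" \
  -d '{"size": 3, "query": {"match_all": {}}}'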
Learning Objectives:
Understand the entity extraction process
Extract all entities with hierarchical names from your data catalog
Analyze the extracted data structure
Prepare data for joining with calculated metrics
Run the extraction script:
# Change to Key_Metrics directory
cd /home/pdc/Projects/APIs/Key_Metrics
# Extract all entities with full details
extract-entities \
--opensearch-url http://localhost:9200 \
--output data/output/entity_extraction.csv \
--verbose
Expected output:
🔍 Extracting all entities from data catalog...
📊 Processing batch 1/25 (100 entities)...
📊 Processing batch 2/25 (100 entities)...
...
✅ Extracted 1,247 entities to data/output/entity_extraction.csv
📈 Entity Summary:
- COLUMN: 892
- TABLE: 343
- SCHEMA: 12
Take a look at the columns in entity_extraction.csv:
| Column | Description | Example |
| --- | --- | --- |
| entity_id | Unique identifier | ef60e629-4261-4ce6-8635-961ca4b1b420 |
| entity_type | Type of entity | SCHEMA, TABLE, COLUMN |
| entity_name | Entity's actual name | Employee |
| schema_name | Schema name for joining | HumanResources |
| table_name | Table name (empty for schemas) | Employee |
| column_name | Column name (empty for schemas/tables) | FirstName |
| fqdn | Internal fully qualified name | 688cc7b9c5759eae5fdcba07/... |
| fqdn_display | Human-readable path | mssql:adventureworks2022/... |
| current_trust_score | Existing trust score | 48 |
| current_sensitivity | Existing sensitivity | HIGH |
| new_trust_score | For your calculated values | (empty) |
| new_sensitivity | For your calculated values | (empty) |
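Put together, a single COLUMN row assembled from the example values above would look roughly like this (the fqdn values are truncated here just as in the table; this is an illustration, not a row from your extraction):
ef60e629-4261-4ce6-8635-961ca4b1b420,COLUMN,FirstName,HumanResources,Employee,FirstName,688cc7b9c5759eae5fdcba07/...,mssql:adventureworks2022/...,48,HIGH,,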
Run a Data Quality Analysis
# Analyze extracted data
echo "=== Entity Type Distribution ==="
cut -d',' -f2 data/output/entity_extraction.csv | tail -n +2 | sort | uniq -c
echo -e "\n=== Top 5 Schemas ==="
cut -d',' -f4 data/output/entity_extraction.csv | tail -n +2 | sort | uniq -c | sort -rn | head -5
echo -e "\n=== Entities with Complete Hierarchy ==="
grep -v ',,,,' data/output/entity_extraction.csv | wc -l
echo -e "\n=== Sample Complete Entities ==="
grep -v ',,,,' data/output/entity_extraction.csv | head -3
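To reproduce the entity-type summary the extractor prints, a single awk pass over the CSV works, assuming the column layout shown above:
# Tally rows by entity_type (field 2), skipping the header row
awk -F',' 'NR>1 {count[$2]++} END {for (t in count) print t ": " count[t]}' data/output/entity_extraction.csv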
Step 1: Run Complete Extraction
cd /home/pdc/Projects/APIs/Key_Metrics
# Extract all entities with full details
extract-entities --opensearch-url http://localhost:9200 --output data/output/entity_extraction.csv
Expected Output Messages
Extracting all entities from data catalog...
✅ Extracted 1,247 entities to data/output/entity_extraction.csv
Edit the 'new_trust_score' and 'new_sensitivity' columns with your desired values
Entity Summary:
COLUMN: 892
SCHEMA: 12
TABLE: 343
Step 2: Analyze Extracted Data
View Sample Data
bash
# Check file was created
ls -la data/output/entity_extraction.csv
# View first 5 rows
head -5 data/output/entity_extraction.csv
# Count total entities (the count includes the header row)
wc -l data/output/entity_extraction.csv
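If the raw CSV is hard to scan, you can view just the hierarchical name fields (columns 2 through 6) for the first few rows:
# Show only entity_type, entity_name, schema_name, table_name, column_name
cut -d',' -f2-6 data/output/entity_extraction.csv | head -5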
Understanding the CSV Structure
csv
entity_id,entity_type,entity_name,schema_name,table_name,column_name,fqdn,fqdn_display,current_trust_score,current_sensitivity,new_trust_score,new_sensitivity
Column Explanations:
entity_id: Unique ID needed for bulk updates
entity_type: SCHEMA/TABLE/COLUMN for filtering
entity_name: The actual name of the entity
schema_name: Schema name for joining
table_name: Table name for joining (empty for schema-level)
column_name: Column name for joining (empty for schema/table-level)
fqdn: Internal fully qualified name
fqdn_display: Human-readable path
current_trust_score: Existing trust score (if any)
current_sensitivity: Existing sensitivity (if any)
new_trust_score: Empty - for your calculated values
new_sensitivity: Empty - for your calculated values
Step 3: Data Quality Checks
Check for Missing Names
bash
# Count rows with consecutive empty fields (entities missing part of the hierarchy)
grep -c ',,,' data/output/entity_extraction.csv
# View entities with complete hierarchical names
grep -v ',,,' data/output/entity_extraction.csv | head -10
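The comma-pattern checks above are approximate. For a stricter check that looks at the schema_name field itself (column 4 in the header shown in Step 2), assuming no field values contain embedded commas:
# Count data rows where schema_name (field 4) is empty
awk -F',' 'NR>1 && $4=="" {n++} END {print n+0 " entities with an empty schema_name"}' data/output/entity_extraction.csv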
Analyze Entity Distribution
bash
# Count by entity type
cut -d',' -f2 data/output/entity_extraction.csv | sort | uniq -c
# Count by schema
cut -d',' -f4 data/output/entity_extraction.csv | sort | uniq -c | head -10
Sample Analysis Commands
bash
# Find all schemas
awk -F',' '$2=="SCHEMA" {print $4}' data/output/entity_extraction.csv | sort | uniq
# Find all tables in HumanResources schema
awk -F',' '$2=="TABLE" && $4=="HumanResources" {print $5}' data/output/entity_extraction.csv
# Find all columns in Employee table
awk -F',' '$2=="COLUMN" && $4=="HumanResources" && $5=="Employee" {print $6}' data/output/entity_extraction.csv
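One more analysis that pays off before building the join: counting columns per table shows which tables will need the most rows in your calculated-metrics file. This sketch uses the same field layout as the commands above:
# Tables with the most columns (schema.table, descending)
awk -F',' '$2=="COLUMN" {print $4 "." $5}' data/output/entity_extraction.csv | sort | uniq -c | sort -rn | head -10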
Step 4: Prepare Sample Data for Testing
Create Test Dataset
bash
# Extract first 50 entities for testing
head -51 data/output/entity_extraction.csv > data/output/test_entities.csv
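The first 50 rows are usually dominated by a single schema. If you would rather test against a random sample, keep the header and append shuffled data rows (shuf is part of GNU coreutils):
# Random 50-entity test set: header first, then 50 shuffled data rows
head -1 data/output/entity_extraction.csv > data/output/test_entities.csv
tail -n +2 data/output/entity_extraction.csv | shuf -n 50 >> data/output/test_entities.csv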
Create Sample Join Data
bash
# Create a sample calculated metrics CSV for testing
cat > data/input/sample_calculated_metrics.csv << 'EOF'
schema_name,table_name,column_name,calculated_trust_score,calculated_sensitivity
HumanResources,,,75,HIGH
HumanResources,Employee,,85,MEDIUM
HumanResources,Employee,FirstName,90,LOW
HumanResources,Employee,LastName,90,LOW
HumanResources,Employee,EmailAddress,70,HIGH
Sales,,,80,MEDIUM
Sales,Customer,,85,LOW
Sales,Customer,CustomerID,95,LOW
EOF
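Before using the sample file for joining, a quick structural check that every line has exactly five fields catches copy-paste mistakes (this assumes no values contain embedded commas):
# Report any line whose field count is not 5; no output means the file is well-formed
awk -F',' 'NF!=5 {print "Line " NR ": " NF " fields"}' data/input/sample_calculated_metrics.csv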
Step 5: Validate Extraction Results
Check Data Completeness
bash
# Verify all expected columns are present
head -1 data/output/entity_extraction.csv | tr ',' '\n' | nl
# Check for any parsing errors
grep -n 'ERROR\|WARN' data/output/entity_extraction.csv || echo "No errors found"
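You can extend the completeness check by verifying that every data row has the same number of fields as the 12-column header, again assuming no embedded commas in field values:
# Count data rows whose field count differs from the 12-column header
awk -F',' 'NR>1 && NF!=12 {bad++} END {print (bad+0) " rows with an unexpected field count"}' data/output/entity_extraction.csv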
Verify Hierarchical Names
bash
# Check schema-level entities
echo "=== SCHEMA ENTITIES ==="
awk -F',' '$2=="SCHEMA" {print "Schema: " $4 " (ID: " $1 ")"}' data/output/entity_extraction.csv | head -5
# Check table-level entities
echo "=== TABLE ENTITIES ==="
awk -F',' '$2=="TABLE" {print "Table: " $4 "." $5 " (ID: " $1 ")"}' data/output/entity_extraction.csv | head -5
# Check column-level entities
echo "=== COLUMN ENTITIES ==="
awk -F',' '$2=="COLUMN" {print "Column: " $4 "." $5 "." $6 " (ID: " $1 ")"}' data/output/entity_extraction.csv | head -5
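It is also worth confirming that entity_id values are unique, since duplicates would make the bulk update target the same entity twice. This check should produce no output:
# Print any entity_id that appears more than once
cut -d',' -f1 data/output/entity_extraction.csv | tail -n +2 | sort | uniq -d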
Common Issues & Solutions
Issue: "No entities found"
bash
# Check OpenSearch connection
curl -s "http://localhost:9200/pdc_entities/_search?size=1"
# Check if OpenSearch container is running
docker ps | grep opensearch
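If the container is up but the search still returns nothing, check cluster health and the document count for the index directly (_cluster/health and _count are standard OpenSearch endpoints):
# Cluster should report green or yellow status
curl -s "http://localhost:9200/_cluster/health?pretty"
# Number of documents in the entity index
curl -s "http://localhost:9200/pdc_entities/_count?pretty"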
Issue: "Empty schema/table/column names"
This is normal for some entity types
Focus on entities with complete hierarchical names for joining
Issue: "Extraction takes too long"
bash
# Extract smaller subset for testing
extract-entities --opensearch-url http://localhost:9200 --output data/output/small_test.csv
# Then manually limit the query size in the extraction tool
Section 3 Checklist
Full entity extraction completed successfully
CSV file created with expected structure
Entity counts match expected numbers (schemas, tables, columns)
Hierarchical names (schema_name, table_name, column_name) populated correctly
Sample calculated metrics CSV created for testing
Data quality checks completed
No critical errors in extraction process
Ready for Next Section
With entity extraction complete, you now have:
Complete entity inventory with IDs
Hierarchical names for joining
Current Trust Score and Sensitivity values
Test data for validation
Next: Section 4 - Pentaho Data Integration (Joining Process)