Sensitivity Level & Trust Score

Dataflow

The solution consists of three main components:

  1. Entity Extraction Tool - Extracts all entities with hierarchical names from OpenSearch

  2. Pentaho Data Integration - Joins your calculated values with entity data

  3. Bulk Update Tool - Updates Trust Score and Sensitivity via API or OpenSearch

Expected Outcomes

  • Automated bulk updates of Trust Score (0-100) and Sensitivity (HIGH/MEDIUM/LOW)

  • Support for schema, table, and column level updates

  • Validation and error reporting

  • Scalable solution for thousands of entities
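
For orientation, the final update step can also be performed directly against OpenSearch. Below is a minimal, hypothetical sketch using the standard _bulk API; the pdc_entities index name appears later in this guide, but the trust_score and sensitivity field names are assumptions about the document mapping, not something the extraction tool confirms.

# Hypothetical sketch: update one entity's scores via the OpenSearch _bulk API.
# The trust_score/sensitivity field names are assumed, not verified.
curl -s -X POST "http://localhost:9200/pdc_entities/_bulk" \
  -H 'Content-Type: application/x-ndjson' \
  --data-binary $'{"update":{"_id":"ef60e629-4261-4ce6-8635-961ca4b1b420"}}\n{"doc":{"trust_score":90,"sensitivity":"LOW"}}\n'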


Entity Extraction

The extraction process retrieves all entities from your data catalog with their hierarchical relationships intact.

What Gets Extracted:

  • Entity unique identifiers (UUIDs)

  • Entity types (SCHEMA/TABLE/COLUMN)

  • Hierarchical names for joining

  • Current Trust Score and Sensitivity values

  • Fully qualified names (FQDNs)
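
The extraction tool pages through the catalog in batches (the sample output below shows 100 entities per batch). As a rough illustration, not the tool's actual implementation, one such page could be fetched like this:

# Illustrative sketch only: fetch one page of 100 entities from the catalog
# index (pdc_entities is the index name used later in this guide). A real
# extraction pages onward by increasing "from", or uses scroll/search_after.
curl -s "http://localhost:9200/pdc_entities/_search?size=100&from=0" \
  -H 'Content-Type: application/json' \
  -d '{"query": {"match_all": {}}}'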

Learning Objectives:

  • Understand the entity extraction process

  • Extract all entities with hierarchical names from your data catalog

  • Analyze the extracted data structure

  • Prepare data for joining with calculated metrics

  1. Run the extraction script:

# Change to Key_Metrics directory
cd /home/pdc/Projects/APIs/Key_Metrics

# Extract all entities with full details
extract-entities \
    --opensearch-url http://localhost:9200 \
    --output data/output/entity_extraction.csv \
    --verbose

Expected output:

🔍 Extracting all entities from data catalog...
📊 Processing batch 1/25 (100 entities)...
📊 Processing batch 2/25 (100 entities)...
...
✅ Extracted 1,247 entities to data/output/entity_extraction.csv

📈 Entity Summary:
   - COLUMN: 892
   - TABLE: 343
   - SCHEMA: 12

  2. Take a look at entity_extraction.csv. The new_trust_score and new_sensitivity columns are left empty for the values you will calculate:

| Column | Description | Example |
| --- | --- | --- |
| entity_id | Unique identifier (needed for bulk updates) | ef60e629-4261-4ce6-8635-961ca4b1b420 |
| entity_type | Type of entity, used for filtering | SCHEMA, TABLE, COLUMN |
| entity_name | The entity's actual name | Employee |
| schema_name | Schema name for joining | HumanResources |
| table_name | Table name for joining (empty for schemas) | Employee |
| column_name | Column name for joining (empty for schemas/tables) | FirstName |
| fqdn | Internal fully qualified name | 688cc7b9c5759eae5fdcba07/... |
| fqdn_display | Human-readable path | mssql:adventureworks2022/... |
| current_trust_score | Existing trust score (if any) | 48 |
| current_sensitivity | Existing sensitivity (if any) | HIGH |
| new_trust_score | Empty; for your calculated values | (empty) |
| new_sensitivity | Empty; for your calculated values | (empty) |
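
Concretely, the header row plus one column-level record look like this (values taken from the examples above; the fqdn values are truncated):

entity_id,entity_type,entity_name,schema_name,table_name,column_name,fqdn,fqdn_display,current_trust_score,current_sensitivity,new_trust_score,new_sensitivity
ef60e629-4261-4ce6-8635-961ca4b1b420,COLUMN,FirstName,HumanResources,Employee,FirstName,688cc7b9c5759eae5fdcba07/...,mssql:adventureworks2022/...,48,HIGH,,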

  3. Run a data quality analysis:

# Check that the file was created, then review the raw CSV
ls -la data/output/entity_extraction.csv
head -5 data/output/entity_extraction.csv
wc -l data/output/entity_extraction.csv

# Analyze extracted data
echo "=== Entity Type Distribution ==="
cut -d',' -f2 data/output/entity_extraction.csv | tail -n +2 | sort | uniq -c

echo -e "\n=== Top 5 Schemas ==="
cut -d',' -f4 data/output/entity_extraction.csv | tail -n +2 | sort | uniq -c | sort -rn | head -5

echo -e "\n=== Entities with Complete Hierarchy ==="
grep -v ',,,,' data/output/entity_extraction.csv | wc -l

echo -e "\n=== Entities with Missing Names ==="
grep -c ',,,' data/output/entity_extraction.csv

echo -e "\n=== Sample Complete Entities ==="
grep -v ',,,,' data/output/entity_extraction.csv | head -3

# Count by schema
cut -d',' -f4 data/output/entity_extraction.csv | tail -n +2 | sort | uniq -c | head -10

# Find all schemas
awk -F',' '$2=="SCHEMA" {print $4}' data/output/entity_extraction.csv | sort | uniq

# Find all tables in HumanResources schema
awk -F',' '$2=="TABLE" && $4=="HumanResources" {print $5}' data/output/entity_extraction.csv

# Find all columns in Employee table
awk -F',' '$2=="COLUMN" && $4=="HumanResources" && $5=="Employee" {print $6}' data/output/entity_extraction.csv
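
One caveat about the cut/awk recipes above: splitting on every comma with -F',' miscounts fields if a quoted value contains an embedded comma (long fqdn_display paths are plausible candidates). If your export quotes such fields and GNU awk is available, FPAT offers a safer split; this is an optional, assumed alternative rather than part of the toolchain:

# Safer CSV splitting with GNU awk: FPAT treats a quoted field as one column
# even when it contains commas; quotes are stripped before matching.
gawk -v FPAT='([^,]*)|("[^"]*")' \
     '{gsub(/"/, "", $2); gsub(/"/, "", $4)} $2 == "SCHEMA" {print $4}' \
     data/output/entity_extraction.csv | sort -u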

  4. Prepare sample data for testing.

Create Test Dataset:

# Extract the first 50 entities (51 lines including the header row)
head -51 data/output/entity_extraction.csv > data/output/test_entities.csv

Create Sample Join Data:

# Create a sample calculated metrics CSV for testing
cat > data/input/sample_calculated_metrics.csv << 'EOF'
schema_name,table_name,column_name,calculated_trust_score,calculated_sensitivity
HumanResources,,,75,HIGH
HumanResources,Employee,,85,MEDIUM
HumanResources,Employee,FirstName,90,LOW
HumanResources,Employee,LastName,90,LOW
HumanResources,Employee,EmailAddress,70,HIGH
Sales,,,80,MEDIUM
Sales,Customer,,85,LOW
Sales,Customer,CustomerID,95,LOW
EOF
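
As a quick preview of the join Pentaho will perform in the next section, you can match the two files on the (schema_name, table_name, column_name) key with awk. This is only a sketch and assumes simple CSVs with no quoted or embedded commas:

# Sketch of the hierarchical-name join: map (schema,table,column) to the
# calculated values, then attach them to matching entity IDs.
awk -F',' 'NR==FNR && FNR>1 {m[$1","$2","$3]=$4","$5; next}
           FNR>1 {k=$4","$5","$6; if (k in m) print $1","k","m[k]}' \
    data/input/sample_calculated_metrics.csv \
    data/output/entity_extraction.csv | head -5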

  5. Validate the extraction results.

Check Data Completeness:

# Verify all expected columns are present
head -1 data/output/entity_extraction.csv | tr ',' '\n' | nl

# Check for any parsing errors
grep -n 'ERROR\|WARN' data/output/entity_extraction.csv || echo "No errors found"

Verify Hierarchical Names:

# Check schema-level entities
echo "=== SCHEMA ENTITIES ==="
awk -F',' '$2=="SCHEMA" {print "Schema: " $4 " (ID: " $1 ")"}' data/output/entity_extraction.csv | head -5

# Check table-level entities  
echo "=== TABLE ENTITIES ==="
awk -F',' '$2=="TABLE" {print "Table: " $4 "." $5 " (ID: " $1 ")"}' data/output/entity_extraction.csv | head -5

# Check column-level entities
echo "=== COLUMN ENTITIES ==="
awk -F',' '$2=="COLUMN" {print "Column: " $4 "." $5 "." $6 " (ID: " $1 ")"}' data/output/entity_extraction.csv | head -5
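
As a final cross-check, the per-type counts in the file should match the summary the extraction tool printed (COLUMN: 892, TABLE: 343, SCHEMA: 12 in the example above):

# Cross-check: recompute entity counts per type directly from the CSV.
awk -F',' 'NR>1 {c[$2]++} END {for (t in c) print t": "c[t]}' \
    data/output/entity_extraction.csv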

Common Issues & Solutions

Issue: "No entities found"

# Check OpenSearch connection
curl -s "http://localhost:9200/pdc_entities/_search?size=1"

# Check if OpenSearch container is running
docker ps | grep opensearch
Issue: "Empty schema/table/column names"
This is normal for some entity types
Focus on entities with complete hierarchical names for joining
Issue: "Extraction takes too long"
bash
# Extract smaller subset for testing
extract-entities --opensearch-url http://localhost:9200 --output data/output/small_test.csv
# Then manually limit the query size in the extraction tool
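
If you just want to gauge total volume before a full run, a document count against the index is cheap (same pdc_entities index as the connection check above):

# Quick volume check: total document count in the catalog index.
curl -s "http://localhost:9200/pdc_entities/_count"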

Section 3 Checklist

  • Full entity extraction completed successfully

  • CSV file created with the expected structure

  • Entity counts match expected numbers (schemas, tables, columns)

  • Hierarchical names (schema_name, table_name, column_name) populated correctly

  • Sample calculated metrics CSV created for testing

  • Data quality checks completed

  • No critical errors in the extraction process

Ready for Next Section

With entity extraction complete, you now have:

  • A complete entity inventory with IDs

  • Hierarchical names for joining

  • Current Trust Score and Sensitivity values

  • Test data for validation

Next: Section 4 - Pentaho Data Integration (Joining Process)
