AutoML
Use PDI + H2O AutoML in Colab to prototype a credit-card fraud model.
Imagine that a direct retailer wants to reduce losses due to orders involving fraudulent use of credit cards. They accept orders via phone and their website, and ship goods directly to the customer.
Basic customer details, such as customer name, date of birth, billing address and preferred shipping address, are stored in a relational database.
Orders, as they come in, are stored in a database. There is also a report of historical instances of fraud contained in a CSV spreadsheet.
Before you start
You need Colab access and a working PDI environment. Do the setup first if you have not done it yet.
Complete Prerequisite tasks.
In PDI, you will run autoML.ktr.
Location:
~/Workshop--Data-Integration/Labs/Module 7 - Machine Learning/AutoML
From that transformation, you will create data/H2O.csv. Upload that file to Colab when prompted.
In this workshop, you will:
Prepare data (wrangling).
Create features (feature engineering).
Use H2O AutoML to shortlist candidate models.
Train and evaluate a model in Colab.
Save the best model artifact.

Run through the following steps to determine the best ML model for the dataset:
Data preparation
Use PDI to join customer, transaction, and historical fraud data. Create a single training dataset for AutoML.
Start PDI
Open the transformation:
autoML.ktr
~/Workshop--Data-Integration/Labs/Module 7 - Machine Learning/AutoML

Browse the various customer data sources:
Feature engineering
Feature engineering creates new fields that improve model signal. Here, you derive fields like age, order time-of-day, and zip-code match.
Example derived field:
billing_shipping_zip_equal = [customer_billing_zip] = [ship_to_zip]

There are steps for deriving additional fields that might be useful for predictive modeling. These include computing the customer's age, extracting the hour of the day the order was placed, and setting a flag to indicate whether the shipping and billing addresses have the same zip code.
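The derived fields described above can be sketched in pandas. This is an illustration, not the PDI implementation; the input column names (birth_date, order_timestamp) are assumptions standing in for the fields in the PDI stream.

```python
import pandas as pd

# Hypothetical sample of the joined order data
orders = pd.DataFrame({
    "birth_date": pd.to_datetime(["1980-06-15", "2000-01-20"]),
    "order_timestamp": pd.to_datetime(["2024-03-01 02:30", "2024-03-01 14:05"]),
    "customer_billing_zip": ["30301", "94105"],
    "ship_to_zip": ["30301", "10001"],
})

now = pd.Timestamp("2024-03-01")

# Customer age in whole years (approximate, ignoring leap-day edge cases)
orders["age"] = (now - orders["birth_date"]).dt.days // 365

# Hour of day the order was placed (fraud often clusters at odd hours)
orders["hour_of_day"] = orders["order_timestamp"].dt.hour

# Flag: do the billing and shipping zip codes match?
orders["billing_shipping_zip_equal"] = (
    orders["customer_billing_zip"] == orders["ship_to_zip"]
)

print(orders[["age", "hour_of_day", "billing_shipping_zip_equal"]])
```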
H2O
So, what does the data scientist do at this point?
Typically, they will want to get a feel for the data by examining simple summary statistics and visualizations. Next come quick techniques for assessing the relationship between individual attributes (fields) and the target of interest, which in this example is the reported_as_fraud_historic field.
Following that, if there are attributes that look promising, quick tests with common supervised classification algorithms will be next on the list. This comprises the initial stages of experimental data mining – that is, the process of determining which predictive techniques are going to give the best result for a given problem.
Create the H2O dataset
Run the transformation: autoML.ktr
Preview the results:
data/H2O.csv

This will be the dataset used for AutoML in Colab.
H2O AutoML trains and ranks multiple models for you. It also builds ensembles and returns a leaderboard, so you can pick a strong baseline quickly.
Connect a runtime
Connect to a hosted runtime.

Upload the notebook
In Colab, select File > Upload notebook.
Upload:
~/Workshop--Data-Integration/Labs/Module 7 - Use Cases/Machine Learning/AutoML/data/credit_card_fraud.ipynb

Upload the dataset
When the notebook prompts for a file upload, upload H2O.csv. Create it by running autoML.ktr in PDI.
AutoML Script
These are the code sections for the Jupyter file: credit_card_fraud.ipynb:
Install the h2o libraries:
Installs H2O.ai's machine learning platform
The -q flag enables quiet mode (minimal output).
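The install cell is likely a one-liner along these lines (a sketch; inside a notebook it is prefixed with !):

```shell
# Install H2O into the Colab runtime; -q keeps pip output minimal
pip install -q h2o
```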
Import libraries:
max_mem_size='4G': allocates 4 GB of RAM to the H2O cluster.
nthreads=-1: uses all available CPU cores.
Upload data:
Purpose: Interactive file upload widget in Google Colab
Expected file: H2O.csv (semicolon-separated)
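The upload cell likely resembles this sketch; the fallback branch is an addition so the snippet also runs outside Colab:

```python
try:
    from google.colab import files   # available only inside Google Colab
    uploaded = files.upload()        # interactive widget; select H2O.csv
except ImportError:
    uploaded = {}                    # outside Colab: read H2O.csv from disk instead
```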

Load data:
Critical steps:
Custom separator: uses ; instead of the default ,.
Manual column naming: the original file has no headers.
Target conversion: converts fraud to a string for H2O classification.
Feature breakdown:
first_time_customer: binary indicator (risky for fraud)
order_dollar_amount: transaction value
num_items: cart size
age: customer age
web_order: online vs in-store
total_transactions: customer history
hour_of_day: timing patterns (fraud often occurs at odd hours)
billing_shipping_match: address mismatch (a major fraud signal)
fraud: target variable (0 = legitimate, 1 = fraud)
Your upstream transformation may use a different target field name (for example, reported_as_fraud_historic). In this notebook, the target column is named fraud.
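A sketch of the loading step, with a two-row inline sample standing in for the uploaded H2O.csv (the real notebook reads the uploaded file instead). The column list follows the feature breakdown above.

```python
import io
import pandas as pd

# Inline stand-in for H2O.csv: semicolon-separated, no header row
sample = io.StringIO(
    "1;250.00;3;43;1;12;2;0;1\n"
    "0;49.99;1;24;1;87;14;1;0\n"
)
columns = ["first_time_customer", "order_dollar_amount", "num_items", "age",
           "web_order", "total_transactions", "hour_of_day",
           "billing_shipping_match", "fraud"]

# Custom separator and manual column naming, as described above
df = pd.read_csv(sample, sep=";", header=None, names=columns)

# String target so H2O treats this as classification, not regression
df["fraud"] = df["fraud"].astype(str)
print(df.shape)
```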
Distribution check:
Purpose: Reveals class imbalance
Why this matters:
Fraud datasets are typically highly imbalanced (99% legitimate, 1% fraud)
This visualization confirms the need for balance_classes=True later.
Shows whether we have enough fraud cases to train effectively.
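The distribution check amounts to counting target classes; a sketch on a synthetic target with a realistic ~1% fraud rate:

```python
import pandas as pd

# Synthetic stand-in: 990 legitimate rows, 10 fraud rows (~1% fraud)
target = pd.Series(["0"] * 990 + ["1"] * 10, name="fraud")

counts = target.value_counts()
print(counts)                        # class counts
print(counts / counts.sum() * 100)   # class percentages
```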
Convert to H2O frame:
Key operations:
H2OFrame conversion: Moves data into H2O's distributed framework
Factor conversion: Tells H2O this is classification (not regression)
75/25 split: Standard training/testing division
Seed=42: Ensures reproducible results - controls randomness
Feature/target separation: Prepares for model training
Why this split?:
75% gives enough data to learn patterns
25% provides robust performance estimates
Random splitting prevents temporal bias
Run Models (Top 5)
Configuration explained:
max_models=5: keeps runtime short.
max_runtime_secs=900: hard stop at 15 minutes.
balance_classes=True: important for imbalanced fraud data.
nfolds=5: cross-validation for more stable metrics.
sort_metric='AUC': simple default ranking metric.
Algorithm selection rationale:
GBM (Gradient Boosting Machine)
Usually #1 performer for fraud
Handles complex patterns
Native H2O implementation
XGBoost
Industry standard for tabular data
Fast training
Excellent with imbalanced data
GLM (Generalized Linear Model)
Fast baseline
Interpretable coefficients
Good for linear relationships
DRF (Distributed Random Forest)
Ensemble of decision trees
Handles non-linear patterns
Robust to outliers
DeepLearning
Neural network
Captures complex interactions
May find unexpected patterns
Why no standalone decision trees? Ensemble methods (GBM, XGBoost, DRF) combine hundreds of trees and almost always outperform a single tree.
Model Leaderboard:
Output includes:
Model IDs
AUC (Area Under ROC Curve)
AUCPR (Area Under Precision-Recall Curve)
Mean per-class error
RMSE/MSE (less relevant for classification)
Best metric for fraud: AUCPR (handles class imbalance better than AUC)
Get Best Model:
Metrics explained for fraud detection:
AUC (0.95+ is excellent): Overall discrimination ability
Accuracy: Misleading with imbalanced data (can be high even missing all fraud)
Precision: Of flagged transactions, what % are actually fraud?
Recall: Of all fraud cases, what % did we catch? (Most critical for fraud)
F1 Score: Balance between precision and recall
Real-world tradeoff:
High recall = catch more fraud (but more false alarms)
High precision = fewer false alarms (but miss more fraud)
Confusion Matrix:
Business impact:
FN (False Negatives): Direct financial loss from undetected fraud
FP (False Positives): Customer friction from declined legitimate transactions
Optimal balance: Depends on fraud costs vs customer experience impact
Feature Importance:
Why this matters:
Explainability: Understand what drives fraud predictions
Feature engineering: Focus efforts on most impactful features
Regulatory compliance: Many industries require model interpretability
Data quality: Verify that sensible features are important
Expected top features:
billing_shipping_match (address verification)
order_dollar_amount (unusually large transactions)
total_transactions (customer history)
Save Model:
What this saves: an H2O model artifact you can load back into H2O.
Deployment options:
Load in production H2O cluster
Export a MOJO for low-latency scoring (separate step)
Export to PMML if your workflow requires it (separate step)
Summary Report:
Purpose: quick documentation for:
Model tracking
Stakeholder reporting
Experiment comparison
Regulatory audit trails
After you validate the notebook, you can move the logic into PDI. Use the AutoML results to decide which models to operationalize.
Python Executor
The Python Executor step lets you run Python code as part of a Pentaho Data Integration (PDI) transformation.
This step is designed to help developers and data scientists focus on Python-based analytics and algorithms while using PDI for common ETL work such as connecting to sources, joining, and filtering.
You can run Python with either:
Row-by-row processing: PDI maps each incoming row to Python variables and runs the script once per row.
All-rows processing: PDI transfers the full dataset at once (for example, into a pandas DataFrame) and runs the script.
This step supports the CPython runtime only.
To configure inputs:
Click the Input tab.
Map PDI fields into Python variables.
Select All rows to process the full dataset at once (for example, as a pandas DataFrame or a list of dicts); otherwise the script runs once per row.
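In All rows mode, a script body might look like this sketch; the input variable name df and the output convention are assumptions standing in for whatever names you configure in the step's mappings:

```python
import pandas as pd

# Stand-in for the PDI input mapping: inside the step, df would arrive
# pre-populated as a pandas DataFrame holding all incoming rows.
df = pd.DataFrame({
    "order_dollar_amount": [250.0, 49.99, 1200.0],
    "billing_shipping_match": [0, 1, 0],
})

# Score a simple risk flag across the whole dataset at once
df["high_risk"] = (df["order_dollar_amount"] > 1000) & (df["billing_shipping_match"] == 0)

# The frame assigned to the step's output variable flows back into the PDI stream
output = df
```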
Available variables
Use the plus sign button to add a Python variable to the input mapping for the script used in the transformation. You can remove the Python variable by clicking the X icon.
Variable name
Enter the name of the Python variable. The list of available variables will update automatically.
Step
Specify the name of the input step to map from. It can be any step in the parent transformation with an outgoing hop connected to the Python Executor step.
Data structure
Specify the data structure from which you want to pull the fields for mapping. You can select one of the following:
· Pandas data frame: the tabular data structure for Python/Pandas.
· NumPy array: the table of values, all the same type, which is indexed by a tuple of positive integers.
· Python List of Dictionaries: each row in the PDI stream becomes a Python dictionary. All the dictionaries are put into a Python list.
In Mapping, configure these fields:
Data structure field
The value of the Python data structure field to which you want to map the PDI field.
Data structure type
The value of the data structure type assigned to the data structure field to which you want to map the PDI field.
PDI field
The name of the PDI field which contains the vector data stored in the mapped Python variable.
PDI data type
The value of the data type assigned to the PDI field, such as a date, a number, or a timestamp.
Gradient Boosting Machine
Gradient Boosting Machine (GBM) is an ensemble learning algorithm that builds a series of weak decision trees sequentially, where each new tree focuses on correcting the errors made by the previous ones. The final prediction is a weighted combination of all these trees, resulting in a highly accurate model.
In fraud detection, GBM is particularly effective because it handles the inherent class imbalance (fraud being rare compared to legitimate transactions) and can capture complex, non-linear patterns in transaction data - such as unusual spending amounts, atypical merchant categories, or irregular timing - that simple rule-based systems would miss.
It also provides feature importance rankings, allowing analysts to understand which variables (e.g., transaction velocity, geolocation mismatches, or device fingerprints) are the strongest indicators of fraudulent behavior, making the model both powerful and interpretable for investigation teams.

Gradient Boosting Machine works through an iterative, additive process:
Step 1: Start with a baseline. The algorithm begins with a simple initial prediction, often just the average of the target variable (e.g., "fraud" or "not fraud").
Step 2: Calculate the errors (residuals). It measures how far off that initial prediction is from the actual values. These errors are called residuals.
Step 3: Fit a weak learner to the residuals. A small, shallow decision tree is trained - not on the original target, but on the errors from the previous step. The goal is to learn the pattern in what the model got wrong.
Step 4: Update the prediction. The output of this new tree is added to the existing prediction, scaled down by a learning rate (a small number like 0.1) to prevent overcorrecting. So the updated model becomes:
New Prediction = Old Prediction + (learning rate × New Tree's output)
Step 5: Repeat. Steps 2–4 are repeated for hundreds or thousands of iterations. Each new tree corrects a little more of the remaining error.
The "gradient" in the name comes from the fact that the algorithm uses gradient descent - the same optimization technique used in neural networks - to minimize a loss function (e.g., log-loss for classification). The residuals it fits at each step are actually the negative gradients of that loss function.
Over many rounds, these individually weak trees combine into a very strong predictive model. In fraud detection, this means early trees might learn broad patterns (like "very large transactions are riskier"), while later trees pick up subtle, nuanced signals (like "this card was used in two countries within an hour") that distinguish genuine fraud from normal behavior.
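Steps 1-5 can be sketched from scratch with NumPy. This is a minimal illustration using squared-error residuals and one-feature decision stumps; real GBM implementations for classification fit the negative gradients of log-loss instead.

```python
import numpy as np

def fit_stump(x, residuals):
    """Find the threshold split on x that best predicts the residuals."""
    best = None
    for t in np.unique(x):
        left, right = residuals[x <= t], residuals[x > t]
        if len(left) == 0 or len(right) == 0:
            continue
        sse = ((left - left.mean()) ** 2).sum() + ((right - right.mean()) ** 2).sum()
        if best is None or sse < best[0]:
            best = (sse, t, left.mean(), right.mean())
    return best[1:]  # (threshold, left_value, right_value)

def predict_stump(stump, x):
    t, lv, rv = stump
    return np.where(x <= t, lv, rv)

# Toy data: larger orders are riskier
x = np.array([10.0, 20.0, 30.0, 500.0, 800.0, 900.0])
y = np.array([0.0, 0.0, 0.0, 1.0, 1.0, 1.0])

pred = np.full_like(y, y.mean())   # Step 1: baseline prediction
lr = 0.1                           # learning rate
for _ in range(100):               # Step 5: repeat
    residuals = y - pred                        # Step 2: current errors
    stump = fit_stump(x, residuals)             # Step 3: weak learner on residuals
    pred = pred + lr * predict_stump(stump, x)  # Step 4: scaled additive update

print(np.round(pred, 2))  # converges toward [0. 0. 0. 1. 1. 1.]
```

Each round shrinks the remaining error by a factor of (1 - learning rate), which is why many small steps beat one large one.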
Install GBM
Execute the following script to install GBM.

Verify.
Metrics cheat sheet (fraud detection)
Use these metrics to compare models and pick a decision threshold. Fraud data is usually highly imbalanced.
What to look at first
AUCPR: best single-number metric for imbalanced classification.
Recall: how much fraud you catch (minimize false negatives).
Precision: how many alerts are real fraud (minimize false positives).
Use AUC as a sanity check
AUC measures how well the model ranks fraud above legitimate transactions. It can look “good” even when precision is poor at useful thresholds.
Why accuracy is a trap
If fraud is 1% of transactions, a model can be 99% accurate and useless. It can do that by predicting “legitimate” for every row.
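The trap in numbers, on a synthetic 1%-fraud dataset:

```python
# 1,000 transactions, 1% fraud; a model that predicts "legitimate"
# for every row is 99% accurate yet catches zero fraud.
actual = [1] * 10 + [0] * 990   # 1 = fraud
predicted = [0] * 1000           # always predict "legitimate"

accuracy = sum(a == p for a, p in zip(actual, predicted)) / len(actual)
caught = sum(1 for a, p in zip(actual, predicted) if a == 1 and p == 1)
recall = caught / sum(actual)

print(f"accuracy={accuracy:.0%}, recall={recall:.0%}")  # accuracy=99%, recall=0%
```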
Practical selection guide
If fraud losses are expensive, prioritize recall.
If customer friction is expensive, prioritize precision.
Use AUCPR to compare candidates before tuning thresholds.
What to ignore (most of the time)
RMSE/MSE: regression-style errors on probabilities. Not decision-friendly here.