# AutoML

{% hint style="info" %}
Imagine that a direct retailer wants to reduce losses due to orders involving fraudulent use of credit cards. They accept orders via phone and their website, and ship goods directly to the customer.

Basic customer details, such as customer name, date of birth, billing address and preferred shipping address, are stored in a relational database.

Orders, as they come in, are stored in a database. There is also a report of historical instances of fraud contained in a CSV spreadsheet.
{% endhint %}

{% hint style="warning" %}

#### **Before you start...**

You need Colab access and a working PDI environment. Do the setup first if you have not done it yet.

* Complete [Prerequisite tasks](https://academy.pentaho.com/pentaho-data-integration/use-cases/machine-learning/prerequiste-tasks).
* In PDI, you will run `autoML.ktr`.
  * Location: `~/Workshop--Data-Integration/Labs/Module 7 - Machine Learning/AutoML`
* From that transformation, you will create `data/H2O.csv`.
  * Upload that file to Colab when prompted.
    {% endhint %}

{% embed url="<https://www.loom.com/share/52058cfd36b34a0286d5961ab393c9ad?hideEmbedTopBar=true&hide_owner=true&hide_share=true&hide_title=true>" %}
Walkthrough (video)
{% endembed %}

{% hint style="info" %}
In this workshop, you will:

* Prepare data (wrangling).
* Create features (feature engineering).
* Use H2O AutoML to shortlist candidate models.
* Train and evaluate a model in Colab.
* Save the best model artifact.
  {% endhint %}

<figure><img src="https://3680356391-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2FZpCSy6Skj215f4oWypdc%2Fuploads%2FlpxpehYAuc5AwcSUkW3N%2Fimage.png?alt=media&#x26;token=458c5226-364d-4e11-990b-3f72111f23d9" alt=""><figcaption><p>AutoML</p></figcaption></figure>

{% file src="<https://3680356391-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2FZpCSy6Skj215f4oWypdc%2Fuploads%2FlzUH4Kyr1F8V0V0HCeeM%2Fcustomer.csv?alt=media&token=59c2467d-f5ca-4553-b319-80e10ddb2a74>" %}

{% file src="<https://3680356391-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2FZpCSy6Skj215f4oWypdc%2Fuploads%2FYGOdfhSVceivbo7VzxBO%2Fcustomer_billing.csv?alt=media&token=58ccaf29-19fd-4219-b0d9-7dec9c87ba8c>" %}

{% file src="<https://3680356391-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2FZpCSy6Skj215f4oWypdc%2Fuploads%2FBD7Mr9vGB3uKnVCEXnpJ%2FH2O.csv?alt=media&token=3459d05b-dbc3-4e4f-9dde-72fc95e99ca1>" %}

{% file src="<https://3680356391-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2FZpCSy6Skj215f4oWypdc%2Fuploads%2FgKUwFhgvDIlHwcOzpszp%2Ftransaction_details.csv?alt=media&token=cd34cadb-80eb-48e7-bac8-6136c820c28f>" %}

{% file src="<https://3680356391-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2FZpCSy6Skj215f4oWypdc%2Fuploads%2Fivdpw4pQcA1qlCRekBTd%2Ftransaction_fraud_report.csv?alt=media&token=9e23841d-f296-4af0-8aa8-1cd27fd6ec0b>" %}

{% file src="<https://3680356391-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2FZpCSy6Skj215f4oWypdc%2Fuploads%2FnVJfHPjwtnjoVBpZCSGs%2FautoML.ktr?alt=media&token=4626ebc3-92d6-4527-99c7-ce0bde72bc3a>" %}

***

Run through the following steps to determine the best ML model for the dataset:

{% tabs %}
{% tab title="1. Data Preparation" %}
{% hint style="info" %}

#### Data preparation

Use PDI to join customer, transaction, and historical fraud data. Create a single training dataset for AutoML.
{% endhint %}

1. Start PDI

```bash
cd ~/Pentaho/design-tools/data-integration
./spoon.sh
```

2. Open the transformation: `autoML.ktr`

`~/Workshop--Data-Integration/Labs/Module 7 - Machine Learning/AutoML`

<figure><img src="https://3680356391-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2FZpCSy6Skj215f4oWypdc%2Fuploads%2Fgit-blob-5f19a81841b1443666da443d236366d243942dfa%2Fimage.png?alt=media" alt=""><figcaption><p>data wrangling</p></figcaption></figure>

3. Browse the various customer data sources:

{% tabs %}
{% tab title="1. customer\_data" %}
{% hint style="info" %}

#### **Customer Data**

This is where you will find the `customer_billing_zip` codes, which will be used in feature engineering:
{% endhint %}

<figure><img src="https://3680356391-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2FZpCSy6Skj215f4oWypdc%2Fuploads%2Fgit-blob-407d5276495cdc648a3831317f4e91dd58315674%2Fimage.png?alt=media" alt=""><figcaption><p>customer_data</p></figcaption></figure>
{% endtab %}

{% tab title="2. customer\_billing" %}
{% hint style="info" %}

#### **Customer Billing**

References the customer transaction data.
{% endhint %}
{% endtab %}

{% tab title="3. customer\_transaction" %}
{% hint style="info" %}

#### **Customer Transaction**

* Customer transaction details.
* Feature engineering for `ship_to_zip`.
* The transaction details (the `x` features) are what the model uses to predict fraud (the `y` target).
* Boolean values must be converted to numbers for many algorithms (for example, Random Forest); see the sketch below.
  {% endhint %}

<figure><img src="https://3680356391-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2FZpCSy6Skj215f4oWypdc%2Fuploads%2Fgit-blob-3f58dbe091b902a84235f6cc5982b94fb97a3fd6%2Fimage.png?alt=media" alt=""><figcaption><p>customer transaction</p></figcaption></figure>
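A minimal pandas sketch of that boolean-to-number conversion (the column names are illustrative; in the lab this is handled inside the PDI transformation):

```python
import pandas as pd

# Hypothetical boolean transaction fields.
df = pd.DataFrame({"web_order": [True, False, True],
                   "first_time_customer": [False, False, True]})

# Convert boolean flags to 0/1 so numeric algorithms such as Random Forest can use them.
cols = ["web_order", "first_time_customer"]
df[cols] = df[cols].astype(int)
print(df)
```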
{% endtab %}

{% tab title="4. fraud\_details" %}
{% hint style="info" %}

#### **Fraud Details**

Indicates whether the transaction was historically reported as fraudulent:
{% endhint %}

<figure><img src="https://3680356391-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2FZpCSy6Skj215f4oWypdc%2Fuploads%2Fgit-blob-6446f4584ceb877e00bf2c99b36dd6b3bb580e7f%2Fimage.png?alt=media" alt=""><figcaption><p>fraud report</p></figcaption></figure>
{% endtab %}
{% endtabs %}
{% endtab %}

{% tab title="2. Feature engineering" %}
{% hint style="info" %}

#### Feature engineering

Feature engineering creates new fields that improve model signal. Here, you derive fields like age, order time-of-day, and zip-code match.
{% endhint %}

Example derived field:

`billing_shipping_zip_equal = [customer_billing_zip] = [ship_to_zip]`

<figure><img src="https://3680356391-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2FZpCSy6Skj215f4oWypdc%2Fuploads%2Fgit-blob-c1211b899ac30b7bb5a75502e657f32d0aeb71a2%2Fimage%20(244).png?alt=media" alt=""><figcaption><p>Feature Engineering</p></figcaption></figure>

{% hint style="info" %}
There are steps for deriving additional fields that might be useful for predictive modeling. These include computing the customer's age, extracting the hour of the day the order was placed, and setting a flag to indicate whether the shipping and billing addresses have the same zip code.
{% endhint %}
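For reference, here is a minimal pandas sketch of the same three derivations (age, order hour, and zip-code match). The input column names are assumptions for illustration; in the lab these fields are produced by `autoML.ktr`:

```python
import pandas as pd

# Hypothetical input columns standing in for the joined customer/order data.
orders = pd.DataFrame({
    "date_of_birth": ["1985-03-14", "2001-11-02"],
    "order_timestamp": ["2024-05-01 02:15:00", "2024-05-01 14:40:00"],
    "customer_billing_zip": ["30301", "94105"],
    "ship_to_zip": ["30301", "10001"],
})
orders["date_of_birth"] = pd.to_datetime(orders["date_of_birth"])
orders["order_timestamp"] = pd.to_datetime(orders["order_timestamp"])

# Approximate customer age in whole years at order time.
orders["age"] = (orders["order_timestamp"] - orders["date_of_birth"]).dt.days // 365

# Hour of day the order was placed.
orders["hour_of_day"] = orders["order_timestamp"].dt.hour

# Flag (0/1): do the billing and shipping zip codes match?
orders["billing_shipping_zip_equal"] = (
    orders["customer_billing_zip"] == orders["ship_to_zip"]
).astype(int)

print(orders[["age", "hour_of_day", "billing_shipping_zip_equal"]])
```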
{% endtab %}

{% tab title="3. H2O" %}
{% hint style="info" %}

#### H2O

So, what does the data scientist do at this point?

Typically, they will want to get a feel for the data by examining simple summary statistics and visualizations, followed by applying quick techniques for assessing the relationship between individual attributes (fields) and the target of interest which, in this example, is the `reported_as_fraud_historic` field.

Following that, if there are attributes that look promising, quick tests with common supervised classification algorithms will be next on the list. This comprises the initial stages of experimental data mining – that is, the process of determining which predictive techniques are going to give the best result for a given problem.
{% endhint %}

{% tabs %}
{% tab title="1. Dataset" %}
**Create the H2O dataset**

1. Run the transformation: `autoML.ktr`
2. Preview the results: `data/H2O.csv`

<figure><img src="https://3680356391-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2FZpCSy6Skj215f4oWypdc%2Fuploads%2FkYycKbl2pnisSV4okiIy%2Fimage.png?alt=media&#x26;token=d62f831e-d660-4db2-a3dd-ce80e89e06fb" alt=""><figcaption><p>H2O.csv</p></figcaption></figure>

{% hint style="info" %}
This is the dataset that will be used for AutoML in Colab.
{% endhint %}
{% endtab %}

{% tab title="2. H2O" %}
{% hint style="info" %}
H2O AutoML trains and ranks multiple models for you. It also builds ensembles and returns a leaderboard, so you can pick a strong baseline quickly.
{% endhint %}

{% embed url="<https://www.loom.com/share/4a72fd0bf16446a7b7c742474ad242c0?hideEmbedTopBar=true&hide_owner=true&hide_share=true&hide_title=true>" %}
Colab + H2O AutoML (video)
{% endembed %}

{% stepper %}
{% step %}
**Open Colab**

Sign in to Colab:

{% embed url="<https://colab.research.google.com/>" %}
Google Colab
{% endembed %}
{% endstep %}

{% step %}
**Connect a runtime**

Connect to a hosted runtime.

<figure><img src="https://3680356391-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2FZpCSy6Skj215f4oWypdc%2Fuploads%2Fgit-blob-2824041ef229a7809a749bd62547ab0dac98728f%2Fimage.png?alt=media" alt=""><figcaption><p>Colab hosted runtime</p></figcaption></figure>
{% endstep %}

{% step %}
**Upload the notebook**

In Colab, select **File > Upload notebook**.

Upload:

`~/Workshop--Data-Integration/Labs/Module 7 - Use Cases/Machine Learning/AutoML/data/credit_card_fraud.ipynb`

<figure><img src="https://3680356391-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2FZpCSy6Skj215f4oWypdc%2Fuploads%2FiodJBuzlAAKeHq2PNUcm%2Fimage.png?alt=media&#x26;token=476d0ac4-661e-4700-8915-4b8e773dcc2f" alt=""><figcaption><p>Upload notebook</p></figcaption></figure>
{% endstep %}

{% step %}
**Upload the dataset**

When the notebook prompts for a file upload, upload `H2O.csv`. Create it by running `autoML.ktr` in PDI.
{% endstep %}
{% endstepper %}

***

**AutoML Script**

These are the code sections for the Jupyter file: `credit_card_fraud.ipynb`:

1. Install the H2O library:

```python
# Install H2O
!pip install h2o -q
print("Installation complete.")
```

{% hint style="info" %}

* Installs H2O.ai's machine learning platform
* `-q` flag = quiet mode (minimal output)
  {% endhint %}

2. Import libraries:

```python
# Import libraries
import h2o
from h2o.automl import H2OAutoML
import pandas as pd
import numpy as np
from datetime import datetime
import matplotlib.pyplot as plt
import seaborn as sns

# Initialize H2O
h2o.init(max_mem_size='4G', nthreads=-1)
print("\nH2O initialized.")
     
```

{% hint style="info" %}

* `max_mem_size='4G'`: Allocates 4GB RAM to H2O cluster
* `nthreads=-1`: Uses all available CPU cores
  {% endhint %}

3. Upload data:

```py
# Upload data
from google.colab import files
uploaded = files.upload()
```

{% hint style="info" %}
**Purpose**: Interactive file upload widget in Google Colab

**Expected file**: `H2O.csv` (semicolon-separated)
{% endhint %}

<figure><img src="https://3680356391-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2FZpCSy6Skj215f4oWypdc%2Fuploads%2FCfc7uRL3xVU86F4RiMWT%2Fimage.png?alt=media&#x26;token=9bc9194f-3c65-4c49-b6fa-0c9007bf17fd" alt=""><figcaption></figcaption></figure>

4. Load data:

```py
# Load data
data = pd.read_csv('H2O.csv', sep=';', engine='python', header=None)

data.columns = [
    'first_time_customer', 'order_dollar_amount', 'num_items', 'age',
    'web_order', 'total_transactions', 'hour_of_day',
    'billing_shipping_match', 'fraud'
]

# Convert target to string
data['fraud'] = data['fraud'].astype(str)

print(f"Dataset: {data.shape[0]:,} rows × {data.shape[1]} columns")
print(f"Fraud cases: {(data['fraud']=='1').sum():,} ({(data['fraud']=='1').mean()*100:.2f}%)")
```

{% hint style="info" %}
**Critical steps**:

1. **Custom separator**: Uses `;` instead of default `,`
2. **Manual column naming**: Original file has no headers
3. **Target conversion**: Converts `fraud` to string for H2O classification

**Feature breakdown**:

* `first_time_customer`: Binary indicator (risky for fraud)
* `order_dollar_amount`: Transaction value
* `num_items`: Cart size
* `age`: Customer age
* `web_order`: Online vs in-store
* `total_transactions`: Customer history
* `hour_of_day`: Timing patterns (fraud often occurs at odd hours)
* `billing_shipping_match`: Address mismatch (major fraud signal)
* `fraud`: Target variable (0=legitimate, 1=fraud)
  {% endhint %}

{% hint style="warning" %}
Your upstream transformation may use a different target field name (for example, `reported_as_fraud_historic`). In this notebook, the target column is named `fraud`.
{% endhint %}

5. Distribution check:

```py
# Quick class distribution check
fraud_dist = data['fraud'].value_counts()

plt.figure(figsize=(10, 4))
fraud_dist.plot(kind='bar', color=['green', 'red'])
plt.title('Class Distribution', fontweight='bold')
plt.xlabel('Class (0=Normal, 1=Fraud)')
plt.ylabel('Count')
plt.xticks([0, 1], ['Normal', 'Fraud'], rotation=0)
for i, v in enumerate(fraud_dist):
    plt.text(i, v + 50, f'{v:,}', ha='center', fontweight='bold')
plt.tight_layout()
plt.show()
```

{% hint style="info" %}
**Purpose**: Reveals class imbalance

**Why this matters**:

* Fraud datasets are typically **highly imbalanced** (99% legitimate, 1% fraud)
* This visualization confirms the need for `balance_classes=True` later
* Shows if we have enough fraud cases to train effectively
  {% endhint %}

6. Convert to H2O frame:

```py
# Convert to H2O frame
hf = h2o.H2OFrame(data)
hf['fraud'] = hf['fraud'].asfactor()

# Split: 75% train, 25% test
train, test = hf.split_frame(ratios=[0.75], seed=42)

print(f"Training: {train.nrows:,} rows")
print(f"Test:     {test.nrows:,} rows")

# Define features and target
x = train.columns
x.remove('fraud')
y = 'fraud'
```

{% hint style="info" %}
**Key operations**:

1. **H2OFrame conversion**: Moves data into H2O's distributed framework
2. **Factor conversion**: Tells H2O this is classification (not regression)
3. **75/25 split**: Standard training/testing division
4. **Seed=42**: Ensures reproducible results - controls randomness
5. **Feature/target separation**: Prepares for model training

**Why this split?**:

* 75% gives enough data to learn patterns
* 25% provides robust performance estimates
* Random splitting prevents temporal bias
  {% endhint %}

7. Run models (top 5):

```py
# Fast configuration
start_time = datetime.now()

print("\nFAST H2O AutoML")
print("="*70)
print("Configuration:")
print("  Max models: 5 (for speed)")
print("  Max runtime: 15 minutes")
print("  Algorithms: GBM, XGBoost, GLM, DRF, DeepLearning")
print("  Balance classes: True")
print("\nTraining...\n")

aml = H2OAutoML(
    max_models=5,                   # Only 5 models for speed
    max_runtime_secs=900,           # 15 minutes max
    seed=42,
    balance_classes=True,           # Handle imbalanced data
    nfolds=5,                       # 5-fold CV
    sort_metric='AUC',

    # Best algorithms for fraud detection
    include_algos=[
        'GBM',           # Gradient Boosting (usually best)
        'XGBoost',       # Extreme Gradient Boosting
        'GLM',           # Generalized Linear Model (fast)
        'DRF',           # Distributed Random Forest (tree-based)
        'DeepLearning'   # Neural Network
    ],

    verbosity='info'
)

# Train
aml.train(x=x, y=y, training_frame=train)

duration = (datetime.now() - start_time).total_seconds()
print(f"\nTraining completed in {duration/60:.1f} minutes.")
print("="*70)
```

**Configuration explained**:

* `max_models=5`: keeps runtime short.
* `max_runtime_secs=900`: hard stop at 15 minutes.
* `balance_classes=True`: important for imbalanced fraud data.
* `nfolds=5`: cross-validation for more stable metrics.
* `sort_metric='AUC'`: simple default ranking metric.

{% hint style="info" %}
**Algorithm selection rationale**:

1. **GBM (Gradient Boosting Machine)**
   * Usually #1 performer for fraud
   * Handles complex patterns
   * Native H2O implementation
2. **XGBoost**
   * Industry standard for tabular data
   * Fast training
   * Excellent with imbalanced data
3. **GLM (Generalized Linear Model)**
   * Fast baseline
   * Interpretable coefficients
   * Good for linear relationships
4. **DRF (Distributed Random Forest)**
   * Ensemble of decision trees
   * Handles non-linear patterns
   * Robust to outliers
5. **DeepLearning**
   * Neural network
   * Captures complex interactions
   * May find unexpected patterns

**Why no standalone Decision Trees?**\
Ensemble methods (GBM, XGBoost, DRF) combine hundreds of trees and typically outperform single trees.
{% endhint %}

8. Model Leaderboard:

```py
# View leaderboard
lb = aml.leaderboard
print("\n ALL MODELS RANKED BY PERFORMANCE")
print("="*70)
print(lb)
print("="*70)
print(f"\n Total models trained: {lb.nrows}")
print(f"Best model: {lb.as_data_frame()['model_id'][0]}")
```

{% hint style="info" %}
**Output includes**:

* Model IDs
* **AUC** (Area Under ROC Curve)
* **AUCPR** (Area Under Precision-Recall Curve)
* **Mean per-class error**
* **RMSE/MSE** (less relevant for classification)

**Best metric for fraud**: AUCPR (handles class imbalance better than AUC)
{% endhint %}
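If you want to rank candidates by AUCPR instead, you can pass `sort_metric='AUCPR'` to `H2OAutoML`, or re-sort the leaderboard afterwards. A small sketch, assuming the `aml` object from the previous step:

```python
# Pull the leaderboard into pandas and rank by AUCPR (better suited to imbalanced data).
lb_df = aml.leaderboard.as_data_frame()
print(lb_df.sort_values("aucpr", ascending=False).head())
```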

9. Get Best Model:

```python
# Get best model
best = aml.leader

print(f"\n BEST MODEL: {best.model_id}")
print(f"Algorithm: {best.algo}")

# Performance on test set
perf = best.model_performance(test)

print("\n TEST SET PERFORMANCE")
print("="*70)
print(f"AUC:       {perf.auc():.4f}")
print(f"Accuracy:  {perf.accuracy()[0][1]:.4f}")
print(f"Precision: {perf.precision()[0][1]:.4f}")
print(f"Recall:    {perf.recall()[0][1]:.4f}")
print(f"F1 Score:  {perf.F1()[0][1]:.4f}")
print("="*70)
```

{% hint style="info" %}
**Metrics explained for fraud detection**:

* **AUC (0.95+ is excellent)**: Overall discrimination ability
* **Accuracy**: Misleading with imbalanced data (can be high even while missing every fraud case)
* **Precision**: Of flagged transactions, what % are actually fraud?
* **Recall**: Of all fraud cases, what % did we catch? (**Most critical for fraud**)
* **F1 Score**: Balance between precision and recall

**Real-world tradeoff**:

* High recall = catch more fraud (but more false alarms)
* High precision = fewer false alarms (but miss more fraud)
  {% endhint %}
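To explore that tradeoff, you can inspect metrics at specific thresholds rather than only the F1-maximizing default. A small sketch, assuming `perf` from the previous step (the 0.2 threshold is just an example value):

```python
# Threshold that maximizes F1, then precision/recall at a lower, more aggressive threshold.
best_f1_threshold = perf.find_threshold_by_max_metric("f1")
print(f"Threshold maximizing F1: {best_f1_threshold:.3f}")

for t in (best_f1_threshold, 0.2):
    prec = perf.precision(thresholds=[t])[0][1]
    rec = perf.recall(thresholds=[t])[0][1]
    print(f"threshold={t:.2f}  precision={prec:.3f}  recall={rec:.3f}")
```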

10. Confusion Matrix:

```python
# Confusion matrix
cm = perf.confusion_matrix()
print("\n CONFUSION MATRIX")
print("="*70)
print(cm)
print("="*70)

# Extract values
cm_table = cm.table.as_data_frame()
try:
    tn = int(cm_table.iloc[0, 1])
    fp = int(cm_table.iloc[0, 2])
    fn = int(cm_table.iloc[1, 1])
    tp = int(cm_table.iloc[1, 2])

    print("\n BREAKDOWN")
    print(f"True Negatives:  {tn:>6,}")
    print(f"False Positives: {fp:>6,}")
    print(f"False Negatives: {fn:>6,} (missed fraud)")
    print(f"True Positives:  {tp:>6,}")
    print(f"\nCaught {tp} fraud cases, missed {fn}.")
except Exception:
    print("See confusion matrix table above")
```

{% hint style="info" %}
**Business impact**:

* **FN (False Negatives)**: Direct financial loss from undetected fraud
* **FP (False Positives)**: Customer friction from declined legitimate transactions
* **Optimal balance**: Depends on fraud costs vs customer experience impact
  {% endhint %}
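One way to make that balance concrete is a simple expected-cost estimate. The counts and unit costs below are placeholder assumptions, not values from this dataset; substitute the `fn`/`fp` counts from the confusion matrix and your own business figures:

```python
# Placeholder counts: replace with the fn/fp values extracted above.
fn, fp = 12, 180

# Hypothetical unit costs.
avg_loss_per_missed_fraud = 250.0   # cost of each undetected fraud (false negative)
cost_per_false_alarm = 5.0          # review/friction cost of each false positive

total_cost = fn * avg_loss_per_missed_fraud + fp * cost_per_false_alarm
print(f"Estimated cost at this threshold: ${total_cost:,.2f}")
```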

11. Feature Importance:

```python
# Top features
varimp = best.varimp(use_pandas=True)
print("\n TOP FEATURES")
print("="*70)
print(varimp.head(10))
print("="*70)

# Plot
best.varimp_plot()
```

{% hint style="info" %}
**Why this matters**:

1. **Explainability**: Understand what drives fraud predictions
2. **Feature engineering**: Focus efforts on most impactful features
3. **Regulatory compliance**: Many industries require model interpretability
4. **Data quality**: Verify that sensible features are important

**Expected top features**:

* `billing_shipping_match` (address verification)
* `order_dollar_amount` (unusually large transactions)
* `total_transactions` (customer history)
  {% endhint %}

12. Save Model:

```python
model_path = h2o.save_model(model=best, path="./", force=True)
files.download(model_path)
```

{% hint style="info" %}
**What this saves**: an H2O model artifact you can load back into H2O.

**Deployment options**:

* Load in production H2O cluster
* Export a MOJO for low-latency scoring (separate step)
* Export to PMML if your workflow requires it (separate step)
  {% endhint %}
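If you want a low-latency scoring artifact, most H2O algorithms can also export the leader as a MOJO. A short sketch, assuming you are still in the Colab notebook where `best` and `files` were defined:

```python
# Export the leader model as a MOJO for embedding in JVM-based scoring pipelines.
mojo_path = best.download_mojo(path="./")
print(f"MOJO written to: {mojo_path}")
files.download(mojo_path)
```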

13. Summary Report:

```python
# Generate summary
report = f"""
FRAUD DETECTION - QUICK SUMMARY
{'='*60}

Date: {datetime.now().strftime('%Y-%m-%d %H:%M:%S')}

Dataset:
  Total:     {len(data):,}
  Training:  {train.nrows:,}
  Test:      {test.nrows:,}

Model:
  Best:      {best.model_id}
  Algorithm: {best.algo}
  Training:  {duration/60:.1f} minutes

Performance:
  AUC:       {perf.auc():.4f}
  Accuracy:  {perf.accuracy()[0][1]:.4f}
  Precision: {perf.precision()[0][1]:.4f}
  Recall:    {perf.recall()[0][1]:.4f}
  F1:        {perf.F1()[0][1]:.4f}

Model saved: {model_path}

{'='*60}
"""

print(report)

# Save report
with open('fraud_summary.txt', 'w') as f:
    f.write(report)

files.download('fraud_summary.txt')
print("Summary downloaded.")
```

{% hint style="info" %}
**Purpose**: quick documentation for:

* Model tracking
* Stakeholder reporting
* Experiment comparison
* Regulatory audit trails
  {% endhint %}

{% hint style="info" %}
After you validate the notebook, you can move the logic into PDI. Use the AutoML results to decide which models to operationalize.
{% endhint %}
{% endtab %}

{% tab title="3. Python Executor" %}
{% hint style="info" %}

#### Python Executor

The Python Executor step lets you run Python code as part of a Pentaho Data Integration (PDI) transformation.

This step is designed to help developers and data scientists focus on Python-based analytics and algorithms while using PDI for common ETL work such as connecting to sources, joining, and filtering.

You can run Python with either:

* Row-by-row processing: PDI maps each incoming row to Python variables and runs the script once per row.
* All-rows processing: PDI transfers the full dataset at once (for example, into a pandas DataFrame) and runs the script.

This step supports the CPython runtime only.
{% endhint %}

To configure the step:

1. Click the **Input** tab.
2. Map PDI fields into Python variables.
3. Select **All rows** to process the full dataset at once (for example, as a pandas DataFrame or a list of dictionaries).

<table><thead><tr><th width="194">Option</th><th>Description</th></tr></thead><tbody><tr><td>Available variables</td><td>Use the plus sign button to add a Python variable to the input mapping for the script used in the transformation. You can remove the Python variable by clicking the X icon.</td></tr><tr><td>Variable name</td><td>Enter the name of the Python variable. The list of available variables will update automatically.</td></tr><tr><td>Step</td><td>Specify the name of the input step to map from. It can be any step in the parent transformation with an outgoing hop connected to the Python Executor step.</td></tr><tr><td>Data structure</td><td><p>Specify the data structure from which you want to pull the fields for mapping. You can select one of the following:</p><p>· Pandas data frame: the tabular data structure for Python/Pandas.</p><p>· NumPy array: the table of values, all the same type, which is indexed by a tuple of positive integers.</p><p>· Python List of Dictionaries: each row in the PDI stream becomes a Python dictionary. All the dictionaries are put into a Python list.</p></td></tr></tbody></table>

4. In **Mapping**, configure these fields:

<table><thead><tr><th width="201">Field Property</th><th>Description</th></tr></thead><tbody><tr><td>Data structure field</td><td>The value of the Python data structure field to which you want to map the PDI field.</td></tr><tr><td>Data structure type</td><td>The value of the data structure type assigned to the data structure field to which you want to map the PDI field.</td></tr><tr><td>PDI field</td><td>The name of the PDI field which contains the vector data stored in the mapped Python variable.</td></tr><tr><td>PDI data type</td><td>The value of the data type assigned to the PDI field, such as a date, a number, or a timestamp.</td></tr></tbody></table>
{% endtab %}

{% tab title="4. GBM" %}
{% hint style="info" %}

#### Gradient Boosting Machine

Gradient Boosting Machine (GBM) is an ensemble learning algorithm that builds a series of weak decision trees sequentially, where each new tree focuses on correcting the errors made by the previous ones. The final prediction is a weighted combination of all these trees, resulting in a highly accurate model.

In fraud detection, GBM is particularly effective because it handles the inherent class imbalance (fraud being rare compared to legitimate transactions) and can capture complex, non-linear patterns in transaction data - such as unusual spending amounts, atypical merchant categories, or irregular timing - that simple rule-based systems would miss.

It also provides feature importance rankings, allowing analysts to understand which variables (e.g., transaction velocity, geolocation mismatches, or device fingerprints) are the strongest indicators of fraudulent behavior, making the model both powerful and interpretable for investigation teams.
{% endhint %}

<figure><img src="https://3680356391-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2FZpCSy6Skj215f4oWypdc%2Fuploads%2FzOYf7ZIT32BZkczp4KcF%2Fimage.png?alt=media&#x26;token=bcbd4235-e1d0-4773-886e-83612d233a85" alt=""><figcaption><p>Gradient Boosting</p></figcaption></figure>

{% hint style="info" %}
Gradient Boosting Machine works through an iterative, additive process:

**Step 1: Start with a baseline.** The algorithm begins with a simple initial prediction, often just the average of the target variable (e.g., "fraud" or "not fraud").

**Step 2: Calculate the errors (residuals).** It measures how far off that initial prediction is from the actual values. These errors are called *residuals*.

**Step 3: Fit a weak learner to the residuals.** A small, shallow decision tree is trained - not on the original target, but on the *errors* from the previous step. The goal is to learn the pattern in what the model got wrong.

**Step 4: Update the prediction.** The output of this new tree is added to the existing prediction, scaled down by a *learning rate* (a small number like 0.1) to prevent overcorrecting. So the updated model becomes:

<mark style="color:green;">**New Prediction = Old Prediction + (learning rate × New Tree's output)**</mark>

**Step 5: Repeat.** Steps 2–4 are repeated for hundreds or thousands of iterations. Each new tree corrects a little more of the remaining error.

The "gradient" in the name comes from the fact that the algorithm uses *gradient descent* - the same optimization technique used in neural networks - to minimize a loss function (e.g., log-loss for classification). The residuals it fits at each step are actually the negative gradients of that loss function.

Over many rounds, these individually weak trees combine into a very strong predictive model. In fraud detection, this means early trees might learn broad patterns (like "very large transactions are riskier"), while later trees pick up subtle, nuanced signals (like "this card was used in two countries within an hour") that distinguish genuine fraud from normal behavior.
{% endhint %}
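To make the additive update concrete, here is a tiny from-scratch sketch of steps 1-5 using squared-error residuals and shallow scikit-learn trees. It is illustrative only and is not H2O's GBM implementation:

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Toy regression data.
rng = np.random.default_rng(42)
X = rng.uniform(0, 10, size=(200, 1))
y = np.sin(X[:, 0]) + rng.normal(0, 0.1, size=200)

learning_rate = 0.1
prediction = np.full_like(y, y.mean())      # Step 1: baseline = mean of the target
trees = []

for _ in range(100):                        # Step 5: repeat
    residuals = y - prediction              # Step 2: errors of the current model
    tree = DecisionTreeRegressor(max_depth=2).fit(X, residuals)  # Step 3: weak learner on residuals
    prediction += learning_rate * tree.predict(X)                # Step 4: scaled additive update
    trees.append(tree)

print(f"Mean absolute error after boosting: {np.mean(np.abs(y - prediction)):.4f}")
```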

***

**Install GBM**

1. Run the following command to install the GBM package for R:

```bash
# Install gbm from R
sudo R -e "install.packages('gbm', repos='https://cran.r-project.org')"
```

<figure><img src="https://3680356391-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2FZpCSy6Skj215f4oWypdc%2Fuploads%2FpuTNdETKSr6S5To8bbmU%2Fimage.png?alt=media&#x26;token=b202c732-171e-4a68-b34a-5ea5fb16d52b" alt=""><figcaption><p>Install GBM</p></figcaption></figure>

2. Verify the installation:

```bash
R -e "library(gbm); cat('GBM loaded successfully\n')"
```

{% endtab %}
{% endtabs %}
{% endtab %}
{% endtabs %}

<details>

<summary>Metrics cheat sheet (fraud detection)</summary>

Use these metrics to compare models and pick a decision threshold. Fraud data is usually highly imbalanced.

#### What to look at first

* **AUCPR**: best single-number metric for imbalanced classification.
* **Recall**: how much fraud you catch (minimize false negatives).
* **Precision**: how many alerts are real fraud (minimize false positives).

#### Use AUC as a sanity check

**AUC** measures how well the model ranks fraud above legitimate transactions. It can look “good” even when precision is poor at useful thresholds.

#### Why accuracy is a trap

If fraud is 1% of transactions, a model can be 99% accurate and useless. It can do that by predicting “legitimate” for every row.
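A tiny sketch with synthetic numbers (1% fraud) makes the trap obvious:

```python
import numpy as np

y_true = np.array([1] * 10 + [0] * 990)   # 1% fraud
y_pred = np.zeros_like(y_true)            # model that always predicts "legitimate"

accuracy = (y_pred == y_true).mean()
recall = (y_pred[y_true == 1] == 1).mean()
print(f"Accuracy: {accuracy:.1%}  Recall: {recall:.1%}")   # 99.0% accuracy, 0% recall
```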

#### Practical selection guide

* If fraud losses are expensive, prioritize **recall**.
* If customer friction is expensive, prioritize **precision**.
* Use **AUCPR** to compare candidates before tuning thresholds.

#### What to ignore (most of the time)

* **RMSE/MSE**: regression-style errors on probabilities. Not decision-friendly here.

</details>
