# ============================================================
# GBM Model Training - Credit Card Fraud Detection
# ============================================================
# Trains a Gradient Boosting Machine (GBM) to flag fraudulent
# credit card transactions. Runs inside the PDI R Script
# Executor step.
# ============================================================

# gbm provides gradient-boosted tree models
library(gbm)

# "train" is injected automatically by the R Script Executor
# from the input step sv-convert_booleans_to-numbers is NOT the
# name; the input step is: sv-convert_booleans_to_numbers.
# Coerce it to a plain R data frame before modeling.
train.df <- as.data.frame(train)

# Recode the target to binary 0/1: as.factor() assigns levels,
# as.numeric() maps them to 1 and 2, and subtracting 1 yields
# 0 = not fraud, 1 = fraud.
fraud_levels <- as.factor(train.df$reported_as_fraud_historic)
train.df$reported_as_fraud_historic <- as.numeric(fraud_levels) - 1
# ------------------------------------------------------------
# Fit the GBM model
# ------------------------------------------------------------
# OOB (out-of-bag) estimation is used instead of cross-
# validation (cv.folds): JRI embeds R inside the JVM process,
# and cross-validation forks extra processes that exceed the
# FD_SETSIZE limit (1024) of Linux's select() system call,
# which aborts the JVM.
# ------------------------------------------------------------
gbm_model <- gbm(
  formula = reported_as_fraud_historic ~ ., # all other columns as predictors
  distribution = "bernoulli",               # binary target: fraud yes/no
  data = train.df,                          # training data
  shrinkage = 0.01,                         # learning rate; smaller = more robust
  interaction.depth = 4,                    # per-tree depth limit
  n.minobsinnode = 10,                      # minimum observations in a leaf
  n.trees = 500,                            # boosting iterations
  bag.fraction = 0.5                        # 50% subsample per tree (stochastic GBM)
)
# Determine the optimal number of trees using OOB error.
# This avoids overfitting by finding where performance plateaus.
best_trees <- gbm.perf(gbm_model, method = "OOB")

# Save the trained model and optimal tree count to disk.
# The predict transformation will load this file to score new transactions.
save(gbm_model, best_trees,
  file = "/home/pentaho/Workshop--Data-Integration/Labs/Module 7 - Use Cases/Machine Learning/Credit Card Fraud/solution/train_model_output/gbm_fraud.rdata"
)

# Return a status message to PDI.
# The R Script Executor expects a data frame as output; the column
# name "ok" must match the step's configured output field.
ok <- "Finished"
ok.df <- as.data.frame(ok)
ok.df
# NOTE(review): removed a stray ")" here that had no matching "(" —
# a leftover from a tryCatch-based variant of this script — which
# made the file unparseable.
# ============================================================
# NOTE(review): this section was an orphaned fragment of a
# tryCatch-based variant of the training script: the trailing
# "}, error = function(e) { ... })" had no matching
# "tryCatch({" opener, so the file could not be parsed.
# Reconstructed below as a complete, self-contained tryCatch
# block that preserves the fragment's evident intent.
# ============================================================
ok <- tryCatch({
  # Determine the optimal number of trees using Out-of-Bag (OOB) estimation.
  # gbm.perf() analyzes the OOB improvement curve and returns the iteration
  # (tree count) where the OOB error is minimized. More trees would overfit;
  # fewer would underfit. This value is used during scoring to predict with
  # only the best-performing subset of trees.
  best_trees <- gbm.perf(gbm_model, method = "OOB")

  # Save the trained model object and the optimal tree count to an .rdata
  # file. This file is loaded later by a separate PDI scoring transformation
  # to apply the model to new/unseen transactions.
  save(
    gbm_model,
    best_trees,
    file = "/home/pentaho/Workshop--Data-Integration/Labs/Module 7 - Use Cases/Machine Learning/Credit Card Fraud/solution/train_model_output/gbm_fraud.rdata"
  )

  # If we reach this point, training completed successfully; this string
  # becomes the value of 'ok'.
  "Finished"
}, error = function(e) {
  # Error handler: if ANY step above throws, return the error message as a
  # string so the script still produces the 'ok' output column PDI expects,
  # while preserving the error details for troubleshooting in the PDI log.
  paste("ERROR:", e$message)
})

# Single-column data frame with the status result.
# The R Script Executor step expects a data frame as output, and the column
# name 'ok' must match the step's configured output field.
# - "Finished"      = training completed successfully
# - "ERROR: <msg>"  = training failed; the message holds the root cause
# Downstream PDI steps (e.g., Filter or Switch/Case) can route on this value.
ok.df <- as.data.frame(ok)
ok.df
# ============================================================
# GBM Prediction - Credit Card Fraud Detection
# ============================================================
# Loads the trained GBM model and assigns each incoming
# transaction a fraud probability between 0 and 1. Runs
# inside the PDI R Script Executor step.
# ============================================================

# gbm is needed so predict() dispatches to the gbm method
library(gbm)

# "test" is injected automatically by the R Script Executor
# from the input step sv-convert_booleans_to_numbers; coerce
# it to a plain R data frame before scoring.
test.df <- as.data.frame(test)

# Restore the training artifact. Loading brings two objects
# into the environment: gbm_model (the fitted model) and
# best_trees (the optimal tree count from OOB estimation).
model_path <- "/home/pentaho/Workshop--Data-Integration/Labs/Module 7 - Use Cases/Machine Learning/Credit Card Fraud/solution/train_model_output/gbm_fraud.rdata"
load(file = model_path)
# ============================================================
# Score each transaction with a fraud probability
# ============================================================
# type = "response" yields probabilities on the 0..1 scale
# (the model was trained with distribution = "bernoulli");
# values near 1 indicate a higher likelihood of fraud.
fraud_prob <- predict(
  gbm_model,
  newdata = test.df,
  n.trees = best_trees,
  type = "response"
)

# ============================================================
# Assemble the output data frame
# ============================================================
# fraud_probability : raw model probability (0 to 1)
# fraud_pct         : percentage form for readability
# predicted_fraud   : 0/1 flag at a 50% decision threshold;
#                     tune per business rules (e.g. 0.3 for
#                     more aggressive fraud catching)
decision_threshold <- 0.5
pred.df <- data.frame(
  fraud_probability = fraud_prob,
  fraud_pct = round(100 * fraud_prob, 2),
  predicted_fraud = as.numeric(fraud_prob >= decision_threshold)
)

# Keep every original input field alongside the predictions so
# downstream PDI steps can filter, sort, or write results with
# full context.
submission <- cbind(test.df, pred.df)

# Return the combined data frame to PDI
submission