AutoML
Use PDI + H2O AutoML in Colab to prototype a credit-card fraud model.
Imagine that a direct retailer wants to reduce losses due to orders involving fraudulent use of credit cards. They accept orders via phone and their website, and ship goods directly to the customer.
Basic customer details, such as customer name, date of birth, billing address and preferred shipping address, are stored in a relational database.
Orders, as they come in, are stored in a database. There is also a report of historical instances of fraud contained in a CSV spreadsheet.
Before you start
You need Colab access and a working PDI environment. Do the setup first if you have not done it yet.
Complete Prerequisite tasks.
In PDI, you will run autoML.ktr.
Location:
~/Workshop--Data-Integration/Labs/Module 7 - Machine Learning/AutoML
From that transformation, you will create data/H2O.csv. Upload that file to Colab when prompted.
In this workshop, you will:
Prepare data (wrangling).
Create features (feature engineering).
Use H2O AutoML to shortlist candidate models.
Train and evaluate a model in Colab.
Save the best model artifact.

Run through the following steps to determine the best ML model for the dataset:
Data preparation
Use PDI to join customer, transaction, and historical fraud data. Create a single training dataset for AutoML.
Start PDI
Open the transformation:
autoML.ktr
~/Workshop--Data-Integration/Labs/Module 7 - Machine Learning/AutoML

Browse the various customer data sources:
Feature engineering
Feature engineering creates new fields that improve model signal. Here, you derive fields like age, order time-of-day, and zip-code match.
Example derived field:
billing_shipping_zip_equal = [customer_billing_zip] = [ship_to_zip]

There are steps for deriving additional fields that might be useful for predictive modeling. These include computing the customer's age, extracting the hour of the day the order was placed, and setting a flag to indicate whether the shipping and billing addresses have the same zip code.
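The derived fields described above can be sketched in pandas. This is an illustration, not the PDI implementation; the input column names (birth_date, order_timestamp) are assumptions standing in for the fields in the PDI stream.

```python
import pandas as pd

# Hypothetical sample of the joined order data
orders = pd.DataFrame({
    "birth_date": pd.to_datetime(["1980-06-15", "2000-01-20"]),
    "order_timestamp": pd.to_datetime(["2024-03-01 02:30", "2024-03-01 14:05"]),
    "customer_billing_zip": ["30301", "94105"],
    "ship_to_zip": ["30301", "10001"],
})

now = pd.Timestamp("2024-03-01")

# Customer age in whole years (approximate, ignoring leap-day edge cases)
orders["age"] = (now - orders["birth_date"]).dt.days // 365

# Hour of day the order was placed (fraud often clusters at odd hours)
orders["hour_of_day"] = orders["order_timestamp"].dt.hour

# Flag: do the billing and shipping zip codes match?
orders["billing_shipping_zip_equal"] = (
    orders["customer_billing_zip"] == orders["ship_to_zip"]
)

print(orders[["age", "hour_of_day", "billing_shipping_zip_equal"]])
```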
H2O
So, what does the data scientist do at this point?
Typically, they will want to get a feel for the data by examining simple summary statistics and visualizations. Next come quick techniques for assessing the relationship between individual attributes (fields) and the target of interest, which in this example is the reported_as_fraud_historic field.
Following that, if there are attributes that look promising, quick tests with common supervised classification algorithms will be next on the list. This comprises the initial stages of experimental data mining – that is, the process of determining which predictive techniques are going to give the best result for a given problem.
Create the H2O dataset
Run the transformation: autoML.ktr
Preview the results:
data/H2O.csv

This will be the dataset used for AutoML in Colab.
H2O AutoML trains and ranks multiple models for you. It also builds ensembles and returns a leaderboard, so you can pick a strong baseline quickly.
Connect a runtime
Connect to a hosted runtime.

Upload the notebook
In Colab, select File > Upload notebook.
Upload:
~/Workshop--Data-Integration/Labs/Module 7 - Use Cases/Machine Learning/AutoML/data/credit_card_fraud.ipynb

Upload the dataset
When the notebook prompts for a file upload, upload H2O.csv. Create it by running autoML.ktr in PDI.
AutoML Script
These are the code sections for the Jupyter file: credit_card_fraud.ipynb:
Install the h2o libraries:
Installs H2O.ai's machine learning platform
The -q flag enables quiet mode (minimal output).
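The install cell is likely a one-liner along these lines (a sketch; inside a notebook it is prefixed with !):

```shell
# Install H2O into the Colab runtime; -q keeps pip output minimal
pip install -q h2o
```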
Import libraries:
max_mem_size='4G': allocates 4 GB of RAM to the H2O cluster.
nthreads=-1: uses all available CPU cores.
Upload data:
Purpose: Interactive file upload widget in Google Colab
Expected file: H2O.csv (semicolon-separated)
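The upload cell likely resembles this sketch; the fallback branch is an addition so the snippet also runs outside Colab:

```python
try:
    from google.colab import files   # available only inside Google Colab
    uploaded = files.upload()        # interactive widget; select H2O.csv
except ImportError:
    uploaded = {}                    # outside Colab: read H2O.csv from disk instead
```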

Load data:
Critical steps:
Custom separator: uses ; instead of the default ,.
Manual column naming: the original file has no headers.
Target conversion: converts fraud to a string for H2O classification.
Feature breakdown:
first_time_customer: binary indicator (risky for fraud)
order_dollar_amount: transaction value
num_items: cart size
age: customer age
web_order: online vs in-store
total_transactions: customer history
hour_of_day: timing patterns (fraud often occurs at odd hours)
billing_shipping_match: address mismatch (a major fraud signal)
fraud: target variable (0 = legitimate, 1 = fraud)
Your upstream transformation may use a different target field name (for example, reported_as_fraud_historic). In this notebook, the target column is named fraud.
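A sketch of the loading step, with a two-row inline sample standing in for the uploaded H2O.csv (the real notebook reads the uploaded file instead). The column list follows the feature breakdown above.

```python
import io
import pandas as pd

# Inline stand-in for H2O.csv: semicolon-separated, no header row
sample = io.StringIO(
    "1;250.00;3;43;1;12;2;0;1\n"
    "0;49.99;1;24;1;87;14;1;0\n"
)
columns = ["first_time_customer", "order_dollar_amount", "num_items", "age",
           "web_order", "total_transactions", "hour_of_day",
           "billing_shipping_match", "fraud"]

# Custom separator and manual column naming, as described above
df = pd.read_csv(sample, sep=";", header=None, names=columns)

# String target so H2O treats this as classification, not regression
df["fraud"] = df["fraud"].astype(str)
print(df.shape)
```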
Distribution check:
Purpose: Reveals class imbalance
Why this matters:
Fraud datasets are typically highly imbalanced (99% legitimate, 1% fraud)
This visualization confirms the need for balance_classes=True later.
Shows whether we have enough fraud cases to train effectively.
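The distribution check amounts to counting target classes; a sketch on a synthetic target with a realistic ~1% fraud rate:

```python
import pandas as pd

# Synthetic stand-in: 990 legitimate rows, 10 fraud rows (~1% fraud)
target = pd.Series(["0"] * 990 + ["1"] * 10, name="fraud")

counts = target.value_counts()
print(counts)                        # class counts
print(counts / counts.sum() * 100)   # class percentages
```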
Convert to H2O frame:
Key operations:
H2OFrame conversion: Moves data into H2O's distributed framework
Factor conversion: Tells H2O this is classification (not regression)
75/25 split: Standard training/testing division
Seed=42: Ensures reproducible results - controls randomness
Feature/target separation: Prepares for model training
Why this split?:
75% gives enough data to learn patterns
25% provides robust performance estimates
Random splitting prevents temporal bias
Run Models (Top 5)
Configuration explained:
max_models=5: keeps runtime short.
max_runtime_secs=900: hard stop at 15 minutes.
balance_classes=True: important for imbalanced fraud data.
nfolds=5: cross-validation for more stable metrics.
sort_metric='AUC': simple default ranking metric.
Algorithm selection rationale:
GBM (Gradient Boosting Machine)
Usually #1 performer for fraud
Handles complex patterns
Native H2O implementation
XGBoost
Industry standard for tabular data
Fast training
Excellent with imbalanced data
GLM (Generalized Linear Model)
Fast baseline
Interpretable coefficients
Good for linear relationships
DRF (Distributed Random Forest)
Ensemble of decision trees
Handles non-linear patterns
Robust to outliers
DeepLearning
Neural network
Captures complex interactions
May find unexpected patterns
Why no standalone decision trees? Ensemble methods (GBM, XGBoost, DRF) combine hundreds of trees and almost always outperform a single tree.
Model Leaderboard:
Output includes:
Model IDs
AUC (Area Under ROC Curve)
AUCPR (Area Under Precision-Recall Curve)
Mean per-class error
RMSE/MSE (less relevant for classification)
Best metric for fraud: AUCPR (handles class imbalance better than AUC)
Get Best Model:
Metrics explained for fraud detection:
AUC (0.95+ is excellent): Overall discrimination ability
Accuracy: Misleading with imbalanced data (can be high even missing all fraud)
Precision: Of flagged transactions, what % are actually fraud?
Recall: Of all fraud cases, what % did we catch? (Most critical for fraud)
F1 Score: Balance between precision and recall
Real-world tradeoff:
High recall = catch more fraud (but more false alarms)
High precision = fewer false alarms (but miss more fraud)
Confusion Matrix:
Business impact:
FN (False Negatives): Direct financial loss from undetected fraud
FP (False Positives): Customer friction from declined legitimate transactions
Optimal balance: Depends on fraud costs vs customer experience impact
Feature Importance:
Why this matters:
Explainability: Understand what drives fraud predictions
Feature engineering: Focus efforts on most impactful features
Regulatory compliance: Many industries require model interpretability
Data quality: Verify that sensible features are important
Expected top features:
billing_shipping_match (address verification)
order_dollar_amount (unusually large transactions)
total_transactions (customer history)
Save Model:
What this saves: an H2O model artifact you can load back into H2O.
Deployment options:
Load in production H2O cluster
Export a MOJO for low-latency scoring (separate step)
Export to PMML if your workflow requires it (separate step)
Summary Report:
Purpose: quick documentation for:
Model tracking
Stakeholder reporting
Experiment comparison
Regulatory audit trails
After you validate the notebook, you can move the logic into PDI. Use the AutoML results to decide which models to operationalize.
Python Executor
The Python Executor step lets you run Python code as part of a Pentaho Data Integration (PDI) transformation.
This step is designed to help developers and data scientists focus on Python-based analytics and algorithms while using PDI for common ETL work such as connecting to sources, joining, and filtering.
You can run Python with either:
Row-by-row processing: PDI maps each incoming row to Python variables and runs the script once per row.
All-rows processing: PDI transfers the full dataset at once (for example, into a pandas DataFrame) and runs the script.
This step supports the CPython runtime only.
To configure inputs:
Click the Input tab.
Map PDI fields into Python variables.
Select All rows to process the full dataset at once (for example, as a pandas DataFrame or a list of dicts); otherwise the script runs once per row.
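In All rows mode, a script body might look like this sketch; the input variable name df and the output convention are assumptions standing in for whatever names you configure in the step's mappings:

```python
import pandas as pd

# Stand-in for the PDI input mapping: inside the step, df would arrive
# pre-populated as a pandas DataFrame holding all incoming rows.
df = pd.DataFrame({
    "order_dollar_amount": [250.0, 49.99, 1200.0],
    "billing_shipping_match": [0, 1, 0],
})

# Score a simple risk flag across the whole dataset at once
df["high_risk"] = (df["order_dollar_amount"] > 1000) & (df["billing_shipping_match"] == 0)

# The frame assigned to the step's output variable flows back into the PDI stream
output = df
```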
Available variables
Use the plus sign button to add a Python variable to the input mapping for the script used in the transformation. You can remove the Python variable by clicking the X icon.
Variable name
Enter the name of the Python variable. The list of available variables will update automatically.
Step
Specify the name of the input step to map from. It can be any step in the parent transformation with an outgoing hop connected to the Python Executor step.
Data structure
Specify the data structure from which you want to pull the fields for mapping. You can select one of the following:
· Pandas data frame: the tabular data structure for Python/Pandas.
· NumPy array: the table of values, all the same type, which is indexed by a tuple of positive integers.
· Python List of Dictionaries: each row in the PDI stream becomes a Python dictionary. All the dictionaries are put into a Python list.
In Mapping, configure these fields:
Data structure field
The value of the Python data structure field to which you want to map the PDI field.
Data structure type
The value of the data structure type assigned to the data structure field to which you want to map the PDI field.
PDI field
The name of the PDI field which contains the vector data stored in the mapped Python variable.
PDI data type
The value of the data type assigned to the PDI field, such as a date, a number, or a timestamp.
Gradient Boosting Machine
Gradient Boosting Machine (GBM) is an ensemble learning algorithm that builds a series of weak decision trees sequentially, where each new tree focuses on correcting the errors made by the previous ones. The final prediction is a weighted combination of all these trees, resulting in a highly accurate model.
In fraud detection, GBM is particularly effective because it handles the inherent class imbalance (fraud being rare compared to legitimate transactions) and can capture complex, non-linear patterns in transaction data - such as unusual spending amounts, atypical merchant categories, or irregular timing - that simple rule-based systems would miss.
It also provides feature importance rankings, allowing analysts to understand which variables (e.g., transaction velocity, geolocation mismatches, or device fingerprints) are the strongest indicators of fraudulent behavior, making the model both powerful and interpretable for investigation teams.

Gradient Boosting Machine works through an iterative, additive process:
Step 1: Start with a baseline. The algorithm begins with a simple initial prediction, often just the average of the target variable (e.g., "fraud" or "not fraud").
Step 2: Calculate the errors (residuals). It measures how far off that initial prediction is from the actual values. These errors are called residuals.
Step 3: Fit a weak learner to the residuals. A small, shallow decision tree is trained - not on the original target, but on the errors from the previous step. The goal is to learn the pattern in what the model got wrong.
Step 4: Update the prediction. The output of this new tree is added to the existing prediction, scaled down by a learning rate (a small number like 0.1) to prevent overcorrecting. So the updated model becomes:
New Prediction = Old Prediction + (learning rate × New Tree's output)
Step 5: Repeat. Steps 2–4 are repeated for hundreds or thousands of iterations. Each new tree corrects a little more of the remaining error.
The "gradient" in the name comes from the fact that the algorithm uses gradient descent - the same optimization technique used in neural networks - to minimize a loss function (e.g., log-loss for classification). The residuals it fits at each step are actually the negative gradients of that loss function.
Over many rounds, these individually weak trees combine into a very strong predictive model. In fraud detection, this means early trees might learn broad patterns (like "very large transactions are riskier"), while later trees pick up subtle, nuanced signals (like "this card was used in two countries within an hour") that distinguish genuine fraud from normal behavior.
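Steps 1-5 can be sketched from scratch with NumPy. This is a minimal illustration using squared-error residuals and one-feature decision stumps; real GBM implementations for classification fit the negative gradients of log-loss instead.

```python
import numpy as np

def fit_stump(x, residuals):
    """Find the threshold split on x that best predicts the residuals."""
    best = None
    for t in np.unique(x):
        left, right = residuals[x <= t], residuals[x > t]
        if len(left) == 0 or len(right) == 0:
            continue
        sse = ((left - left.mean()) ** 2).sum() + ((right - right.mean()) ** 2).sum()
        if best is None or sse < best[0]:
            best = (sse, t, left.mean(), right.mean())
    return best[1:]  # (threshold, left_value, right_value)

def predict_stump(stump, x):
    t, lv, rv = stump
    return np.where(x <= t, lv, rv)

# Toy data: larger orders are riskier
x = np.array([10.0, 20.0, 30.0, 500.0, 800.0, 900.0])
y = np.array([0.0, 0.0, 0.0, 1.0, 1.0, 1.0])

pred = np.full_like(y, y.mean())   # Step 1: baseline prediction
lr = 0.1                           # learning rate
for _ in range(100):               # Step 5: repeat
    residuals = y - pred                        # Step 2: current errors
    stump = fit_stump(x, residuals)             # Step 3: weak learner on residuals
    pred = pred + lr * predict_stump(stump, x)  # Step 4: scaled additive update

print(np.round(pred, 2))  # converges toward [0. 0. 0. 1. 1. 1.]
```

Each round shrinks the remaining error by a factor of (1 - learning rate), which is why many small steps beat one large one.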
Install GBM
Execute the following script to install GBM.

Verify.
Metrics cheat sheet (fraud detection)
Use these metrics to compare models and pick a decision threshold. Fraud data is usually highly imbalanced.
What to look at first
AUCPR: best single-number metric for imbalanced classification.
Recall: how much fraud you catch (minimize false negatives).
Precision: how many alerts are real fraud (minimize false positives).
Use AUC as a sanity check
AUC measures how well the model ranks fraud above legitimate transactions. It can look “good” even when precision is poor at useful thresholds.
Why accuracy is a trap
If fraud is 1% of transactions, a model can be 99% accurate and useless. It can do that by predicting “legitimate” for every row.
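The trap in numbers, on a synthetic 1%-fraud dataset:

```python
# 1,000 transactions, 1% fraud; a model that predicts "legitimate"
# for every row is 99% accurate yet catches zero fraud.
actual = [1] * 10 + [0] * 990   # 1 = fraud
predicted = [0] * 1000           # always predict "legitimate"

accuracy = sum(a == p for a, p in zip(actual, predicted)) / len(actual)
caught = sum(1 for a, p in zip(actual, predicted) if a == 1 and p == 1)
recall = caught / sum(actual)

print(f"accuracy={accuracy:.0%}, recall={recall:.0%}")  # accuracy=99%, recall=0%
```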
Practical selection guide
If fraud losses are expensive, prioritize recall.
If customer friction is expensive, prioritize precision.
Use AUCPR to compare candidates before tuning thresholds.
What to ignore (most of the time)
RMSE/MSE: regression-style errors on probabilities. Not decision-friendly here.