Prerequisite Tasks
Configure Colab & Data Integration for ML
You will need to complete the following prerequisites:
Install Python
Create a Google Colab account
Install R (RStudio is optional)
Configure Pentaho Data Integration with R
Google Colab
Colab is a Python development environment, based on Jupyter Notebooks, that runs in the browser using Google Cloud.
It provides a runtime, fully configured for deep learning libraries, such as Keras, TensorFlow, PyTorch, and OpenCV.
If you haven't already, sign up for a free account.

The following prerequisite steps configure your environment to run ML data pipelines in Pentaho Data Integration.
This section is for reference only.
The following tasks configure Pentaho Data Integration in a Linux environment.
Python
Make sure all installed packages are up to date.
Check to see if Python is installed.
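As a rough illustration of the check above, the following sketch looks for Python commands on the PATH and prints the version of the interpreter running it. The helper name find_interpreters and the command names python3 and python are assumptions made for this example, not commands from the original guide.

```python
import shutil
import sys


def find_interpreters(names=("python3", "python")):
    """Map each command name to its full path on PATH, or None if absent."""
    return {name: shutil.which(name) for name in names}


if __name__ == "__main__":
    for name, path in find_interpreters().items():
        print(f"{name}: {path or 'not found on PATH'}")
    # Version of the interpreter that is running this script.
    print("Running interpreter version:", sys.version.split()[0])
```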
Install the latest Python version
Only proceed to update your Python to the latest version if required.
Install dependencies.
Import key for PPA deadsnakes.
Add Repository.
Renew the cache, then find current Python version.
Install latest version.
Create symlink.
Different Python versions
If you need multiple versions of Python on your system, you can choose one of them as the default.
The default version of Python has been set to 3.10,
as required by Apache Airflow.
List the Python versions:
Set the Python version:
Then set the required version:
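Once the default has been switched, a quick way to confirm the result is to compare the interpreter's major and minor version against the 3.10 stated above. This is a minimal sketch; the function name matches_required is hypothetical, and the script only reports on whichever interpreter runs it.

```python
import sys

# Version named in the text above (3.10).
REQUIRED = (3, 10)


def matches_required(version_info=sys.version_info, required=REQUIRED):
    """Return True when the major.minor pair equals the required pair."""
    return tuple(version_info[:2]) == tuple(required)


if __name__ == "__main__":
    found = ".".join(str(part) for part in sys.version_info[:3])
    status = "OK" if matches_required() else "does not match 3.10"
    print(f"Python {found}: {status}")
```

Run it with the command users will actually type (for example the default python3) so that it reports on the default interpreter rather than a specific versioned binary.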
The following libraries need to be installed:
pandas
matplotlib
py4j
numpy
wheel
scikit-learn
TPOT
Ensure pip is installed.
Install ML libraries.
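After installation, a short check can confirm that each library in the list above is importable. The sketch below assumes the usual import names, which differ from the package names in two cases: scikit-learn is imported as sklearn and TPOT as tpot. The helper name missing_packages is made up for this example.

```python
from importlib.util import find_spec

# Package name as installed -> module name used in an import statement.
PACKAGES = {
    "pandas": "pandas",
    "matplotlib": "matplotlib",
    "py4j": "py4j",
    "numpy": "numpy",
    "wheel": "wheel",
    "scikit-learn": "sklearn",
    "TPOT": "tpot",
}


def missing_packages(packages=PACKAGES):
    """Return the package names whose module cannot be found."""
    return [
        package
        for package, module in packages.items()
        if find_spec(module) is None
    ]


if __name__ == "__main__":
    missing = missing_packages()
    print("Missing:", ", ".join(missing) if missing else "none")
```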
Install R from Ubuntu Repository
R is a language and environment for statistical computing and graphics.
Update APT packages.
Install the R base package and its dependencies.
Check version.
Type R and press Enter to verify that R has been installed.
Using the R command without sudo creates a personal library for your user. To install packages available to every user on the system, run the R command as root by typing sudo -i R.
Type q() to exit the R console.
Install missing dependency.
Reboot.
rJava
Check to see if Java is installed. If so, move on to step 4.
Install the Java Runtime Environment (JRE).
Install the Java Development Kit (JDK).
Update where R expects to find various Java files.
In an R terminal.
Check rJava has successfully installed.
randomForest
The random forest algorithm can be used to solve regression or classification problems. It is made up of a collection of decision trees, and each tree in the ensemble is trained on a data sample drawn from the training set with replacement, called the bootstrap sample.
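To make the idea of a bootstrap sample concrete, here is a minimal sketch of sampling with replacement using only the Python standard library. It illustrates the sampling step only and is not how the randomForest package is implemented; the function names and the toy training set are assumptions for this example.

```python
import random


def bootstrap_sample(rows, rng):
    """Draw len(rows) items from rows with replacement."""
    return [rng.choice(rows) for _ in range(len(rows))]


def out_of_bag(rows, sample):
    """Return the rows that were never drawn (rows must be hashable)."""
    drawn = set(sample)
    return [row for row in rows if row not in drawn]


if __name__ == "__main__":
    rng = random.Random(0)
    training_set = list(range(10))  # toy stand-in for training rows
    for tree in range(3):
        sample = bootstrap_sample(training_set, rng)
        unused = out_of_bag(training_set, sample)
        print(f"tree {tree}: sample={sample} unused={unused}")
```

Because rows are drawn with replacement, some appear more than once in a sample and others not at all, so each tree sees a slightly different view of the training set.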

In an R terminal.
Install randomForest package:
Type q() to quit the R console.
Click Yes to close the workspace image.
Close R.
RStudio
RStudio is an integrated development environment (IDE) comprising a set of tools built to help you be more productive with R and Python.
Visit the RStudio downloads page to grab the latest release.
Install Package.
Once installed, in a Terminal.

Set Environment Variables
You can find the path to set for each environment variable using R.
R_HOME
Path to the root directory of your R installation. Enter Sys.getenv("R_HOME") in the R console to get the path.
R_LIBS_USER
Path to the directory where R installs your packages.
Enter Sys.getenv("R_LIBS_USER") in the R console to get the path.
LD_LIBRARY_PATH
Used to load libraries, such as libjri.so.
PATH
Append the directory that contains the R executable to the PATH variable.
In an R terminal.
Edit the /etc/environment file.
Copy & paste the values.
Ensure the path to R/bin is added to PATH.
Save.
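Once the new values are loaded (typically after the next login or reboot), a small check can confirm that the four variables above are visible and that the R bin directory is on the PATH. This is a sketch under assumptions: the helper names are hypothetical, and it assumes the R executable sits in a bin directory under R_HOME, which may not hold on every installation.

```python
import os

# Variables described in the section above.
REQUIRED_VARS = ("R_HOME", "R_LIBS_USER", "LD_LIBRARY_PATH", "PATH")


def unset_variables(env, names=REQUIRED_VARS):
    """Return the names that are missing or empty in env."""
    return [name for name in names if not env.get(name)]


def on_path(env, directory):
    """Return True if directory is one of the PATH entries in env."""
    return directory in env.get("PATH", "").split(os.pathsep)


if __name__ == "__main__":
    print("Unset:", ", ".join(unset_variables(os.environ)) or "none")
    r_home = os.environ.get("R_HOME")
    if r_home:
        # Assumption: the R executable lives in <R_HOME>/bin.
        r_bin = os.path.join(r_home, "bin")
        print(f"{r_bin} on PATH:", on_path(os.environ, r_bin))
```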
libjri.so
In the rJava directory, there is a libjri.so file that needs to be copied into the libswt directory of Spoon.
Copy libjri.so to PDI ../libswt/linux.
Reboot.
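The copy step above could be scripted along the following lines. This is only a sketch: the function name copy_libjri is hypothetical, both directories must be supplied by you (the rJava package directory and the PDI ../libswt/linux directory), and the search for libjri.so is recursive because the exact location inside the rJava directory is not stated here.

```python
import shutil
from pathlib import Path


def copy_libjri(rjava_dir, libswt_linux_dir):
    """Find libjri.so under rjava_dir and copy it into libswt_linux_dir.

    Returns the path of the copied file.
    """
    matches = sorted(Path(rjava_dir).rglob("libjri.so"))
    if not matches:
        raise FileNotFoundError(f"libjri.so not found under {rjava_dir}")
    destination = Path(libswt_linux_dir)
    if not destination.is_dir():
        raise NotADirectoryError(f"{destination} is not a directory")
    # copy2 keeps file metadata and returns the path of the new file.
    return Path(shutil.copy2(matches[0], destination))
```

Call it with the two directories from your own installation; writing into the PDI directory may require elevated permissions depending on where PDI is installed.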
Test - R
Start Pentaho Data Integration.
Create the following transformation.

Copy and paste the following R script into the R Executor step.
Click on the 'Test Script' button.

