Cross Join
Good old Cartesian Join ..
Workshop - Cross Join
The CARTESIAN JOIN or CROSS JOIN returns the Cartesian product of the sets of records from two or more joined tables. It is equivalent to an inner join where the join condition always evaluates to True, or where the join condition is absent from the statement. Basically, it's every possible combination.
In this workshop we'll be cross joining first names with middle names and again with our surname.
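As a rough illustration of what "every possible combination" means, here is a minimal Python sketch of a cross join. The name lists are hypothetical sample values, not the workshop's actual Data Grid contents.

```python
from itertools import product

# Hypothetical sample values; the workshop's Data Grid contents may differ.
first_names = ["James", "Oliver"]
middle_names = ["Lee", "Ray", "Max"]

# Cross join: pair every first name with every middle name.
pairs = list(product(first_names, middle_names))

for first, middle in pairs:
    print(first, middle)
```

With 2 first names and 3 middle names the result has 2 x 3 = 6 rows, which is why cross joins grow quickly as the inputs get larger.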



Create a new Transformation
Any one of these actions opens a new Transformation tab for you to begin designing your transformation.
By clicking File > New > Transformation
By using the CTRL-N hot key
You should already be familiar with the Data Grid step. Here it is used to list the values for first and middle names.


Joins all possible first_name, middle_name combinations together.

You can also add a condition to constrain the resulting dataset.
2 new fields are added to the data stream:
first_middle_name: concatenates the first and middle names.
initials: returns the 2-character initials.
The substring method has two variants. It returns a new string that is a substring of the original string. The substring begins with the character at the specified index and extends to the end of the string, or up to endIndex – 1 if the second argument is given.
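The substring behaviour described above can be sketched in Python with slicing. The helper name `substring`, the sample names and the space separator are assumptions for illustration only; the workshop's own step may build these fields differently.

```python
from typing import Optional


def substring(text: str, begin_index: int, end_index: Optional[int] = None) -> str:
    """Mimic the two variants: to the end of the string, or up to end_index - 1."""
    if end_index is None:
        return text[begin_index:]
    return text[begin_index:end_index]


# Hypothetical sample row.
first_name = "Oliver"
middle_name = "Ray"

# Assumes a single space as the separator between the two names.
first_middle_name = first_name + " " + middle_name

# First character of each name gives the 2-character initials.
initials = substring(first_name, 0, 1) + substring(middle_name, 0, 1)

print(first_middle_name, initials)
```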

Returns the value of the ${surname} parameter. This is set in the Parameters tab in Transformation properties.

Joins all possible first_middle_name, surname combinations together. A condition on initials also excludes a list of unwanted initials combinations from the output.
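A cross join with a condition behaves like a cross join followed by a filter. The sketch below assumes a hypothetical set of excluded initials and hypothetical sample rows; the actual condition used in the workshop is not shown here.

```python
from itertools import product

# Hypothetical rows from the earlier steps.
rows = [
    {"first_middle_name": "James Lee", "initials": "JL"},
    {"first_middle_name": "Oliver Ray", "initials": "OR"},
    {"first_middle_name": "Peter Ian", "initials": "PI"},
]
surnames = ["Green"]  # stands in for the ${surname} parameter value

# Hypothetical list of initials combinations to leave out.
excluded_initials = {"PI"}

# Cross join rows with surnames, keeping only rows that pass the condition.
joined = [
    {**row, "surname": surname}
    for row, surname in product(rows, surnames)
    if row["initials"] not in excluded_initials
]

for row in joined:
    print(row)
```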

Determine the order and selection of the data stream fields.

2 new fields are added to the data stream:
boys_initials: returns the baby's 3-character initials.
boys_name: concatenates first_name + middle_name + surname.

Returns 5 sampled records.
Uses the random seed ${seed}.
Reservoir Sampling allows you to select a set number of random records from a ‘reservoir’ of records whose size is not known beforehand.
Use a different seed value to get a different ‘set’; reusing the same seed reproduces the same sample.
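To show how a fixed-size sample can be drawn from a stream of unknown length, here is a minimal sketch of one common reservoir sampling scheme (often called Algorithm R). The function name and the sample data are illustrative assumptions; this is not necessarily how the sampling step is implemented internally.

```python
import random
from typing import Iterable, List, TypeVar

T = TypeVar("T")


def reservoir_sample(stream: Iterable[T], k: int, seed: int) -> List[T]:
    """Return up to k items chosen uniformly at random from a stream of unknown length."""
    rng = random.Random(seed)  # a seeded generator makes the sample repeatable
    reservoir: List[T] = []
    for i, item in enumerate(stream):
        if i < k:
            # Fill the reservoir with the first k items.
            reservoir.append(item)
        else:
            # Replace an existing entry with probability k / (i + 1).
            j = rng.randint(0, i)  # inclusive on both ends
            if j < k:
                reservoir[j] = item
    return reservoir


# Hypothetical stream of names; its length is never passed to the function.
names = (f"name_{n}" for n in range(1000))
sample = reservoir_sample(names, k=5, seed=42)
print(sample)
```

The stream is read once and only k items are held in memory, which is what makes the approach practical when the number of incoming records is not known in advance.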

RUN
The workshop illustrates the use of cross joins to create data sets with every possible combination - unless conditions are set. The final dataset is randomly selected using Reservoir Sampling - a common technique used in ML.
Click the Run button in the Canvas Toolbar.
Click on the Preview tab:

