# Partition

{% hint style="info" %}
Partitioning data in Pentaho Data Integration distributes rows into distinct subsets based on a specific rule, such as a field value or a hash function. This can improve the performance and scalability of your data integration jobs, especially when you have a large amount of data or multiple servers. Partitioning can also help you avoid data skew and resource underutilization.
{% endhint %}

{% tabs %}
{% tab title="Partition during data processing" %}
{% hint style="info" %}
By default, each step in a transformation runs in its own thread, in parallel with the other steps. With a single copy of each step, the data is read by the CSV file input step and then aggregated in the count by state step.
{% endhint %}

1. Open the transformation: \~/Workshop--Data-Integration/Labs/Module 5 - Enterprise Solution/Scalability/Demo - Partitioning/tr\_parallel\_reading\_and\_aggregation.ktr

{% hint style="info" %}
To take advantage of the processing resources on your server, you can scale up the transformation with the multi-threading option Change Number of Copies to Start, which starts multiple copies of a step (right-click the step to access the menu). The x2 notation on a step indicates that two copies will be started at runtime.
{% endhint %}

2. Change the number of copies for the CSV file input step to 2.

{% hint style="info" %} By default, rows move from the CSV file input step to the count by state step in round-robin order. With 'N' copies, the first copy gets the first row, the second copy gets the second row, and the Nth copy gets the Nth row. Row N+1 goes to the first copy again, and so on until there are no more rows to distribute. Reading the data from the CSV file is done in parallel. {% endhint %}
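The round-robin order described above can be illustrated outside PDI with a minimal Python sketch. The function name `round_robin` and the sample rows are hypothetical and only model the distribution rule, not PDI's internal implementation.

```python
def round_robin(rows, copies):
    """Distribute rows over `copies` buckets in round-robin order."""
    buckets = [[] for _ in range(copies)]
    for index, row in enumerate(rows):
        # Row 0 goes to copy 0, row 1 to copy 1, ..., then wrap around.
        buckets[index % copies].append(row)
    return buckets


# Hypothetical sample rows, distributed over two step copies.
sample_rows = ["row1", "row2", "row3", "row4", "row5"]
print(round_robin(sample_rows, 2))
# [['row1', 'row3', 'row5'], ['row2', 'row4']]
```

Note that the target copy depends only on the position of the row, never on its content.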

{% hint style="info" %} Attempting to aggregate in parallel, however, produces incorrect results because the rows are split arbitrarily (without a specific rule) over the two copies of the count by state aggregation step, as shown in the preview data. {% endhint %}
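To see why the split produces duplicates, here is a small Python sketch that mimics two copies of an aggregation step fed in round-robin order. The state values and the function name `aggregate_round_robin` are made up for illustration; the sketch models the behavior described above rather than PDI itself.

```python
from collections import Counter


def aggregate_round_robin(states, copies):
    """Count states independently in each copy after a round-robin split."""
    per_copy = [Counter() for _ in range(copies)]
    for index, state in enumerate(states):
        per_copy[index % copies][state] += 1
    # Each copy emits its own partial counts, so a state can appear twice.
    return [(state, count)
            for counter in per_copy
            for state, count in sorted(counter.items())]


# Hypothetical input: the true counts are NY=3, CA=2, TX=1.
states = ["NY", "CA", "CA", "NY", "TX", "NY"]
print(aggregate_round_robin(states, 2))
# [('CA', 1), ('NY', 1), ('TX', 1), ('CA', 1), ('NY', 2)]
```

Both 'CA' and 'NY' are reported twice with partial counts, because rows for the same state landed in different copies.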

Duplication due to round-robin aggregation

3. Preview the data and notice that some of the 'State counts' are duplicated. Why does this happen, and how would you solve the problem?
{% endtab %}

{% tab title="Partition Schema" %}
{% hint style="info" %}
This is where partitioning becomes useful: it applies a rule that directs rows from the same state to the same step copy, so that the rows are no longer split arbitrarily. In this example, a partition schema called 'State' was applied to the 'count by state' step, and the 'Remainder of division' partitioning rule was applied to the 'State' field. The count by state aggregation step now produces consistent, correct results because the rows are split according to the partition schema and rule.
{% endhint %}
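The effect of rule-based routing can be sketched in Python as follows. This is an illustration under stated assumptions, not PDI's implementation: the helper names `partition_for` and `aggregate_partitioned` are hypothetical, and because 'State' is a text field the sketch first turns it into a number with `zlib.crc32`, a stand-in hash chosen here only because it is stable between runs.

```python
import zlib
from collections import Counter


def partition_for(value, partitions):
    """Pick a partition number from the remainder of a division."""
    if isinstance(value, int):
        return abs(value) % partitions
    # Assumption: non-numeric values are hashed to a number first.
    return zlib.crc32(str(value).encode("utf-8")) % partitions


def aggregate_partitioned(states, partitions):
    """Count states per copy after routing each row by its state value."""
    per_copy = [Counter() for _ in range(partitions)]
    for state in states:
        per_copy[partition_for(state, partitions)][state] += 1
    return [(state, count)
            for counter in per_copy
            for state, count in sorted(counter.items())]


# Hypothetical input: the true counts are NY=3, CA=2, TX=1.
states = ["NY", "CA", "CA", "NY", "TX", "NY"]
print(aggregate_partitioned(states, 2))
```

Because the target copy is derived from the state value itself, every row for a given state reaches the same copy, and each state appears exactly once in the output with its full count.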

{% hint style="info" %}
So what's happening behind the scenes? Remainder of division is a partitioning method that assigns a row to a partition based on the remainder of dividing a field value by the number of partitions. For example, with four partitions and a field value of 13, the remainder is 1 (13 mod 4 = 1), so the row goes to partition 1. This method ensures that rows with the same field value end up in the same partition, which is useful for aggregation or grouping operations.
{% endhint %}
{% endtab %}

{% tab title="Partition Tables" %}
{% hint style="info" %}
The Table output step (double-click the step to open it) supports partitioning rows of data across different tables. When configured to accept the table name from a Partitioning field, the PDI client writes each row to the appropriate table. You can also select Partition data per month or Partition data per day. To ensure that all the necessary tables exist, it's recommended to create them in a separate transformation. You can choose from different partitioning methods, such as remainder of division, binary tree, mirror sequence, or custom.
{% endhint %}
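As a rough illustration of per-month and per-day table partitioning, the Python sketch below derives a target table name from a date value. The function name `partitioned_table_name`, the base table name `sales`, and the `base_yyyyMM` / `base_yyyyMMdd` naming pattern are assumptions made for this example; check the Table output step documentation for the exact names PDI generates.

```python
from datetime import date


def partitioned_table_name(base, day, per_month=True):
    """Build a table name from a base name and a date (assumed pattern)."""
    # Assumed convention: monthly tables end in yyyyMM, daily in yyyyMMdd.
    suffix = day.strftime("%Y%m") if per_month else day.strftime("%Y%m%d")
    return f"{base}_{suffix}"


# Hypothetical base table and partitioning date.
order_date = date(2024, 3, 15)
print(partitioned_table_name("sales", order_date))                   # sales_202403
print(partitioned_table_name("sales", order_date, per_month=False))  # sales_20240315
```

A helper like this can also be used to list the tables that the separate table-creation transformation needs to create in advance.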

{% endtab %}
{% endtabs %}