Pentaho Data Integration: A Tool to Have on Your Toolbelt

Today I wanna share something nice I've been using: Pentaho Data Integration (aka Kettle).

As the name says, PDI is a powerful tool for manipulating and integrating data accross multiple souces (databases, files, APIs, CRMs etc).

This process of extracting, transforming and loading data is called ETL, and PDI is meant for that.

How it works

The building blocks of PDI are: job and transformation.

A transformation is where the magic happens and a job is a set of transformations.

Both can be stored locally or in a remote repository accessible to all your teammates.

You can also have a remote server responsible for running your jobs in a scheduler like crontab.


PDI is a paid product, but they provide a free community edition (CE), which is amazing by the way (I've been using just that so far).

There are many CE versions available, enter here and download the one you want (generally the last one).

PDI requires Java Runtime Environment (JRE), so make sure it's installed.

PDI does not require installation. Simply unpack the zip file into a folder of your choice and click Data Integration to open.

Check this link for more info.

Use case: crossing data

Imagine you have two applications with distinct databases: one PostgreSQL (A) and other SQL Server (B) with the following tables:

How can we cross customers from the two databases using the email attribute?

First, we need to create a transformation.

Open PDI and create a new transformation (File -> New -> Transformation).

This is how a blank transformation looks like.

In the sidebar we have all the available steps and in the big empty space, is where we drag and drop the steps we wanna use.

To join the tables, we need to input them first, so drag a Table input step to the transformation, double-click it and type the SQL query you wanna execute.

You will have to create a database connection first, just click New in front of Connection and enter the credentials.

We need to order by email because this field will be used to join the tables.

Drag another Table input for the SQL Server database and do the same.

Now we need to join them, under the Joins folder, drag the Merge Join and select the steps, join type and the keys to join (in our case, email).

Once you are done, run the transformation (click the play button or Action -> Run). It will ask for a name, fill and save.

In Execution Results you can see a bunch of details related to your run, in Preview data, we can check the merge was done as expected.

Note that we are not outputting the result of our transformation (to a .xlsx file for example), but you could do this by dragging a Output step and setting it as you wish.

Wrapping up

Pentaho Data Integration is a fantastic tool that will open you a million of possibilities.

This post is just to introduce it to you. Explore others steps and have fun.

Written on December 30, 2017