Pentaho Data Integration: A Tool to Have on Your Toolbelt
Today I wanna share something nice I've been using: Pentaho Data Integration (aka Kettle).
As the name says, PDI is a powerful tool for manipulating and integrating data accross multiple souces (databases, files, APIs, CRMs etc).
This process of extracting, transforming and loading data is called ETL, and PDI is meant for that.
How it works
The building blocks of PDI are: job and transformation.
A transformation is where the magic happens and a job is a set of transformations.
Both can be stored locally or in a remote repository accessible to all your teammates.
You can also have a remote server responsible for running your jobs in a scheduler like crontab.
Setup
PDI is a paid product, but they provide a free community edition (CE), which is amazing by the way (I've been using just that so far).
There are many CE versions available, enter here and download the one you want (generally the last one).
PDI requires Java Runtime Environment (JRE), so make sure it's installed.
PDI does not require installation. Simply unpack the zip file into a folder of your choice and click Data Integration
to open.
Use case: crossing data
Imagine you have two applications with distinct databases: one PostgreSQL (A) and other SQL Server (B) with the following tables:
How can we cross customers from the two databases using the email attribute?
First, we need to create a transformation.
Open PDI and create a new transformation (File -> New -> Transformation).
This is how a blank transformation looks like.
In the sidebar we have all the available steps and in the big empty space, is where we drag and drop the steps we wanna use.
To join the tables, we need to input them first, so drag a Table input
step to the transformation, double-click it and type the SQL query you wanna execute.
You will have to create a database connection first, just click New in front of Connection and enter the credentials.
Drag another Table input
for the SQL Server database and do the same.
Now we need to join them, under the Joins
folder, drag the Merge Join
and select the steps, join type and the keys to join (in our case, email).
Once you are done, run the transformation (click the play button or Action -> Run). It will ask for a name, fill and save.
In Execution Results you can see a bunch of details related to your run, in Preview data, we can check the merge was done as expected.
Note that we are not outputting the result of our transformation (to a .xlsx file for example), but you could do this by dragging a Output
step and setting it as you wish.
Wrapping up
Pentaho Data Integration is a fantastic tool that will open you a million of possibilities.
This post is just to introduce it to you. Explore others steps and have fun.