With the majority of ETL projects we don’t expect the source system structure to change very often. Sure, we’re used to varying structures from multiple sources, but if we had a structure that kept on changing, our ETL would fail to load. What if the user of the source system added a new column they wanted loaded into the data warehouse? What if the user DROPPED a column from the source system that our ETL was configured to load? This would cause all sorts of failures and changes to the ETL, and in our case, on a daily basis.
This is exactly what happened on our latest project. To overcome these issues we created ETL that relied heavily on Meta Data. We would track the changes to the source system structure and execute our transformations using this Meta Data, which meant creating transformations “on the fly”. Our select steps might pull in ALL the data from the source system, select ONLY the fields that we want, and load them. Luckily for us, Pentaho Data Integration comes with a handy little step that allows us to inject Meta Data into other transformations. Would you believe it’s called the Meta Data Injection step?
Let’s take a look at a simple example of where and how we might use the Meta Data Injection step.
In the screenshot above you will see a transformation that loads ALL the fields from the Products table, followed by a Select Values step and finally a Table Output step. I know what you might be thinking… why bother with the Select Values step? And you’d have a point: if we were simply to load ALL fields in the Table Input (select * from…) and send the data straight to the Table Output, it wouldn’t matter how many fields were loaded. The Table Output step would just insert every field it finds in the stream into the target table.
In this case, we have the Select Values step in order to RENAME fields. A source field in our example is called U_PROD_ID; in the Data Warehouse, the same field is called PRODUCT_ID. This mapping of U_PROD_ID to PRODUCT_ID has already been saved into another table that manages our ETL Meta Data. Since we already have this mapping of source fields to DWH fields in another table, we can create a parent transformation that injects that Meta Data into the Select Values step.
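To make the idea concrete, here is what that select-and-rename behaviour boils down to, sketched in plain JavaScript outside of PDI. This is only an illustration of the concept, not the Kettle API; the mapping structure and any field names beyond U_PROD_ID/PRODUCT_ID are made up for the example.

```javascript
// Mapping rows as they might come from the ETL Meta Data table
// (the structure and U_PROD_NAME/PRODUCT_NAME are illustrative only).
var mapping = [
  { sourceField: 'U_PROD_ID',   targetField: 'PRODUCT_ID' },
  { sourceField: 'U_PROD_NAME', targetField: 'PRODUCT_NAME' }
];

// What the injected Select Values step effectively does: keep only the
// mapped fields and rename them on the way through to the Table Output.
function selectAndRename(row, mapping) {
  var out = {};
  for (var i = 0; i < mapping.length; i++) {
    var m = mapping[i];
    if (m.sourceField in row) {
      out[m.targetField] = row[m.sourceField];
    }
  }
  return out;
}

// A source row with an extra column the warehouse does not want:
var sourceRow = { U_PROD_ID: 42, U_PROD_NAME: 'Widget', U_INTERNAL_FLAG: 'X' };
var dwhRow = selectAndRename(sourceRow, mapping);
// dwhRow is { PRODUCT_ID: 42, PRODUCT_NAME: 'Widget' }
```

Because the mapping rows come from a table rather than being hard-coded, the same sub transformation keeps working when the source system adds or drops columns.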
This transformation loads the source field names and target field names from the Meta Data table. It then sends these two fields to the Meta Data Injection step. The Meta Data Injection step looks in the sub transformation for steps whose values it can replace. In the screenshot below we can see that it has detected the Select Values step, and we are mapping the Source Field Name and Target Field Name into its Select and Rename fields.
So what’s happening here? When we run this transformation, it will get the Meta Data from our database, pass it to the sub transformation and run it. Now, in our sub transformation, the Select Values step will be populated with all the fields we want to select from the source system and RENAME them before sending the rows of data to the Table Output!
But not all is good in Meta Data Injection Town… At this point in time (Pentaho version 5.0 has just been released) the Meta Data Injection step only supports a limited set of steps it can inject Meta Data into.
One of the issues we had was a dynamic structure for the Update step, and the Meta Data Injection step does not support the Update step. We need a workaround! Welcome to the Kettle API.
In this example, we load Meta Data from some location and edit the XML of a transformation on the fly. This way, we can change the configuration of the Update step in a sub transformation and then execute it.
// get the transformation meta data using the transformation's filename
var transMeta = new Packages.org.pentaho.di.trans.TransMeta(filename);
// get an array of available steps in the transformation
var steps = transMeta.getStepsArray();
// define an empty variable to hold the Update step meta
var updateStep = null;
// loop through all the steps in the transformation looking for the Update step by NAME
// (here we assume the step is named "Update" in the sub transformation)
for (var i = 0; i < steps.length; i++) {
  if (steps[i].getName().equals("Update")) {
    updateStep = steps[i];
  }
}
// get the step meta interface from the step we found
var updateStepMeta = updateStep.getStepMetaInterface();
// set the Update step's update lookup field to Product_Category
// (Rhino will not convert a JS array to String[], so build a real Java array)
var lookup = Packages.java.lang.reflect.Array.newInstance(java.lang.String, 1);
lookup[0] = "Product_Category";
updateStepMeta.setUpdateLookup(lookup);
// save the resulting transformation back to its XML file
var dos = new Packages.java.io.DataOutputStream(
    new Packages.java.io.FileOutputStream(filename));
dos.write(transMeta.getXML().getBytes("UTF-8"));
dos.close();